Ubuntu - GPU - Troubleshooting - *ERROR* ring gfx_0.0.0 timeout

Random crashes occur when using a browser or icaclient (Citrix client), and the kernel log shows errors such as:

[   85.861734] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx_0.0.0 timeout, signaled seq=13365, emitted seq=13367 
[   85.862162] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process kwin_x11 pid 819 thread kwin_x11:cs0 pid 838

Fix

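This is often a power saving issue. As a quick workaround, force the GPU to its highest performance level (run as root; the setting is lost on reboot):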

echo high > /sys/class/drm/card0/device/power_dpm_force_performance_level
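The write above can be wrapped in a small helper that validates the value and makes it easy to switch back afterwards. This is a minimal sketch: set_dpm_level is a hypothetical name, card0 is assumed to be the AMD GPU, and only the auto, low and high levels are handled.

```shell
#!/bin/sh
# set_dpm_level LEVEL [NODE]  (hypothetical helper, not part of any package)
# Writes LEVEL to the DPM performance-level node and prints the value read back.
# NODE defaults to card0, which is assumed to be the AMD GPU.
set_dpm_level() {
    level=$1
    node=${2:-/sys/class/drm/card0/device/power_dpm_force_performance_level}
    case $level in
        auto|low|high) ;;
        *) echo "unsupported level: $level" >&2; return 1 ;;
    esac
    # Writing the real sysfs node requires root.
    printf '%s\n' "$level" > "$node" || return 1
    cat "$node"
}

# Usage (as root):
#   set_dpm_level high   # force the highest clocks while testing
#   set_dpm_level auto   # hand control back to the driver afterwards
```

Switching back to auto once the test is finished restores normal power saving.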

An alternative workaround

Use the amdgpu.ppfeaturemask kernel parameter to narrow down which power feature is causing the problem. The power features the card currently has enabled can be listed with:

cat /sys/class/drm/card0/device/pp_features

returns:

features high: 0x0003ebb8 low: 0x71ffffff
No. Feature               Bit : State
00. FW_DATA_READ         ( 0) : enabled
01. DPM_GFXCLK           ( 1) : enabled
02. DPM_GFX_POWER_OPTIMIZER ( 2) : enabled
03. DPM_UCLK             ( 3) : enabled
04. DPM_FCLK             ( 4) : enabled
05. DPM_SOCCLK           ( 5) : enabled
06. DPM_MP0CLK           ( 6) : enabled
07. DPM_LINK             ( 7) : enabled
08. DPM_DCN              ( 8) : enabled
09. VMEMP_SCALING        ( 9) : enabled
10. VDDIO_MEM_SCALING    (10) : enabled
11. DS_GFXCLK            (11) : enabled
12. DS_SOCCLK            (12) : enabled
13. DS_FCLK              (13) : enabled
14. DS_LCLK              (14) : enabled
15. DS_DCFCLK            (15) : enabled
16. DS_UCLK              (16) : enabled
17. GFX_ULV              (17) : enabled
18. FW_DSTATE            (18) : enabled
19. GFXOFF               (19) : enabled
20. BACO                 (20) : enabled
21. MM_DPM               (21) : enabled
22. SOC_MPCLK_DS         (22) : enabled
23. BACO_MPCLK_DS        (23) : enabled
24. THROTTLERS           (24) : enabled
25. SMARTSHIFT           (25) : disabled
26. GTHR                 (26) : disabled
27. ACDC                 (27) : disabled
28. VR0HOT               (28) : enabled
29. FW_CTF               (29) : enabled
30. FAN_CONTROL          (30) : enabled
31. GFX_DCS              (31) : disabled
32. GFX_READ_MARGIN      (32) : disabled
33. LED_DISPLAY          (33) : disabled
34. GFXCLK_SPREAD_SPECTRUM (34) : disabled
35. OUT_OF_BAND_MONITOR  (35) : enabled
36. OPTIMIZED_VMIN       (36) : enabled
37. GFX_IMU              (37) : enabled
38. BOOT_TIME_CAL        (38) : disabled
39. GFX_PCC_DFLL         (39) : enabled
40. SOC_CG               (40) : enabled
41. DF_CSTATE            (41) : enabled
42. GFX_EDC              (42) : disabled
43. BOOT_POWER_OPT       (43) : enabled
44. CLOCK_POWER_DOWN_BYPASS (44) : disabled
45. DS_VCN               (45) : enabled
46. BACO_CG              (46) : enabled
47. MEM_TEMP_READ        (47) : enabled
48. ATHUB_MMHUB_PG       (48) : enabled
49. SOC_PCC              (49) : enabled
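With around fifty rows, it can help to filter the listing by state. The following is a rough sketch that assumes the row layout shown above (number, name, bit, state); list_pp_features is a hypothetical helper name, and card0 is assumed to be the AMD GPU.

```shell
#!/bin/sh
# list_pp_features STATE [FILE]  (hypothetical helper)
# Prints the names of all features whose state column equals STATE
# ("enabled" or "disabled"). FILE defaults to the pp_features node of card0.
list_pp_features() {
    state=$1
    file=${2:-/sys/class/drm/card0/device/pp_features}
    # Feature rows start with "NN." and end with the state;
    # the two header lines do not match and are skipped.
    awk -v want="$state" '$1 ~ /^[0-9]+\.$/ && $NF == want { print $2 }' "$file"
}

# Usage:
#   list_pp_features disabled
#   list_pp_features enabled | wc -l
```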

Change Kernel Boot Parameters

Try adding amdgpu.ppfeaturemask to the kernel boot parameters in /etc/default/grub:

/etc/default/grub
GRUB_CMDLINE_LINUX_DEFAULT="quiet splash amdgpu.ppfeaturemask=0xfffd3fff"

Then update GRUB:

sudo update-grub

Then reboot, and check if the fault happens again.
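After the reboot it is worth confirming that the parameter actually reached the kernel before drawing conclusions. A minimal sketch follows; kernel_param and count_ring_timeouts are hypothetical helper names, and the log check assumes the kernel log is available through journalctl.

```shell
#!/bin/sh
# kernel_param NAME [FILE]  (hypothetical helper)
# Prints the value of NAME=... from the kernel command line.
kernel_param() {
    name=$1
    file=${2:-/proc/cmdline}
    # One parameter per line, then strip the "NAME=" prefix.
    tr ' ' '\n' < "$file" | sed -n "s/^$name=//p"
}

# count_ring_timeouts  (hypothetical helper)
# Counts gfx ring timeout messages in a kernel log read from stdin.
count_ring_timeouts() {
    grep -c 'ring gfx_.* timeout'
}

# Usage:
#   kernel_param amdgpu.ppfeaturemask
#   cat /sys/module/amdgpu/parameters/ppfeaturemask
#   journalctl -k -b | count_ring_timeouts
```

A count of 0 for the current boot after a normal working session suggests that one of the masked features was involved.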

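NOTE: The mask 0xfffd3fff used above leaves every feature bit set except the three listed below, so the fault may be caused by one or more of them. These are PP_FEATURE_MASK values from amd_shared.h (see References), not the bit numbers printed by pp_features: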

PP_OVERDRIVE_MASK = 0x4000,
PP_GFXOFF_MASK = 0x8000,
PP_STUTTER_MODE = 0x20000,
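To test the features one at a time, a mask can be built by clearing individual bits from an all-ones value, which is how 0xfffd3fff comes about. A small sketch, where ppfeaturemask_without is a hypothetical helper name and starting from 0xffffffff is an assumption that matches the value used in this page:

```shell
#!/bin/sh
# ppfeaturemask_without BIT...  (hypothetical helper)
# Prints a 32-bit mask with all bits set except the given ones.
ppfeaturemask_without() {
    mask=$(( 0xffffffff ))
    for bit in "$@"; do
        mask=$(( $mask & ~$bit ))
    done
    printf '0x%08x\n' "$mask"
}

# Overdrive, GFXOFF and stutter mode cleared together:
ppfeaturemask_without 0x4000 0x8000 0x20000   # prints 0xfffd3fff
# Only GFXOFF cleared:
ppfeaturemask_without 0x8000                  # prints 0xffff7fff
```

Booting with one bit cleared at a time shows which of the three features triggers the timeout.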

Use Mesa Environment Parameters to identify the cause

Add RADV_DEBUG=hang to /etc/environment, log in again so that the variable takes effect, then try triggering the fault again.

This dumps a report to $HOME/radv_dumps_<pid>_<time> if a GPU hang is detected.
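The variable can also be set for a single program instead of system-wide, and the newest dump directory can be located afterwards. This is a sketch only: latest_radv_dump is a hypothetical helper name, and vkcube is just an assumed example application. RADV is Mesa's Vulkan driver, so only Vulkan applications are affected by this variable.

```shell
#!/bin/sh
# latest_radv_dump [DIR]  (hypothetical helper)
# Prints the most recently modified radv_dumps_* directory below DIR
# (default: $HOME), or nothing if no dump exists.
latest_radv_dump() {
    dir=${1:-$HOME}
    # -d: list the directories themselves, -t: newest first
    ls -dt "$dir"/radv_dumps_* 2>/dev/null | head -n 1
}

# Usage:
#   RADV_DEBUG=hang vkcube    # run one application with hang detection
#   latest_radv_dump          # then look for the resulting report
```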

Report the error to the Mesa team and attach the dump.


References

https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/gpu/drm/amd/include/amd_shared.h#n199

https://docs.mesa3d.org/envvars.html