Random crashes when using a browser or icaclient (Citrix client).
[ 85.861734] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx_0.0.0 timeout, signaled seq=13365, emitted seq=13367
[ 85.862162] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process kwin_x11 pid 819 thread kwin_x11:cs0 pid 838
It is often a power-saving issue. As a quick test, force the highest performance level (run as root; the setting is lost on reboot):
echo high > /sys/class/drm/card0/device/power_dpm_force_performance_level
Use the amdgpu.ppfeaturemask kernel parameter to narrow down which power feature is causing problems. To list the current feature states:
cat /sys/class/drm/card0/device/pp_features
which returns, for example:
features high: 0x0003ebb8 low: 0x71ffffff
No. Feature Bit : State
00. FW_DATA_READ ( 0) : enabled
01. DPM_GFXCLK ( 1) : enabled
02. DPM_GFX_POWER_OPTIMIZER ( 2) : enabled
03. DPM_UCLK ( 3) : enabled
04. DPM_FCLK ( 4) : enabled
05. DPM_SOCCLK ( 5) : enabled
06. DPM_MP0CLK ( 6) : enabled
07. DPM_LINK ( 7) : enabled
08. DPM_DCN ( 8) : enabled
09. VMEMP_SCALING ( 9) : enabled
10. VDDIO_MEM_SCALING (10) : enabled
11. DS_GFXCLK (11) : enabled
12. DS_SOCCLK (12) : enabled
13. DS_FCLK (13) : enabled
14. DS_LCLK (14) : enabled
15. DS_DCFCLK (15) : enabled
16. DS_UCLK (16) : enabled
17. GFX_ULV (17) : enabled
18. FW_DSTATE (18) : enabled
19. GFXOFF (19) : enabled
20. BACO (20) : enabled
21. MM_DPM (21) : enabled
22. SOC_MPCLK_DS (22) : enabled
23. BACO_MPCLK_DS (23) : enabled
24. THROTTLERS (24) : enabled
25. SMARTSHIFT (25) : disabled
26. GTHR (26) : disabled
27. ACDC (27) : disabled
28. VR0HOT (28) : enabled
29. FW_CTF (29) : enabled
30. FAN_CONTROL (30) : enabled
31. GFX_DCS (31) : disabled
32. GFX_READ_MARGIN (32) : disabled
33. LED_DISPLAY (33) : disabled
34. GFXCLK_SPREAD_SPECTRUM (34) : disabled
35. OUT_OF_BAND_MONITOR (35) : enabled
36. OPTIMIZED_VMIN (36) : enabled
37. GFX_IMU (37) : enabled
38. BOOT_TIME_CAL (38) : disabled
39. GFX_PCC_DFLL (39) : enabled
40. SOC_CG (40) : enabled
41. DF_CSTATE (41) : enabled
42. GFX_EDC (42) : disabled
43. BOOT_POWER_OPT (43) : enabled
44. CLOCK_POWER_DOWN_BYPASS (44) : disabled
45. DS_VCN (45) : enabled
46. BACO_CG (46) : enabled
47. MEM_TEMP_READ (47) : enabled
48. ATHUB_MMHUB_PG (48) : enabled
49. SOC_PCC (49) : enabled
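With around fifty entries, it can help to filter the list by state. The following is a minimal sketch, not an official tool: the function name pp_features_in_state is made up for this note, and it assumes the line layout shown in the output above (number, name, bit, state).

```shell
# List the names of all features in the given state ("enabled" or "disabled").
# Reads pp_features output on stdin; header lines are skipped because their
# first field is not a number followed by a dot.
pp_features_in_state() {
    awk -v state="$1" '$1 ~ /^[0-9]+\.$/ && $NF == state { print $2 }'
}

# Demo on a few lines copied from the output above.
pp_features_in_state disabled <<'EOF'
features high: 0x0003ebb8 low: 0x71ffffff
No. Feature Bit : State
19. GFXOFF (19) : enabled
25. SMARTSHIFT (25) : disabled
26. GTHR (26) : disabled
EOF

# On a real system (card0 is an assumption; the card index may differ):
# pp_features_in_state disabled < /sys/class/drm/card0/device/pp_features
```

Saving the list before and after changing the mask and comparing the two with diff is one way to see what the parameter actually changed.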
Try adding amdgpu.ppfeaturemask to the kernel boot parameters in /etc/default/grub:
GRUB_CMDLINE_LINUX_DEFAULT="quiet splash amdgpu.ppfeaturemask=0xfffd3fff"
Then update GRUB:
sudo update-grub
Then reboot, and check if the fault happens again.
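After the reboot, two quick checks are whether the parameter actually reached the kernel and whether the ring timeout shows up in the kernel log again. A rough sketch follows; both helper names are invented for this note, and the grep pattern is based only on the log lines quoted at the top.

```shell
# Print the amdgpu.ppfeaturemask value found in a kernel command line (stdin).
cmdline_ppfeaturemask() {
    tr ' ' '\n' | sed -n 's/^amdgpu\.ppfeaturemask=//p'
}

# Count amdgpu ring timeouts in a kernel log (stdin). grep -c exits non-zero
# when the count is 0, so "|| true" keeps the function's status clean.
count_ring_timeouts() {
    grep -c 'amdgpu_job_timedout.*ring .* timeout' || true
}

# Demo on sample input taken from this note.
echo 'quiet splash amdgpu.ppfeaturemask=0xfffd3fff' | cmdline_ppfeaturemask
echo '[ 85.861734] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx_0.0.0 timeout, signaled seq=13365, emitted seq=13367' | count_ring_timeouts

# On a real system:
# cmdline_ppfeaturemask < /proc/cmdline
# sudo dmesg | count_ring_timeouts
```

An empty result from the first check suggests the GRUB change was not picked up; a count of 0 from the second only means no timeout has been logged since boot, not that the fault is gone.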
NOTE: The mask 0xfffd3fff clears the bits of the following three features, one or more of which may be causing the fault:
PP_OVERDRIVE_MASK = 0x4000
PP_GFXOFF_MASK = 0x8000
PP_STUTTER_MODE = 0x20000
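The value 0xfffd3fff is what remains when these three bits are cleared from an all-ones 32-bit mask. The sketch below reproduces that arithmetic so that single features can be tested one at a time; the function name ppfeaturemask_without is made up for this note, and starting from 0xffffffff is an assumption taken from the example value, not necessarily the driver default.

```shell
# Print a 32-bit mask with every bit set except the ones passed as arguments.
ppfeaturemask_without() {
    mask=$(( 0xffffffff ))
    for bit in "$@"; do
        mask=$(( mask & ~$bit ))
    done
    printf '0x%08x\n' "$mask"
}

PP_OVERDRIVE_MASK=0x4000
PP_GFXOFF_MASK=0x8000
PP_STUTTER_MODE=0x20000

# All three features cleared: prints 0xfffd3fff, the value used in GRUB above.
ppfeaturemask_without "$PP_OVERDRIVE_MASK" "$PP_GFXOFF_MASK" "$PP_STUTTER_MODE"

# Only GFXOFF cleared: prints 0xffff7fff.
ppfeaturemask_without "$PP_GFXOFF_MASK"
```

If the fault disappears with all three bits cleared, booting with one bit cleared at a time should show which feature is responsible.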
Add RADV_DEBUG=hang to /etc/environment, log out and back in so that the variable takes effect, then try to trigger the fault again.
This dumps a report to $HOME/radv_dumps_<pid>_<time> if a GPU hang is detected.
Report the error to the Mesa Team.
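To find the report after a hang, something like the following may help. It is only a sketch: the helper name latest_radv_dump is invented here, and it relies solely on the radv_dumps_<pid>_<time> naming mentioned above.

```shell
# Print the most recently modified RADV hang dump directory under the given
# base directory (default: $HOME), or nothing if no dump exists.
latest_radv_dump() {
    ls -dt "${1:-$HOME}"/radv_dumps_* 2>/dev/null | head -n 1
}

latest_radv_dump

# To limit the setting to one program instead of the whole session, the
# variable can be set on the command line (some_app is a placeholder):
# RADV_DEBUG=hang some_app
```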
NOTE: See https://docs.mesa3d.org/envvars.html