Hi,

Starting with kernel version 6.18, I'm experiencing frequent GPU failures and resets that make the computer nearly unusable. The screen flickers and then either goes black (most cases) or recovers (fewer cases). Even if I switch to another GPU and use the Radeon GPU only for rendering, it can still fail and eventually kill the app running on it. The problem isn't present in 6.17.

My dmesg logs show something like this (a successful reset):

[ 585.109939] amdgpu 0000:06:00.0: amdgpu: Dumping IP State
[ 585.111758] amdgpu 0000:06:00.0: amdgpu: Dumping IP State Completed
[ 585.111839] amdgpu 0000:06:00.0: amdgpu: [drm] AMDGPU device coredump file has been created
[ 585.111841] amdgpu 0000:06:00.0: amdgpu: [drm] Check your /sys/class/drm/card2/device/devcoredump/data
[ 585.111844] amdgpu 0000:06:00.0: amdgpu: ring gfx_0.1.0 timeout, signaled seq=31692, emitted seq=31694
[ 585.111847] amdgpu 0000:06:00.0: amdgpu: Process kwin_wayland pid 114 thread kwin_wayla:cs0 pid 514
[ 585.111849] amdgpu 0000:06:00.0: amdgpu: Starting gfx_0.1.0 ring reset
[ 585.269485] amdgpu 0000:06:00.0: amdgpu: Ring gfx_0.1.0 reset failed
[ 585.269490] amdgpu 0000:06:00.0: amdgpu: GPU reset begin!. Source: 1
[ 585.331433] amdgpu 0000:06:00.0: amdgpu: MODE2 reset
[ 585.338731] amdgpu 0000:06:00.0: amdgpu: GPU reset succeeded, trying to resume
[ 585.339090] [drm] PCIE GART of 1024M enabled (table at 0x000000F41FC00000).
[ 585.339113] amdgpu 0000:06:00.0: amdgpu: PSP is resuming...
[ 585.361053] amdgpu 0000:06:00.0: amdgpu: reserve 0xa00000 from 0xf41e000000 for PSP TMR
[ 585.593433] amdgpu 0000:06:00.0: amdgpu: RAS: optional ras ta ucode is not available
[ 585.602279] amdgpu 0000:06:00.0: amdgpu: RAP: optional rap ta ucode is not available
[ 585.602281] amdgpu 0000:06:00.0: amdgpu: SECUREDISPLAY: optional securedisplay ta ucode is not available
[ 585.602282] amdgpu 0000:06:00.0: amdgpu: SMU is resuming...
[ 585.602569] amdgpu 0000:06:00.0: amdgpu: SMU is resumed successfully!
[ 585.602750] amdgpu 0000:06:00.0: amdgpu: kiq ring mec 2 pipe 1 q 0
[ 585.607508] amdgpu 0000:06:00.0: amdgpu: [drm] DMUB hardware initialized: version=0x05002C00
[ 585.880737] amdgpu 0000:06:00.0: amdgpu: ring gfx_0.0.0 uses VM inv eng 0 on hub 0
[ 585.880742] amdgpu 0000:06:00.0: amdgpu: ring gfx_0.1.0 uses VM inv eng 1 on hub 0
[ 585.880743] amdgpu 0000:06:00.0: amdgpu: ring comp_1.0.0 uses VM inv eng 4 on hub 0
[ 585.880744] amdgpu 0000:06:00.0: amdgpu: ring comp_1.1.0 uses VM inv eng 5 on hub 0
[ 585.880745] amdgpu 0000:06:00.0: amdgpu: ring comp_1.2.0 uses VM inv eng 6 on hub 0
[ 585.880746] amdgpu 0000:06:00.0: amdgpu: ring comp_1.3.0 uses VM inv eng 7 on hub 0
[ 585.880747] amdgpu 0000:06:00.0: amdgpu: ring comp_1.0.1 uses VM inv eng 8 on hub 0
[ 585.880748] amdgpu 0000:06:00.0: amdgpu: ring comp_1.1.1 uses VM inv eng 9 on hub 0
[ 585.880749] amdgpu 0000:06:00.0: amdgpu: ring comp_1.2.1 uses VM inv eng 10 on hub 0
[ 585.880751] amdgpu 0000:06:00.0: amdgpu: ring comp_1.3.1 uses VM inv eng 11 on hub 0
[ 585.880752] amdgpu 0000:06:00.0: amdgpu: ring kiq_0.2.1.0 uses VM inv eng 12 on hub 0
[ 585.880753] amdgpu 0000:06:00.0: amdgpu: ring sdma0 uses VM inv eng 13 on hub 0
[ 585.880754] amdgpu 0000:06:00.0: amdgpu: ring vcn_dec_0 uses VM inv eng 0 on hub 8
[ 585.880755] amdgpu 0000:06:00.0: amdgpu: ring vcn_enc_0.0 uses VM inv eng 1 on hub 8
[ 585.880756] amdgpu 0000:06:00.0: amdgpu: ring vcn_enc_0.1 uses VM inv eng 4 on hub 8
[ 585.880757] amdgpu 0000:06:00.0: amdgpu: ring jpeg_dec uses VM inv eng 5 on hub 8
[ 585.884345] amdgpu 0000:06:00.0: amdgpu: GPU reset(1) succeeded!
[ 585.884371] amdgpu 0000:06:00.0: [drm] device wedged, but recovered through reset
[ 585.897300] amdgpu 0000:06:00.0: amdgpu: [drm] *ERROR* Failed to initialize parser -125!

I'm on an ASUS laptop with Ryzen 7940HX/Radeon 610M. I'm using a distribution kernel, but the maintainers are slow to respond, so forgive me for sending messages here.

I boot with the kernel command-line parameter amdgpu.dcdebugmask=0x10 to work around kernel lockups, a separate problem that has been around since ~6.12.

I've collected more dmesg logs beyond what's shown above, as well as device coredumps from /sys/class/drm/card/device/devcoredump/data. I'm also happy to help bisect the problem if the bisect isn't too large. Let me know how I can help.

Best regards,
Yunchen
