On Mon, Dec 8, 2025 at 8:59 AM Mack Wang <[email protected]> wrote:
>
> Hi,
>
> Starting from kernel version 6.18 I'm experiencing frequent failures and
> resets of the GPU, rendering the computer nearly unusable. The screen would
> flicker, and eventually blackout (most of the cases) or recover (fewer cases).
> Even if I switch to another GPU and have Radeon GPU only for rendering, it can
> fail and eventually kill the app that is running on it. The problem isn't
> present in 6.17.
>
> My dmesg logs show something like this (a successful reset):
>
> [ 585.109939] amdgpu 0000:06:00.0: amdgpu: Dumping IP State
> [ 585.111758] amdgpu 0000:06:00.0: amdgpu: Dumping IP State Completed
> [ 585.111839] amdgpu 0000:06:00.0: amdgpu: [drm] AMDGPU device coredump file
> has been created
> [ 585.111841] amdgpu 0000:06:00.0: amdgpu: [drm] Check your
> /sys/class/drm/card2/device/devcoredump/data
> [ 585.111844] amdgpu 0000:06:00.0: amdgpu: ring gfx_0.1.0 timeout, signaled
> seq=31692, emitted seq=31694
> [ 585.111847] amdgpu 0000:06:00.0: amdgpu: Process kwin_wayland pid 114
> thread kwin_wayla:cs0 pid 514
> [ 585.111849] amdgpu 0000:06:00.0: amdgpu: Starting gfx_0.1.0 ring reset
> [ 585.269485] amdgpu 0000:06:00.0: amdgpu: Ring gfx_0.1.0 reset failed
> [ 585.269490] amdgpu 0000:06:00.0: amdgpu: GPU reset begin!. Source: 1
> [ 585.331433] amdgpu 0000:06:00.0: amdgpu: MODE2 reset
> [ 585.338731] amdgpu 0000:06:00.0: amdgpu: GPU reset succeeded, trying to
> resume
> [ 585.339090] [drm] PCIE GART of 1024M enabled (table at 0x000000F41FC00000).
> [ 585.339113] amdgpu 0000:06:00.0: amdgpu: PSP is resuming...
> [ 585.361053] amdgpu 0000:06:00.0: amdgpu: reserve 0xa00000 from 0xf41e000000
> for PSP TMR
> [ 585.593433] amdgpu 0000:06:00.0: amdgpu: RAS: optional ras ta ucode is not
> available
> [ 585.602279] amdgpu 0000:06:00.0: amdgpu: RAP: optional rap ta ucode is not
> available
> [ 585.602281] amdgpu 0000:06:00.0: amdgpu: SECUREDISPLAY: optional
> securedisplay ta ucode is not available
> [ 585.602282] amdgpu 0000:06:00.0: amdgpu: SMU is resuming...
> [ 585.602569] amdgpu 0000:06:00.0: amdgpu: SMU is resumed successfully!
> [ 585.602750] amdgpu 0000:06:00.0: amdgpu: kiq ring mec 2 pipe 1 q 0
> [ 585.607508] amdgpu 0000:06:00.0: amdgpu: [drm] DMUB hardware initialized:
> version=0x05002C00
> [ 585.880737] amdgpu 0000:06:00.0: amdgpu: ring gfx_0.0.0 uses VM inv eng 0
> on hub 0
> [ 585.880742] amdgpu 0000:06:00.0: amdgpu: ring gfx_0.1.0 uses VM inv eng 1
> on hub 0
> [ 585.880743] amdgpu 0000:06:00.0: amdgpu: ring comp_1.0.0 uses VM inv eng 4
> on hub 0
> [ 585.880744] amdgpu 0000:06:00.0: amdgpu: ring comp_1.1.0 uses VM inv eng 5
> on hub 0
> [ 585.880745] amdgpu 0000:06:00.0: amdgpu: ring comp_1.2.0 uses VM inv eng 6
> on hub 0
> [ 585.880746] amdgpu 0000:06:00.0: amdgpu: ring comp_1.3.0 uses VM inv eng 7
> on hub 0
> [ 585.880747] amdgpu 0000:06:00.0: amdgpu: ring comp_1.0.1 uses VM inv eng 8
> on hub 0
> [ 585.880748] amdgpu 0000:06:00.0: amdgpu: ring comp_1.1.1 uses VM inv eng 9
> on hub 0
> [ 585.880749] amdgpu 0000:06:00.0: amdgpu: ring comp_1.2.1 uses VM inv eng 10
> on hub 0
> [ 585.880751] amdgpu 0000:06:00.0: amdgpu: ring comp_1.3.1 uses VM inv eng 11
> on hub 0
> [ 585.880752] amdgpu 0000:06:00.0: amdgpu: ring kiq_0.2.1.0 uses VM inv eng
> 12 on hub 0
> [ 585.880753] amdgpu 0000:06:00.0: amdgpu: ring sdma0 uses VM inv eng 13 on
> hub 0
> [ 585.880754] amdgpu 0000:06:00.0: amdgpu: ring vcn_dec_0 uses VM inv eng 0
> on hub 8
> [ 585.880755] amdgpu 0000:06:00.0: amdgpu: ring vcn_enc_0.0 uses VM inv eng 1
> on hub 8
> [ 585.880756] amdgpu 0000:06:00.0: amdgpu: ring vcn_enc_0.1 uses VM inv eng 4
> on hub 8
> [ 585.880757] amdgpu 0000:06:00.0: amdgpu: ring jpeg_dec uses VM inv eng 5 on
> hub 8
> [ 585.884345] amdgpu 0000:06:00.0: amdgpu: GPU reset(1) succeeded!
> [ 585.884371] amdgpu 0000:06:00.0: [drm] device wedged, but recovered through
> reset
> [ 585.897300] amdgpu 0000:06:00.0: amdgpu: [drm] *ERROR* Failed to initialize
> parser -125!
>
> I'm on an ASUS laptop with Ryzen 7940HX/Radeon 610M. I'm using a distribution
> kernel, but the maintainers are slow to respond, so forgive me for sending
> messages here. I use a custom kernel command line amdgpu.dcdebugmask=0x10 to
> work around kernel lockup problems, which is a separate problem that's been
> around since ~6.12.
>
> I've collected more dmesg logs other than what's shown above, as well as
> device coredumps from /sys/class/drm/card/device/devcoredump/data. I'm also
> happy to help with bisecting the problem if it's not too large. Let me know
> how I could help.

Please file a ticket here:
https://gitlab.freedesktop.org/drm/amd/-/issues
and if you could bisect, that would be really helpful.

Thanks!

Alex
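
When comparing logs between kernels during a bisect, it may help to reduce each dmesg capture to a few counts. The following is only a minimal sketch: the function name summarize_amdgpu_log and the SAMPLE text are hypothetical, and the patterns are derived solely from the messages quoted in this thread, not from any stable kernel interface. It assumes raw, unwrapped dmesg output (not the mail-wrapped form quoted above).

```python
import re

# Patterns are assumptions based only on the messages quoted in this thread;
# the exact wording may differ between kernel versions.
TIMEOUT_RE = re.compile(r"ring (\S+) timeout, signaled seq=(\d+), emitted seq=(\d+)")
RING_RESET_FAILED_RE = re.compile(r"Ring (\S+) reset failed")
GPU_RESET_OK_RE = re.compile(r"GPU reset\(\d+\) succeeded")

# Hypothetical sample: a few of the quoted lines, unwrapped.
SAMPLE = """\
[ 585.111844] amdgpu 0000:06:00.0: amdgpu: ring gfx_0.1.0 timeout, signaled seq=31692, emitted seq=31694
[ 585.111849] amdgpu 0000:06:00.0: amdgpu: Starting gfx_0.1.0 ring reset
[ 585.269485] amdgpu 0000:06:00.0: amdgpu: Ring gfx_0.1.0 reset failed
[ 585.338731] amdgpu 0000:06:00.0: amdgpu: GPU reset succeeded, trying to resume
[ 585.884345] amdgpu 0000:06:00.0: amdgpu: GPU reset(1) succeeded!
"""


def summarize_amdgpu_log(text):
    """Collect ring timeouts, failed ring resets, and completed GPU resets."""
    summary = {"timeouts": [], "ring_reset_failures": [], "gpu_resets_succeeded": 0}
    for line in text.splitlines():
        match = TIMEOUT_RE.search(line)
        if match:
            summary["timeouts"].append({
                "ring": match.group(1),
                "signaled": int(match.group(2)),
                "emitted": int(match.group(3)),
            })
            continue
        match = RING_RESET_FAILED_RE.search(line)
        if match:
            summary["ring_reset_failures"].append(match.group(1))
            continue
        # Counts only the final "GPU reset(N) succeeded!" message, not the
        # intermediate "GPU reset succeeded, trying to resume" line.
        if GPU_RESET_OK_RE.search(line):
            summary["gpu_resets_succeeded"] += 1
    return summary
```

Calling summarize_amdgpu_log on the saved dmesg text from each tested kernel gives a quick way to tell whether a given build shows the ring timeout at all, which is the good/bad decision a bisect step needs.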
