Hi,

Starting with kernel version 6.18, I'm experiencing frequent GPU failures and
resets that render the computer nearly unusable. The screen flickers and then
either blacks out (most cases) or recovers (fewer cases). Even if I switch to
another GPU and use the Radeon GPU only for rendering, it can still fail and
eventually kill the application running on it. The problem isn't present in
6.17.

My dmesg logs show something like this (a successful reset):

[  585.109939] amdgpu 0000:06:00.0: amdgpu: Dumping IP State
[  585.111758] amdgpu 0000:06:00.0: amdgpu: Dumping IP State Completed
[  585.111839] amdgpu 0000:06:00.0: amdgpu: [drm] AMDGPU device coredump file has been created
[  585.111841] amdgpu 0000:06:00.0: amdgpu: [drm] Check your /sys/class/drm/card2/device/devcoredump/data
[  585.111844] amdgpu 0000:06:00.0: amdgpu: ring gfx_0.1.0 timeout, signaled seq=31692, emitted seq=31694
[  585.111847] amdgpu 0000:06:00.0: amdgpu:  Process kwin_wayland pid 114 thread kwin_wayla:cs0 pid 514
[  585.111849] amdgpu 0000:06:00.0: amdgpu: Starting gfx_0.1.0 ring reset
[  585.269485] amdgpu 0000:06:00.0: amdgpu: Ring gfx_0.1.0 reset failed
[  585.269490] amdgpu 0000:06:00.0: amdgpu: GPU reset begin!. Source:  1
[  585.331433] amdgpu 0000:06:00.0: amdgpu: MODE2 reset
[  585.338731] amdgpu 0000:06:00.0: amdgpu: GPU reset succeeded, trying to resume
[  585.339090] [drm] PCIE GART of 1024M enabled (table at 0x000000F41FC00000).
[  585.339113] amdgpu 0000:06:00.0: amdgpu: PSP is resuming...
[  585.361053] amdgpu 0000:06:00.0: amdgpu: reserve 0xa00000 from 0xf41e000000 for PSP TMR
[  585.593433] amdgpu 0000:06:00.0: amdgpu: RAS: optional ras ta ucode is not available
[  585.602279] amdgpu 0000:06:00.0: amdgpu: RAP: optional rap ta ucode is not available
[  585.602281] amdgpu 0000:06:00.0: amdgpu: SECUREDISPLAY: optional securedisplay ta ucode is not available
[  585.602282] amdgpu 0000:06:00.0: amdgpu: SMU is resuming...
[  585.602569] amdgpu 0000:06:00.0: amdgpu: SMU is resumed successfully!
[  585.602750] amdgpu 0000:06:00.0: amdgpu: kiq ring mec 2 pipe 1 q 0
[  585.607508] amdgpu 0000:06:00.0: amdgpu: [drm] DMUB hardware initialized: version=0x05002C00
[  585.880737] amdgpu 0000:06:00.0: amdgpu: ring gfx_0.0.0 uses VM inv eng 0 on hub 0
[  585.880742] amdgpu 0000:06:00.0: amdgpu: ring gfx_0.1.0 uses VM inv eng 1 on hub 0
[  585.880743] amdgpu 0000:06:00.0: amdgpu: ring comp_1.0.0 uses VM inv eng 4 on hub 0
[  585.880744] amdgpu 0000:06:00.0: amdgpu: ring comp_1.1.0 uses VM inv eng 5 on hub 0
[  585.880745] amdgpu 0000:06:00.0: amdgpu: ring comp_1.2.0 uses VM inv eng 6 on hub 0
[  585.880746] amdgpu 0000:06:00.0: amdgpu: ring comp_1.3.0 uses VM inv eng 7 on hub 0
[  585.880747] amdgpu 0000:06:00.0: amdgpu: ring comp_1.0.1 uses VM inv eng 8 on hub 0
[  585.880748] amdgpu 0000:06:00.0: amdgpu: ring comp_1.1.1 uses VM inv eng 9 on hub 0
[  585.880749] amdgpu 0000:06:00.0: amdgpu: ring comp_1.2.1 uses VM inv eng 10 on hub 0
[  585.880751] amdgpu 0000:06:00.0: amdgpu: ring comp_1.3.1 uses VM inv eng 11 on hub 0
[  585.880752] amdgpu 0000:06:00.0: amdgpu: ring kiq_0.2.1.0 uses VM inv eng 12 on hub 0
[  585.880753] amdgpu 0000:06:00.0: amdgpu: ring sdma0 uses VM inv eng 13 on hub 0
[  585.880754] amdgpu 0000:06:00.0: amdgpu: ring vcn_dec_0 uses VM inv eng 0 on hub 8
[  585.880755] amdgpu 0000:06:00.0: amdgpu: ring vcn_enc_0.0 uses VM inv eng 1 on hub 8
[  585.880756] amdgpu 0000:06:00.0: amdgpu: ring vcn_enc_0.1 uses VM inv eng 4 on hub 8
[  585.880757] amdgpu 0000:06:00.0: amdgpu: ring jpeg_dec uses VM inv eng 5 on hub 8
[  585.884345] amdgpu 0000:06:00.0: amdgpu: GPU reset(1) succeeded!
[  585.884371] amdgpu 0000:06:00.0: [drm] device wedged, but recovered through reset
[  585.897300] amdgpu 0000:06:00.0: amdgpu: [drm] *ERROR* Failed to initialize parser -125!
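
For going through the other saved logs, a minimal Python sketch along the
following lines could pull out the ring timeouts. It first rejoins lines that
were wrapped in transit and then matches the "ring ... timeout" message. The
function names and regular expressions are assumptions based only on the
message format shown above, not anything taken from the amdgpu driver.

```python
import re

# Start of a dmesg record, e.g. "[  585.111844] ".
RECORD_START = re.compile(r"^\[\s*\d+\.\d+\]")

# Ring timeout message, in the format seen in the log above (assumed stable).
TIMEOUT = re.compile(
    r"\[\s*(?P<ts>\d+\.\d+)\].*ring (?P<ring>\S+) timeout, "
    r"signaled seq=(?P<signaled>\d+), emitted seq=(?P<emitted>\d+)"
)


def unwrap(text):
    """Join wrapped continuation lines back onto their dmesg record."""
    records = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if RECORD_START.match(line) or not records:
            # A new timestamped record (or the very first line seen).
            records.append(line)
        else:
            # No timestamp: treat as the wrapped tail of the previous record.
            records[-1] += " " + line
    return records


def ring_timeouts(text):
    """Return one (timestamp, ring, signaled_seq, emitted_seq) per timeout."""
    found = []
    for record in unwrap(text):
        m = TIMEOUT.search(record)
        if m:
            found.append(
                (float(m["ts"]), m["ring"], int(m["signaled"]), int(m["emitted"]))
            )
    return found
```

Calling ring_timeouts() on the text of a saved log (for example the contents
of a hypothetical dmesg.txt) gives one tuple per timeout, which makes it easy
to see which ring hangs and how far the emitted sequence number was ahead of
the signaled one.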

I'm on an ASUS laptop with a Ryzen 7940HX/Radeon 610M. I'm using a distribution
kernel, but its maintainers are slow to respond, so please forgive me for
reporting this here. I boot with amdgpu.dcdebugmask=0x10 on the kernel command
line to work around kernel lockups, a separate problem that has been around
since ~6.12.

I've collected more dmesg logs in addition to the one shown above, as well as
device coredumps from /sys/class/drm/card/device/devcoredump/data. I'm also
happy to help with bisecting the problem if the bisection isn't too large. Let
me know how I can help.

Best regards,
Yunchen
