Re: [REGRESSION] RX-580 (gfx803) GPU hangs since ~v6.14.1 – “scheduler comp_1.1.1 is not ready” / ROCm 5.7-6.4+ broken

Alex Deucher Tue, 01 Jul 2025 12:01:17 -0700

That's a GPU page fault.  Something in the userspace command stream caused
the GPU to access an unmapped page.
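
As a rough aid for reading fault reports like the one quoted below, here is a minimal Python sketch that pulls the page number, access type, and client out of a "VM fault at page N, read/write from X" message. The regex is written against the summarized wording in this thread, not against the raw amdgpu dmesg format, and the 4 KiB page size used to turn the page number into an address is an assumption. The helper name parse_vm_fault is made up for illustration.

```python
import re

# Matches the summarized wording quoted in this thread, e.g.
#   "VM fault at page 2048, write from *TC0*."
# Raw dmesg lines may look different; this pattern is an assumption.
FAULT_RE = re.compile(
    r"VM fault at page (?P<page>\d+),\s+"
    r"(?P<access>read|write) from \W*(?P<client>\w+)"
)

PAGE_SIZE = 4096  # assumed 4 KiB GPU page granularity


def parse_vm_fault(line):
    """Return a dict describing the fault, or None if the line does not match."""
    m = FAULT_RE.search(line)
    if m is None:
        return None
    page = int(m.group("page"))
    return {
        "page": page,
        "address": page * PAGE_SIZE,  # byte address under the assumed page size
        "access": m.group("access"),
        "client": m.group("client"),
    }
```

Under these assumptions, the second fault in the report ("page 0, read from TC6") maps to address 0, which would be consistent with a null or uninitialized GPU pointer, though the sketch alone cannot confirm that.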


Alex

On Tue, Jul 1, 2025 at 5:39 AM Johl Brown <johlbr...@gmail.com> wrote:

> Hi all, hoping I'm still on-side... Thank you for your consideration.
> Linux archb 6.14.0-rt3-arch1-1-rt #1 SMP PREEMPT_RT Wed, 21 May 2025
> 13:21:26 +0000 x86_64 GNU/Linux
>
> AMDGPU sequence
> Time Message
> 19:29:29 *GPU fault detected* (0x00020802) for process *kdeconnect-app
> (pid 2285)*; VM fault at page 2048, write from *TC0*.
> 19:29:29 Second fault (0x0000880c) for same process; VM fault at page 0,
> read from *TC6*.
> 19:29:39 *ring gfx timeout* (signaled seq 699, emitted seq 701) →
> “Starting gfx ring reset” → *Ring gfx reset failure*.
> 19:29:40 Self-tests: ring comp_1.0.1 test failed (-110) and ring
> comp_1.2.1 test failed (-110).
>
>
> On Thu, 26 Jun 2025 at 10:38, Johl Brown <johlbr...@gmail.com> wrote:
>
>> Apologies, I believe it was attached to one of the above posts. Please
>> find complete dmesg attached.
>>
>> I had previously attempted to debug this with GDB/Ghidra (see
>> https://github.com/lamikr/rocm_sdk_builder/issues/173 ) while
>> experiencing segfaults on previous kernels/ROCm versions, around Nov 3,
>> 2024. I can't see any comment I made there about the kernel version, but
>> I am currently on Linux archb 6.14.0-rt3-arch1-1-rt #1 SMP PREEMPT_RT
>> Wed, 21 May 2025 13:21:26 +0000 x86_64 GNU/Linux. I'm only testing rt due
>> to easyeffects glitches; generally I run the mainline kernel and update
>> roughly weekly, so the kernel should have been current for that time
>> period. E.g.:
>>
>> /opt/rocm_sdk_612/bin/hipcc hello_world.o -fPIE -o hello_world
>> ./hello_world
>>  System minor: 0
>>  System major: 8
>>  Agent name: AMD Radeon RX 580 Series
>> Kernel input: GdkknVnqkc
>> Expecting that kernel increases each character from input string by one
>> make: *** [Makefile:18: test] Segmentation fault (core dumped)
>>  System minor: 0
>>  System major: 8
>>  Agent name: AMD Radeon RX 580 Series
>> Kernel input: GdkknVnqkc
>> Expecting that kernel increases each character from input string by one
>> Segmentation fault (core dumped)
>>
>>
>> [New Thread 0x7fffecaea6c0 (LWP 2980691)]
>> [New Thread 0x7fffe7fff6c0 (LWP 2980692)]
>> [Thread 0x7fffe7fff6c0 (LWP 2980692) exited]
>>  System minor: 0
>>  System major: 8
>>  Agent name: AMD Radeon RX 580 Series
>> Kernel input: GdkknVnqkc
>> Expecting that kernel increases each character from input string by one
>>
>> Thread 1 "hello_world" received signal SIGSEGV, Segmentation fault.
>> 0x00007ffff7db0fbd in ?? ()
>>    from /opt/rocm_sdk_612/lib64/libamdhip64.so.6
>> (gdb) bt
>> #0  0x00007ffff7db0fbd in ?? ()
>>    from /opt/rocm_sdk_612/lib64/libamdhip64.so.6
>> #1  0x00007ffff7c1497f in ?? ()
>>    from /opt/rocm_sdk_612/lib64/libamdhip64.so.6
>> #2  0x00007ffff7c14c74 in ?? ()
>>    from /opt/rocm_sdk_612/lib64/libamdhip64.so.6
>> #3  0x00007ffff7c14e3e in ?? ()
>>    from /opt/rocm_sdk_612/lib64/libamdhip64.so.6
>> #4  0x00005555555555bf in main (argc=<optimized out>,
>>     argv=<optimized out>) at hello_world.cpp:69
>> (gdb)
>>
>> Line 69 (nice) is res = hipMemcpy(inputBuffer, input, (strlength + 1) *
>> sizeof(char), hipMemcpyHostToDevice); (see attached file jb_gdb_tester)
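
For reference, the result this test should print can be worked out on the host. A minimal Python sketch, assuming the kernel simply adds one to each character code, as the program's own message says (the function name increment_chars is made up for illustration):

```python
def increment_chars(text):
    """Add one to the code point of every character in text."""
    return "".join(chr(ord(ch) + 1) for ch in text)


print(increment_chars("GdkknVnqkc"))  # prints "HelloWorld"
```

No such output appears in the runs quoted above, which fits the backtrace: the segfault is raised inside the host-to-device hipMemcpy call, so the run appears to die before any result could be printed.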
>>
>>
>>
>> https://github.com/robertrosenbusch/gfx803_rocm/issues/35
>>
>>
>> One love!!
>>
>> On Thu, 26 Jun 2025 at 10:10, Felix Kuehling <felix.kuehl...@amd.com>
>> wrote:
>>
>>> I couldn't find a dmesg attached to the linked bug reports. I was going
>>> to look for a kernel oops from calling an uninitialized function pointer.
>>> Your patch addresses just that.
>>>
>>> I'm not sure how “drm/amdkfd: Improve signal event slow path” is
>>> implicated. I don't see anything in that patch that would break
>>> specifically on gfx v803.
>>>
>>> Regards,
>>>   Felix
>>>
>>> On 2025-06-25 18:21, Alex Deucher wrote:
>>> > Adding folks from the KFD team to take a look.  Thank you for
>>> > bisecting.  Does the attached patch fix it?
>>> >
>>> > Thanks,
>>> >
>>> > Alex
>>> >
>>> > On Wed, Jun 25, 2025 at 12:33 AM Johl Brown <johlbr...@gmail.com>
>>> wrote:
>>> >> Good Afternoon and best wishes!
>>> >> This is my first attempt at upstreaming an issue after dailying arch
>>> for a full year now :)
>>> >> Please forgive me, a lot of this is pushing my comfort zone, but
>>> preventing needless e-waste is important to me personally :) with this in
>>> mind, I will save your eyeballs and let you know I did use gpt to help
>>> compile the below, but I have proofread it several times (which means you
>>> can't be mad :p ).
>>> >>
>>> >>
>>> >> https://github.com/ROCm/ROCm/issues/4965
>>> >>
>>> https://github.com/robertrosenbusch/gfx803_rocm/issues/35#issuecomment-2996884779
>>> >>
>>> >>
>>> >> Hello Kernel, AMD GPU, & ROCm maintainers,
>>> >>
>>> >> TL;DR: My Polaris (RX-580, gfx803) freezes under compute load on a
>>> number of kernels since v6.14 and newer. This was not previously the case
>>> prior to 6.15 for ROCm 6.4.0 on gfx803 cards.
>>> >>
>>> >> The issue has been successfully mitigated within an older version of
>>> ROC under kernel 6.16rc2 by reverting two specific commits:
>>> >>
>>> >> de84484c6f8b (“drm/amdkfd: Improve signal event slow path”,
>>> 2024-12-19)
>>> >>
>>> >> bac38ca057fe (“drm/amdkfd: implement per queue sdma reset for gfx
>>> 9.4+”, 2025-03-06)
>>> >>
>>> >> Reverting both commits on top of v6.16-rc3 restores full stability
>>> and allows ROCm 5.7 workloads (e.g., Stable-Diffusion, faster-whisper) to
>>> run. Instability is usually immediately obvious via eg models failing to
>>> initialise, no errors (other than host dmesg)/segfault reported, which is
>>> the usual failure method under previous kernels.
>>> >>
>>> >> ________________________________
>>> >>
>>> >> Problem Description
>>> >>
>>> >> A number of users report GPU hangs when initialising compute loads,
>>> specifically with ROCm 5.7+ workloads. This issue appears to be a
>>> regression, as it was not present in earlier kernel versions.
>>> >>
>>> >> System Information:
>>> >>
>>> >> OS: Arch Linux
>>> >>
>>> >> CPU: Intel(R) Core(TM) i7-7700K CPU @ 4.20GHz
>>> >>
>>> >> GPU: AMD Radeon RX 580 Series (gfx803)
>>> >>
>>> >> ROCm Version: Runtime Version: 1.1, Runtime Ext Version: 1.7 (as per
>>> rocminfo --support)
>>> >>
>>> >> ________________________________
>>> >>
>>> >> Affected Kernels and Regression Details
>>> >>
>>> >> The problem consistently occurs on v6.14.1-rc1 and newer kernels.
>>> >>
>>> >> Last known good: v6.11
>>> >>
>>> >> First known bad: v6.12
>>> >>
>>> >> The regression has been bisected to the following two commits, as
>>> reverting them resolves the issue:
>>> >>
>>> >> de84484c6f8b (“drm/amdkfd: Improve signal event slow path”,
>>> 2024-12-19)
>>> >>
>>> >> bac38ca057fe (“drm/amdkfd: implement per queue sdma reset …”,
>>> 2025-03-06)
>>> >>
>>> >> Both patches touch amdkfd queue reset paths and are first included in
>>> the exact releases where the regression appears.
>>> >>
>>> >> Here's a summary of kernel results:
>>> >>
>>> >> Kernel | Result | Note
>>> >> ------- | -------- | --------
>>> >> 6.13.y (LTS) | OK |
>>> >> 6.14.0 | OK | Baseline - my last working kernel, though I am not exactly sure which subver
>>> >> 6.14.1-rc1 | BAD | First hang
>>> >> 6.15-rc1 | BAD | Hang
>>> >> 6.15.8 | BAD | Hang
>>> >> 6.16-rc3 | BAD | Hang
>>> >> 6.16-rc3 – revert de84484 + bac38ca | OK | Full stability restored, ROCm workloads run for hours.
>>> >>
>>> >> ________________________________
>>> >>
>>> >> Reproduction Steps
>>> >>
>>> >> Boot the system with a kernel version exhibiting the issue (e.g.,
>>> v6.14.1-rc1 or newer without the reverts).
>>> >>
>>> >> Run a ROCm workload that creates several compute queues, for example:
>>> >>
>>> >> python stable-diffusion.py
>>> >>
>>> >> faster-whisper --model medium ...
>>> >>
>>> >> Upon model initialization, an immediate driver crash occurs. This is
>>> visible on the host machine via dmesg logs.
>>> >>
>>> >> Observed Error Messages (dmesg):
>>> >>
>>> >> [drm] scheduler comp_1.1.1 is not ready, skipping
>>> >> [drm:sched_job_timedout] ERROR ring comp_1.1.1 timeout
>>> >> [message continues ad-infinitum while system functions generally]
>>> >>
>>> >> This is followed by a hard GPU reset (visible in logs, no visual
>>> artifacts), which reliably leads to a full system lockup. Python or Docker
>>> processes become unkillable, requiring a manual reboot. Over time, the
>>> desktop slowly loses interactivity.
>>> >>
>>> >> ________________________________
>>> >>
>>> >> Bisect Details
>>> >>
>>> >> I previously attempted a git bisect (limited to drivers/gpu/drm/amd)
>>> between v6.12 and v6.15-rc1, which identified some further potentially
>>> problematic commits; however, due to an undersized /boot partition I
>>> ran into some difficulties. In the interim, it seems a user on the
>>> gfx803 compatibility repo discovered the below regarding ROCm 5.7:
>>> >>
>>> >> de84484c6f8b07ad0850d6c4  bad
>>> >> bac38ca057fef2c8c024fe9e  bad
>>> >>
>>> >> Cherry-picking reverts of both commits on top of v6.16-rc3 restores
>>> normal behavior; leaving either patch in place reproduces the hang.
>>> >>
>>> >> ________________________________
>>> >>
>>> >> Relevant Log Excerpts
>>> >>
>>> >> (Full dmesg logs can be attached separately if needed)
>>> >>
>>> >> [drm] scheduler comp_1.1.1 is not ready, skipping
>>> >> [ 97.602622] amdgpu 0000:08:00.0: amdgpu: ring comp_1.1.1 timeout,
>>> signaled seq=123456 emitted seq=123459
>>> >> [ 97.602630] amdgpu 0000:08:00.0: amdgpu: GPU recover succeeded,
>>> reset domain time = 2ms
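
A minimal Python sketch for pulling the ring name and sequence numbers out of timeout lines like the one above. The regex follows the excerpt as quoted here, and the helper name parse_ring_timeout is hypothetical. The difference between the emitted and signaled sequence numbers gives the number of fences still outstanding when the timeout fired.

```python
import re

# Pattern follows the excerpt quoted in this thread:
#   "ring comp_1.1.1 timeout, signaled seq=123456 emitted seq=123459"
# An optional comma between the two seq fields is tolerated.
TIMEOUT_RE = re.compile(
    r"ring (?P<ring>\S+) timeout,\s*"
    r"signaled seq=(?P<signaled>\d+),?\s*"
    r"emitted seq=(?P<emitted>\d+)"
)


def parse_ring_timeout(line):
    """Return ring name, sequence numbers and pending count, or None."""
    m = TIMEOUT_RE.search(line)
    if m is None:
        return None
    signaled = int(m.group("signaled"))
    emitted = int(m.group("emitted"))
    return {
        "ring": m.group("ring"),
        "signaled": signaled,
        "emitted": emitted,
        "pending": emitted - signaled,  # fences emitted but not yet signaled
    }
```

Applied to the excerpt above, this reports ring comp_1.1.1 with 3 fences pending. Log lines wrapped by a mail client need to be joined back into one line before parsing.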
>>> >>
>>> >> ________________________________
>>> >> References:
>>> >>
>>> >> It's back: Log spam: [drm] scheduler comp_1.0.2 is not ready,
>>> skipping ... (https://bbs.archlinux.org/viewtopic.php?id=302729)
>>> >>
>>> >> Observations about HSA and KFD backends in TinyGrad · GitHub (
>>> https://gist.github.com/fxkamd/ffd02d66a2863e444ec208ea4f3adc48)
>>> >>
>>> >> AMD RX580 system freeze on maximum VRAM speed (
>>> https://discussion.fedoraproject.org/t/amd-rx580-system-freeze-on-maximum-vram-speed/136639
>>> )
>>> >>
>>> >> LKML: Linus Torvalds: Re: [git pull] drm fixes for 6.15-rc1 (
>>> https://lkml.org/lkml/2025/4/5/394)
>>> >>
>>> >> Commits · torvalds/linux - GitHub (Link for commit de84484) (
>>> https://github.com/torvalds/linux/commits?before=805ba04cb7ccfc7d72e834ebd796e043142156ba+6335
>>> )
>>> >>
>>> >> Commits · torvalds/linux - GitHub (Link for commit bac38ca) (
>>> https://github.com/torvalds/linux/commits?before=5bc1018675ec28a8a60d83b378d8c3991faa5a27+7980
>>> )
>>> >>
>>> >> ROCm-For-RX580/README.md at main - GitHub (
>>> https://github.com/woodrex83/ROCm-For-RX580/blob/main/README.md)
>>> >>
>>> >> ROCm 4.6.0 for gfx803 - GitHub (
>>> https://github.com/robertrosenbusch/gfx803_rocm/issues/35#issuecomment-2996884779
>>> )
>>> >>
>>> >> Compatibility matrices — Use ROCm on Radeon GPUs - AMD (
>>> https://rocm.docs.amd.com/projects/radeon/en/latest/docs/compatibility.html
>>> )
>>> >>
>>> >>
>>> >> ________________________________
>>> >>
>>> >> Why this matters
>>> >>
>>> >> Although gfx803 is End-of-Life (EOL) for official ROCm support, large
>>> user communities (Stable-Diffusion, Whisper, Tinygrad) still depend on it.
>>> Community builds (e.g., github.com/robertrosenbusch/gfx803_rocm/)
>>> demonstrate that ROCm 6.4+ and RX-580 are fully functional on a number of
>>> relatively recent kernels. This regression significantly impacts the
>>> usability of these cards for compute workloads.
>>> >>
>>> >> ________________________________
>>> >>
>>> >> Proposed Next Steps
>>> >>
>>> >> I suggest the following for further investigation:
>>> >>
>>> >> Review the interaction between the new KFD signal-event slow-path and
>>> legacy GPUs that may lack valid event IDs.
>>> >>
>>> >> Confirm whether hqd_sdma_get_doorbell() logic (added in bac38ca)
>>> returns stale doorbells on gfx803, potentially causing false positives.
>>> >>
>>> >> Consider back-outs for 6.15-stable / 6.16-rc while a proper fix is
>>> developed.
>>> >>
>>> >> Please let me know if you require any further diagnostics or testing.
>>> I can easily rebuild kernels and provide annotated traces.
>>> >>
>>> >> Please find my working document:
>>> https://chatgpt.com/share/6854bef2-c69c-8002-a243-a06c67a2c066
>>> >>
>>> >> Thanks for your time!
>>> >>
>>> >> Best regards, big love,
>>> >>
>>> >> Johl Brown
>>> >>
>>> >> johlbr...@gmail.com
>>>
>>
