That's a GPU page fault. Something in the userspace command stream caused the GPU to access an unmapped page.
Alex

On Tue, Jul 1, 2025 at 5:39 AM Johl Brown <johlbr...@gmail.com> wrote:
> Hi all, hoping I'm still on-side... Thank you for your consideration.
> Linux archb 6.14.0-rt3-arch1-1-rt #1 SMP PREEMPT_RT Wed, 21 May 2025
> 13:21:26 +0000 x86_64 GNU/Linux
>
> AMDGPU sequence
> Time      Message
> 19:29:29  *GPU fault detected* (0x00020802) for process *kdeconnect-app
>           (pid 2285)*; VM fault at page 2048, write from *TC0*.
> 19:29:29  Second fault (0x0000880c) for same process; VM fault at page 0,
>           read from *TC6*.
> 19:29:39  *ring gfx timeout* (signaled seq 699, emitted seq 701) →
>           “Starting gfx ring reset” → *Ring gfx reset failure*.
> 19:29:40  Self-tests: ring comp_1.0.1 test failed (-110) and ring
>           comp_1.2.1 test failed (-110).
>
>
> On Thu, 26 Jun 2025 at 10:38, Johl Brown <johlbr...@gmail.com> wrote:
>
>> Apologies, I believe it was attached to one of the above posts. Please
>> find complete dmesg attached.
>>
>> I had previously attempted to GDB/Ghidra at (
>> https://github.com/lamikr/rocm_sdk_builder/issues/173 ) while
>> experiencing segfaults on previous kernels/roc.
>> Around Nov 3, 2024 (I can't see any comment I made there about kernel
>> version but currently Linux archb 6.14.0-rt3-arch1-1-rt #1 SMP PREEMPT_RT
>> Wed, 21 May 2025 13:21:26 +0000 x86_64 GNU/Linux.
>> I'm just testing rt due to easyeffects glitches but generally I run
>> mainline kernel and update roughly weekly so the kernel should be
>> current for that time period)
>> eg:
>>
>> /opt/rocm_sdk_612/bin/hipcc hello_world.o -fPIE -o hello_world
>> ./hello_world
>> System minor: 0
>> System major: 8
>> Agent name: AMD Radeon RX 580 Series
>> Kernel input: GdkknVnqkc
>> Expecting that kernel increases each character from input string by one
>> make: *** [Makefile:18: test] Segmentation fault (core dumped)
>> System minor: 0
>> System major: 8
>> Agent name: AMD Radeon RX 580 Series
>> Kernel input: GdkknVnqkc
>> Expecting that kernel increases each character from input string by one
>> Segmentation fault (core dumped)
>>
>>
>> [New Thread 0x7fffecaea6c0 (LWP 2980691)]
>> [New Thread 0x7fffe7fff6c0 (LWP 2980692)]
>> [Thread 0x7fffe7fff6c0 (LWP 2980692) exited]
>> System minor: 0
>> System major: 8
>> Agent name: AMD Radeon RX 580 Series
>> Kernel input: GdkknVnqkc
>> Expecting that kernel increases each character from input string by one
>>
>> Thread 1 "hello_world" received signal SIGSEGV, Segmentation fault.
>> 0x00007ffff7db0fbd in ?? ()
>>    from /opt/rocm_sdk_612/lib64/libamdhip64.so.6
>> (gdb) bt
>> #0  0x00007ffff7db0fbd in ?? ()
>>    from /opt/rocm_sdk_612/lib64/libamdhip64.so.6
>> #1  0x00007ffff7c1497f in ?? ()
>>    from /opt/rocm_sdk_612/lib64/libamdhip64.so.6
>> #2  0x00007ffff7c14c74 in ?? ()
>>    from /opt/rocm_sdk_612/lib64/libamdhip64.so.6
>> #3  0x00007ffff7c14e3e in ?? ()
>>    from /opt/rocm_sdk_612/lib64/libamdhip64.so.6
>> #4  0x00005555555555bf in main (argc=<optimized out>,
>>     argv=<optimized out>) at hello_world.cpp:69
>> (gdb)
>>
>> Line 69 (nice) is res = hipMemcpy(inputBuffer, input, (strlength + 1) *
>> sizeof(char), hipMemcpyHostToDevice); (see attached file jb_gdb_tester)
>>
>> https://github.com/robertrosenbusch/gfx803_rocm/issues/35
>>
>> One love!!
>>
>> On Thu, 26 Jun 2025 at 10:10, Felix Kuehling <felix.kuehl...@amd.com>
>> wrote:
>>
>>> I couldn't find a dmesg attached to the linked bug reports. I was going
>>> to look for a kernel oops from calling an uninitialized function pointer.
>>> Your patch addresses just that.
>>>
>>> I'm not sure how “drm/amdkfd: Improve signal event slow path” is
>>> implicated. I don't see anything in that patch that would break
>>> specifically on gfx v803.
>>>
>>> Regards,
>>>   Felix
>>>
>>> On 2025-06-25 18:21, Alex Deucher wrote:
>>> > Adding folks from the KFD team to take a look. Thank you for
>>> > bisecting. Does the attached patch fix it?
>>> >
>>> > Thanks,
>>> >
>>> > Alex
>>> >
>>> > On Wed, Jun 25, 2025 at 12:33 AM Johl Brown <johlbr...@gmail.com> wrote:
>>> >> Good Afternoon and best wishes!
>>> >> This is my first attempt at upstreaming an issue after dailying arch
>>> >> for a full year now :)
>>> >> Please forgive me, a lot of this is pushing my comfort zone, but
>>> >> preventing needless e-waste is important to me personally :) with
>>> >> this in mind, I will save your eyeballs and let you know I did use
>>> >> gpt to help compile the below, but I have proofread it several times
>>> >> (which means you can't be mad :p ).
>>> >>
>>> >>
>>> >> https://github.com/ROCm/ROCm/issues/4965
>>> >> https://github.com/robertrosenbusch/gfx803_rocm/issues/35#issuecomment-2996884779
>>> >>
>>> >>
>>> >> Hello Kernel, AMD GPU, & ROCm maintainers,
>>> >>
>>> >> TL;DR: My Polaris (RX-580, gfx803) freezes under compute load on a
>>> >> number of kernels from v6.14 onward. This was not the case prior to
>>> >> 6.15 for ROCm 6.4.0 on gfx803 cards.
>>> >>
>>> >> The issue has been successfully mitigated within an older version of
>>> >> ROC under kernel 6.16rc2 by reverting two specific commits:
>>> >>
>>> >> de84484c6f8b (“drm/amdkfd: Improve signal event slow path”, 2024-12-19)
>>> >>
>>> >> bac38ca057fe (“drm/amdkfd: implement per queue sdma reset for gfx 9.4+”, 2025-03-06)
>>> >>
>>> >> Reverting both commits on top of v6.16-rc3 restores full stability
>>> >> and allows ROCm 5.7 workloads (e.g., Stable-Diffusion, faster-whisper)
>>> >> to run. Instability is usually immediately obvious via eg models
>>> >> failing to initialise, no errors (other than host dmesg)/segfault
>>> >> reported, which is the usual failure method under previous kernels.
>>> >>
>>> >> ________________________________
>>> >>
>>> >> Problem Description
>>> >>
>>> >> A number of users report GPU hangs when initialising compute loads,
>>> >> specifically with ROCm 5.7+ workloads. This issue appears to be a
>>> >> regression, as it was not present in earlier kernel versions.
>>> >>
>>> >> System Information:
>>> >>
>>> >> OS: Arch Linux
>>> >>
>>> >> CPU: Intel(R) Core(TM) i7-7700K CPU @ 4.20GHz
>>> >>
>>> >> GPU: AMD Radeon RX 580 Series (gfx803)
>>> >>
>>> >> ROCm Version: Runtime Version: 1.1, Runtime Ext Version: 1.7 (as per
>>> >> rocminfo --support)
>>> >>
>>> >> ________________________________
>>> >>
>>> >> Affected Kernels and Regression Details
>>> >>
>>> >> The problem consistently occurs on v6.14.1-rc1 and newer kernels.
>>> >>
>>> >> Last known good: v6.11
>>> >>
>>> >> First known bad: v6.12
>>> >>
>>> >> The regression has been bisected to the following two commits, as
>>> >> reverting them resolves the issue:
>>> >>
>>> >> de84484c6f8b (“drm/amdkfd: Improve signal event slow path”, 2024-12-19)
>>> >>
>>> >> bac38ca057fe (“drm/amdkfd: implement per queue sdma reset …”, 2025-03-06)
>>> >>
>>> >> Both patches touch amdkfd queue reset paths and are first included in
>>> >> the exact releases where the regression appears.
>>> >>
>>> >> Here's a summary of kernel results:
>>> >>
>>> >> Kernel                              | Result | Note
>>> >> ----------------------------------- | ------ | ----
>>> >> 6.13.y (LTS)                        | OK     |
>>> >> 6.14.0                              | OK     | Baseline - my last working kernel, though I am not exactly sure which subver
>>> >> 6.14.1-rc1                          | BAD    | First hang
>>> >> 6.15-rc1                            | BAD    | Hang
>>> >> 6.15.8                              | BAD    | Hang
>>> >> 6.16-rc3                            | BAD    | Hang
>>> >> 6.16-rc3 – revert de84484 + bac38ca | OK     | Full stability restored, ROCm workloads run for hours.
>>> >>
>>> >> ________________________________
>>> >>
>>> >> Reproduction Steps
>>> >>
>>> >> Boot the system with a kernel version exhibiting the issue (e.g.,
>>> >> v6.14.1-rc1 or newer without the reverts).
>>> >>
>>> >> Run a ROCm workload that creates several compute queues, for example:
>>> >>
>>> >> python stable-diffusion.py
>>> >>
>>> >> faster-whisper --model medium ...
>>> >>
>>> >> Upon model initialization, an immediate driver crash occurs. This is
>>> >> visible on the host machine via dmesg logs.
>>> >>
>>> >> Observed Error Messages (dmesg):
>>> >>
>>> >> [drm] scheduler comp_1.1.1 is not ready, skipping
>>> >> [drm:sched_job_timedout] ERROR ring comp_1.1.1 timeout
>>> >> [message continues ad-infinitum while system functions generally]
>>> >>
>>> >> This is followed by a hard GPU reset (visible in logs, no visual
>>> >> artifacts), which reliably leads to a full system lockup. Python or
>>> >> Docker processes become unkillable, requiring a manual reboot. Over
>>> >> time, the desktop slowly loses interactivity.
>>> >>
>>> >> ________________________________
>>> >>
>>> >> Bisect Details
>>> >>
>>> >> I previously attempted a git bisect (limited to drivers/gpu/drm/amd)
>>> >> between v6.12 and v6.15-rc1, which identified some further potentially
>>> >> problematic commits, however due to undersized /boot/ partition was
>>> >> experiencing some difficulties.
>>> >> In the interim, it seems a user on the gfx803 compatibility repo
>>> >> discovered the below regarding ROC 5.7:
>>> >>
>>> >> de84484c6f8b07ad0850d6c4 bad
>>> >> bac38ca057fef2c8c024fe9e bad
>>> >>
>>> >> Cherry-picking reverts of both commits on top of v6.16-rc3 restores
>>> >> normal behavior; leaving either patch in place reproduces the hang.
>>> >>
>>> >> ________________________________
>>> >>
>>> >> Relevant Log Excerpts
>>> >>
>>> >> (Full dmesg logs can be attached separately if needed)
>>> >>
>>> >> [drm] scheduler comp_1.1.1 is not ready, skipping
>>> >> [ 97.602622] amdgpu 0000:08:00.0: amdgpu: ring comp_1.1.1 timeout, signaled seq=123456 emitted seq=123459
>>> >> [ 97.602630] amdgpu 0000:08:00.0: amdgpu: GPU recover succeeded, reset domain time = 2ms
>>> >>
>>> >> ________________________________
>>> >> References:
>>> >>
>>> >> It's back: Log spam: [drm] scheduler comp_1.0.2 is not ready,
>>> >> skipping ... (https://bbs.archlinux.org/viewtopic.php?id=302729)
>>> >>
>>> >> Observations about HSA and KFD backends in TinyGrad · GitHub
>>> >> (https://gist.github.com/fxkamd/ffd02d66a2863e444ec208ea4f3adc48)
>>> >>
>>> >> AMD RX580 system freeze on maximum VRAM speed
>>> >> (https://discussion.fedoraproject.org/t/amd-rx580-system-freeze-on-maximum-vram-speed/136639)
>>> >>
>>> >> LKML: Linus Torvalds: Re: [git pull] drm fixes for 6.15-rc1
>>> >> (https://lkml.org/lkml/2025/4/5/394)
>>> >>
>>> >> Commits · torvalds/linux - GitHub (Link for commit de84484)
>>> >> (https://github.com/torvalds/linux/commits?before=805ba04cb7ccfc7d72e834ebd796e043142156ba+6335)
>>> >>
>>> >> Commits · torvalds/linux - GitHub (Link for commit bac38ca)
>>> >> (https://github.com/torvalds/linux/commits?before=5bc1018675ec28a8a60d83b378d8c3991faa5a27+7980)
>>> >>
>>> >> ROCm-For-RX580/README.md at main - GitHub
>>> >> (https://github.com/woodrex83/ROCm-For-RX580/blob/main/README.md)
>>> >>
>>> >> ROCm 4.6.0 for gfx803 - GitHub (
>>> >> https://github.com/robertrosenbusch/gfx803_rocm/issues/35#issuecomment-2996884779 )
>>> >>
>>> >> Compatibility matrices - Use ROCm on Radeon GPUs - AMD
>>> >> (https://rocm.docs.amd.com/projects/radeon/en/latest/docs/compatibility.html)
>>> >>
>>> >>
>>> >> ________________________________
>>> >>
>>> >> Why this matters
>>> >>
>>> >> Although gfx803 is End-of-Life (EOL) for official ROCm support, large
>>> >> user communities (Stable-Diffusion, Whisper, Tinygrad) still depend on
>>> >> it. Community builds (e.g., github.com/robertrosenbusch/gfx803_rocm/)
>>> >> demonstrate that ROCm 6.4+ and RX-580 are fully functional on a number
>>> >> of relatively recent kernels. This regression significantly impacts
>>> >> the usability of these cards for compute workloads.
>>> >>
>>> >> ________________________________
>>> >>
>>> >> Proposed Next Steps
>>> >>
>>> >> I suggest the following for further investigation:
>>> >>
>>> >> Review the interaction between the new KFD signal-event slow-path and
>>> >> legacy GPUs that may lack valid event IDs.
>>> >>
>>> >> Confirm whether hqd_sdma_get_doorbell() logic (added in bac38ca)
>>> >> returns stale doorbells on gfx803, potentially causing false positives.
>>> >>
>>> >> Consider back-outs for 6.15-stable / 6.16-rc while a proper fix is
>>> >> developed.
>>> >>
>>> >> Please let me know if you require any further diagnostics or testing.
>>> >> I can easily rebuild kernels and provide annotated traces.
>>> >>
>>> >> Please find my working document:
>>> >> https://chatgpt.com/share/6854bef2-c69c-8002-a243-a06c67a2c066
>>> >>
>>> >> Thanks for your time!
>>> >>
>>> >> Best regards, big love,
>>> >>
>>> >> Johl Brown
>>> >>
>>> >> johlbr...@gmail.com
>>> >>
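
For anyone triaging similar logs, here is a minimal sketch that tallies the kinds of amdgpu messages quoted in this thread: VM faults, ring timeouts, failed ring tests, and "scheduler ... is not ready" spam. It assumes dmesg lines shaped like the excerpts above. The names PATTERNS, SAMPLE, classify and summarize are hypothetical helpers for illustration, not part of any amdgpu or ROCm tool, and the exact message wording varies between kernel versions, so the regular expressions may need adjusting.

```python
import re
from collections import Counter

# Patterns modelled on the messages quoted in this thread (an assumption:
# real dmesg wording differs between kernel versions).
PATTERNS = {
    "vm_fault": re.compile(r"VM fault.*? at page (\d+), (read|write) from"),
    "ring_timeout": re.compile(
        r"ring (\S+) timeout(?:, signaled seq=(\d+),? emitted seq=(\d+))?"
    ),
    "ring_test_failed": re.compile(r"ring (\S+) test failed \((-?\d+)\)"),
    "sched_not_ready": re.compile(r"scheduler (\S+) is not ready"),
}

# The "Relevant Log Excerpts" lines from the original report.
SAMPLE = """\
[drm] scheduler comp_1.1.1 is not ready, skipping
[ 97.602622] amdgpu 0000:08:00.0: amdgpu: ring comp_1.1.1 timeout, signaled seq=123456 emitted seq=123459
[ 97.602630] amdgpu 0000:08:00.0: amdgpu: GPU recover succeeded, reset domain time = 2ms
"""


def classify(line):
    """Return (kind, captured_groups) for the first matching pattern, else None."""
    for kind, pattern in PATTERNS.items():
        match = pattern.search(line)
        if match:
            return kind, match.groups()
    return None


def summarize(text):
    """Count how many lines of `text` fall into each message kind."""
    counts = Counter()
    for line in text.splitlines():
        hit = classify(line)
        if hit:
            counts[hit[0]] += 1
    return dict(counts)


print(summarize(SAMPLE))
```

To use it on a real log, pass the text of a saved dmesg capture to summarize(). A ring timeout that appears together with VM faults for the same process points at the page-fault case described at the top of this mail, while timeouts without any fault are a different failure mode.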