On Wed, Apr 30, 2025 at 3:55 AM Borislav Petkov <b...@alien8.de> wrote:
>
> + amdgpu folks
>
> On Tue, Apr 29, 2025 at 02:51:56PM +0200, Marcus Rückert wrote:
> > Hardware:
> > - ASUS ROG Swift OLED PG27AQDP @ 480 Hz
> > - LG 27GL850-B @ 144 Hz
> > - XFX Mercury Radeon RX 9070 XT OC Gaming Edition with RGB, 16GB GDDR6, HDMI, 3x DP RX-97TRGBBB9
> > - Ryzen 9 9950X3D on ASUS ProArt X870E-Creator WiFi
> > - be quiet! Dark Power 13 850W ATX 3.0
> >
> > Software:
> > - kernel-default-6.15~rc4-1.1.g62ec7c7.x86_64 from
> >   https://build.opensuse.org/project/show/Kernel:HEAD
> > - Mesa-25.1+git442.5841d44f9-1747.1.x86_64 from
> >   https://build.opensuse.org/package/show/home:darix:playground/Mesa
> > - GE-Proton 9-27
> >   https://github.com/GloriousEggroll/proton-ge-custom/releases/tag/GE-Proton9-27
> > - Overwatch via Steam
> >
> > ```
> > [Mon Apr 28 23:10:56 2025] [ T10460] amdgpu 0000:03:00.0: amdgpu: Dumping IP State
> > [Mon Apr 28 23:10:56 2025] [ T10460] amdgpu 0000:03:00.0: amdgpu: Dumping IP State Completed
> > [Mon Apr 28 23:10:56 2025] [ T10460] amdgpu 0000:03:00.0: amdgpu: [drm] AMDGPU device coredump file has been created
> > [Mon Apr 28 23:10:56 2025] [ T10460] amdgpu 0000:03:00.0: amdgpu: [drm] Check your /sys/class/drm/card1/device/devcoredump/data
> > [Mon Apr 28 23:10:56 2025] [ T10460] amdgpu 0000:03:00.0: amdgpu: ring gfx_0.0.0 timeout, but soft recovered
> > [Mon Apr 28 23:11:07 2025] [ T10460] amdgpu 0000:03:00.0: amdgpu: Dumping IP State
> > [Mon Apr 28 23:11:07 2025] [ T10460] amdgpu 0000:03:00.0: amdgpu: Dumping IP State Completed
> > [Mon Apr 28 23:11:07 2025] [ T10460] amdgpu 0000:03:00.0: amdgpu: [drm] AMDGPU device coredump file has been created
> > [Mon Apr 28 23:11:07 2025] [ T10460] amdgpu 0000:03:00.0: amdgpu: [drm] Check your /sys/class/drm/card1/device/devcoredump/data
> > [Mon Apr 28 23:11:07 2025] [ T10460] amdgpu 0000:03:00.0: amdgpu: ring gfx_0.0.0 timeout, but soft recovered
> > ```
> >
> > I usually see this about once a day, but yesterday it was especially
> > bad:
> >
> > ```
> > Apr 28 21:46:57 kernel: amdgpu 0000:03:00.0: amdgpu: ring gfx_0.0.0 timeout, but soft recovered
> > Apr 28 21:47:08 kernel: amdgpu 0000:03:00.0: amdgpu: ring gfx_0.0.0 timeout, but soft recovered
> > Apr 28 21:47:18 kernel: amdgpu 0000:03:00.0: amdgpu: ring gfx_0.0.0 timeout, but soft recovered
> > Apr 28 21:47:28 kernel: amdgpu 0000:03:00.0: amdgpu: ring gfx_0.0.0 timeout, but soft recovered
> > Apr 28 21:54:34 kernel: amdgpu 0000:03:00.0: amdgpu: ring gfx_0.0.0 timeout, but soft recovered
> > Apr 28 22:00:40 kernel: amdgpu 0000:03:00.0: amdgpu: ring gfx_0.0.0 timeout, but soft recovered
> > Apr 28 22:00:50 kernel: amdgpu 0000:03:00.0: amdgpu: ring gfx_0.0.0 timeout, but soft recovered
> > Apr 28 22:01:00 kernel: amdgpu 0000:03:00.0: amdgpu: ring gfx_0.0.0 timeout, but soft recovered
> > Apr 28 23:10:56 kernel: amdgpu 0000:03:00.0: amdgpu: ring gfx_0.0.0 timeout, but soft recovered
> > Apr 28 23:11:07 kernel: amdgpu 0000:03:00.0: amdgpu: ring gfx_0.0.0 timeout, but soft recovered
> > ```
> >
> > Together with my coworker Patrik Jakobsson and Takashi Iwai we already
> > chased down a few other issues (like the dreaded flip_done).
> > But this last issue remains. We first backported some fixes to our 6.14.x
> > kernel for flip_done. To get even more fixes I switched to the 6.15~rc
> > kernels.
> >
> > I then also moved to Mesa 25.1~rc, which didn't fix it, so I am now on a
> > snapshot package of main.
> >
> > Some observations: while gaming I started running
> > https://github.com/Umio-Yasuno/amdgpu_top on the second monitor to see
> > whether overheating might be an issue.
> >
> > But the memory temps are at around 82 and the GPU core itself is usually
> > lower.
> > One observation is that the card is supposed to have a boost clock of
> > 3100 MHz, but amdgpu_top sees it boost over 3200. I tried both onboard
> > BIOSes and the behavior is the same.
> >
> > Currently I run both my Wayland session and my game with
> > RADV_DEBUG=nohiz, but that didn't provide more details. Adding nodcc
> > drops the performance from 480-500 Hz (the card could go faster, but I
> > limit the game to 500) to 330-360.
> >
> > Please let me know if I can provide more details.

Please make sure your kernel has these three patches:
https://web.git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=4408b59eeacfea777aae397177f49748cadde5ce
https://web.git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=afcdf51d97cd58dd7a2e0aa8acbaea5108fa6826
https://web.git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=366e77cd4923c3aa45341e15dcaf3377af9b042f

Soft recover kills stuck shaders, so I'd suggest trying a newer version of
Mesa and LLVM. If that doesn't help, please file a ticket here:
https://gitlab.freedesktop.org/drm/amd/-/issues/

Alex

> >
> > darix
> >
> > ```
> > --
> > Always remember:
> > Never accept the world as it appears to be.
> > Dare to see it for what it could be.
> > The world can always use more heroes.
> > ```
>
> --
> Regards/Gruss,
> Boris.
>
> https://people.kernel.org/tglx/notes-about-netiquette
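
As a rough illustration of the timeout pattern in the journal excerpt quoted in the report, the Python sketch below groups the "soft recovered" messages into bursts. It is not part of the original thread. The function names, the 15-second gap threshold, and the default year 2025 (taken from the dmesg timestamps in the report) are assumptions made for this example only. The sample lines are the journal entries from the report with the mail line wrapping undone.

```python
import re
from datetime import datetime, timedelta

# Journal excerpt from the report, one message per line (mail wrapping undone).
SAMPLE = """\
Apr 28 21:46:57 kernel: amdgpu 0000:03:00.0: amdgpu: ring gfx_0.0.0 timeout, but soft recovered
Apr 28 21:47:08 kernel: amdgpu 0000:03:00.0: amdgpu: ring gfx_0.0.0 timeout, but soft recovered
Apr 28 21:47:18 kernel: amdgpu 0000:03:00.0: amdgpu: ring gfx_0.0.0 timeout, but soft recovered
Apr 28 21:47:28 kernel: amdgpu 0000:03:00.0: amdgpu: ring gfx_0.0.0 timeout, but soft recovered
Apr 28 21:54:34 kernel: amdgpu 0000:03:00.0: amdgpu: ring gfx_0.0.0 timeout, but soft recovered
Apr 28 22:00:40 kernel: amdgpu 0000:03:00.0: amdgpu: ring gfx_0.0.0 timeout, but soft recovered
Apr 28 22:00:50 kernel: amdgpu 0000:03:00.0: amdgpu: ring gfx_0.0.0 timeout, but soft recovered
Apr 28 22:01:00 kernel: amdgpu 0000:03:00.0: amdgpu: ring gfx_0.0.0 timeout, but soft recovered
Apr 28 23:10:56 kernel: amdgpu 0000:03:00.0: amdgpu: ring gfx_0.0.0 timeout, but soft recovered
Apr 28 23:11:07 kernel: amdgpu 0000:03:00.0: amdgpu: ring gfx_0.0.0 timeout, but soft recovered
"""

# Matches a short-format journal timestamp followed by the soft-recovery message.
PATTERN = re.compile(
    r"^(?P<stamp>[A-Z][a-z]{2}\s+\d{1,2}\s+\d{2}:\d{2}:\d{2})\s.*"
    r"ring (?P<ring>\S+) timeout, but soft recovered"
)


def parse_timeouts(lines, year=2025):
    """Return the timestamps of all soft-recovered ring timeouts.

    The short journal format carries no year, so `year` is an assumption
    supplied by the caller.
    """
    events = []
    for line in lines:
        match = PATTERN.match(line.strip())
        if match:
            stamp = " ".join(match.group("stamp").split())
            events.append(datetime.strptime(f"{year} {stamp}", "%Y %b %d %H:%M:%S"))
    return events


def group_bursts(events, max_gap=timedelta(seconds=15)):
    """Group sorted timestamps into bursts separated by more than `max_gap`."""
    bursts = []
    for stamp in events:
        if bursts and stamp - bursts[-1][-1] <= max_gap:
            bursts[-1].append(stamp)
        else:
            bursts.append([stamp])
    return bursts


if __name__ == "__main__":
    for burst in group_bursts(parse_timeouts(SAMPLE.splitlines())):
        gaps = [int((b - a).total_seconds()) for a, b in zip(burst, burst[1:])]
        print(f"{burst[0]:%H:%M:%S}  events={len(burst)}  gaps_s={gaps}")
```

On the sample above this yields four bursts of 4, 1, 3, and 2 events, with 10 to 11 seconds between events inside a burst. The threshold is arbitrary; a different `max_gap` would split or merge bursts differently, and the grouping by itself says nothing about the cause of the timeouts.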