On Thu, 2025-05-01 at 09:32 -0400, Alex Deucher wrote:
> On Wed, Apr 30, 2025 at 7:28 PM Marcus Rückert <a...@nordisch.org>
> wrote:
> >
> > On Wed, 2025-04-30 at 09:55 -0400, Alex Deucher wrote:
> > > please make sure your kernel has these three patches:
> > > https://web.git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=4408b59eeacfea777aae397177f49748cadde5ce
> > > https://web.git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=afcdf51d97cd58dd7a2e0aa8acbaea5108fa6826
> > > https://web.git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=366e77cd4923c3aa45341e15dcaf3377af9b042f
> >
> > I am kinda sure that's the patches Takashi backported into our
> > 6.14.3.
> > They are already part of 6.15.rc4 no?
>
> Yes, I think so.

FWIW: I could trigger another flip_done timeout.
https://gitlab.freedesktop.org/drm/amd/-/issues/4201

A video stream (which might even be using hardware decoding) seems like a good trigger for this. I think most of my flip_done issues happened with Twitch running while I was doing something.

> > > soft recover kills stuck shaders, so I'd suggest trying a newer
> > > version of mesa and LLVM. If that doesn't help, please file a
> > > ticket
> > > here:
> >
> > Newer Mesa is building although I didnt see anything radv related.
> >
> > I am curious in
> > https://gitlab.freedesktop.org/drm/amd/-/issues/4192
> > there is a lot more details about the crash than what I see. with
> > what
> > kind of flags/environment variables do I have to run to get the
> > same?
> >
>
> That issue is directly related to suspend and resume. I.e., the
> issues only happen after a suspend cycle. Is that also what you are
> seeing?

Nope. I am only referencing it because it contains more details than I see, and I wonder what I have to do to get the same amount of extra detail so I can provide more useful information for you.

> > An observation from my latest crash:
> >
> > ```
> > May 01 01:05:59 steam[223306]: radv/amdgpu: The CS has been
> > cancelled
> > because the context is lost. This context is guilty of a soft
> > recovery.
> > May 01 01:06:05 steam[223306]: Game Recording - game stopped
> > [gameid=2357570]
> > May 01 01:06:05 steam[223306]: Removing process 352353 for gameID
> > 2357570
> > ```
> >
> > Is the game launched by steam inheriting that context or could it
> > really be the steam process triggering it? As 223306 would be
>
> The kernel driver stops accepting commands from a process if it
> caused
> a hang unless the process recreates its context. I'm not really sure
> what's going on here based on the limited context, but I suspect the
> game causes a GPU hang so the recording process stopped because of
> that.

On the ring timeout front: I saw that dxvk had at least one issue with RDNA4 and ring timeouts.
https://github.com/doitsujin/dxvk/issues/4756

So I switched from GloriousEggroll's build to Proton Experimental from Valve. I have not seen any more ring timeout bugs since.

Which made me wonder why the context shows a steam binary as the owner of the context and not the wine/game process underneath, and whether this could be improved.

hth

    darix

-- 
Always remember:
  Never accept the world as it appears to be.
    Dare to see it for what it could be.
      The world can always use more heroes.