On Fri, 11 Apr 2025 16:39:02 +0200 Boris Brezillon <boris.brezil...@collabora.com> wrote:
> On Fri, 11 Apr 2025 15:13:26 +0200
> Christian König <christian.koe...@amd.com> wrote:
>
> >
> > >> Background is that you don't get a crash, nor error message, nor
> > >> anything indicating what is happening.
> > > The job times out at some point, but we might get stuck in the fault
> > > handler waiting for memory, which is pretty close to a deadlock, I
> > > suspect.
> >
> > I don't know those drivers that well, but at least for amdgpu the
> > problem would be that the timeout handling would need to grab some of
> > the locks the memory management is holding waiting for the timeout
> > handling to do something....
> >
> > So that basically perfectly closes the circle. With a bit of lock you
> > get a message after some time that the kernel is stuck, but since
> > that are all sleeping locks I strongly doubt so.
> >
> > As immediately action please provide patches which changes those
> > GFP_KERNEL into GFP_NOWAIT.
>
> Sure, I can do that.

Hm, I might have been too quick to claim this was doable. In practice,
doing that might regress Lima and Panfrost in situations where trying
harder than GFP_NOWAIT would free up some memory. I'm not saying it was
right to use GFP_KERNEL in the first place, but some expectations were
set by this original mistake. So I'll probably need the Lima developers
to vouch for this change after they've done some testing on a system
under high memory pressure, and I'd need to do the same kind of testing
for Panfrost and ask Steve if he's okay with that too.

For Panthor, I'm less worried, because we have the incremental rendering
fallback. Assuming GFP_NOWAIT tries hard enough to reclaim low-hanging
fruit, performance shouldn't suffer much more than it does today, where
GFP_KERNEL allocations can delay tiling operations longer than a
primitive flush would have.
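
To make the trade-off above concrete, here is a minimal userspace sketch
of the behaviour being discussed. It is not kernel code and not taken
from any of the drivers: every name in it (model_pool, model_alloc_page,
model_grow_heap, the MODEL_GFP_* values) is hypothetical. It also
simplifies heavily by assuming the non-blocking mode never reclaims
anything, whereas the real GFP_NOWAIT still wakes background reclaim.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for the two allocation modes under discussion. */
enum model_gfp { MODEL_GFP_NOWAIT, MODEL_GFP_KERNEL };

/* Toy memory pool: pages that are free right now, and pages that only
 * become free if the caller is willing to block on reclaim. */
struct model_pool {
	size_t free_pages;
	size_t reclaimable_pages;
};

/* Returns true if a page was handed out. */
static bool model_alloc_page(struct model_pool *p, enum model_gfp gfp)
{
	if (p->free_pages == 0 && gfp == MODEL_GFP_KERNEL &&
	    p->reclaimable_pages > 0) {
		/* Blocking reclaim. In a real fault handler this is the
		 * step that may sleep on locks the timeout handling needs. */
		p->reclaimable_pages--;
		p->free_pages++;
	}

	if (p->free_pages == 0)
		return false;

	p->free_pages--;
	return true;
}

enum grow_result { HEAP_GROWN, HEAP_FLUSH_FALLBACK, HEAP_JOB_FAILED };

/* Grow a tiler heap without blocking. A driver with an incremental
 * rendering fallback can flush and carry on; one without it has no
 * way to recover and the job fails. */
static enum grow_result model_grow_heap(struct model_pool *p,
					bool has_incremental_rendering)
{
	if (model_alloc_page(p, MODEL_GFP_NOWAIT))
		return HEAP_GROWN;

	return has_incremental_rendering ? HEAP_FLUSH_FALLBACK
					 : HEAP_JOB_FAILED;
}
```

With no free pages and one reclaimable page, the non-blocking path fails
outright unless the fallback exists, while the blocking mode would have
succeeded. That is the regression risk for drivers without a fallback.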