Hi Boris,

On 10.04.25 at 17:53, Boris Brezillon wrote:
> Hi Christian,
>
> On Thu, 10 Apr 2025 17:05:07 +0200
> Christian König <christian.koe...@amd.com> wrote:
>
>> Hi Boris,
>>
>> thanks for looping me in. Can you also send the full patch set to me,
>> since I don't see it on the mailing list (yet, maybe).
>>
>> On 10.04.25 at 16:48, Boris Brezillon wrote:
>>> +Christian, Alyssa and Faith, as suggested by Sima. I'll make sure to
>>> Cc you on v4, but before that, I'd like to get your opinion on the
>>> drm-gem/drm-gem-shmem changes to see if sending a v4 is actually
>>> desirable or if I should go back to the drawing board.
>>>
>>> On Fri, 4 Apr 2025 11:26:26 +0200
>>> Boris Brezillon <boris.brezil...@collabora.com> wrote:
>>>
>>>> This patch series is a proposal for implementing sparse page
>>>> allocations for shmem objects. It was initially motivated by a kind
>>>> of BO managed by the Panfrost/Panthor and Lima drivers, the tiler
>>>> heap, which grows on demand every time the GPU faults on a virtual
>>>> address within the heap BO range.
>>
>> Oh, wait a second! GPU faults and DMA fences are usually fundamentally
>> incompatible.
>>
>> So stuff like filling in GEM objects on demand, as you suggest here,
>> is usually seen as illegal. All resources must be pre-pinned before
>> submission.
>
> Unfortunately, that's already how it's done in lima, panfrost and
> panthor.
>
>> Faulting is only legal when you have something like HMM, SVM or
>> whatever you call it. And then you can just use a plain shmem object
>> to provide you with backing pages.
>>
>> I mean, we could in theory allow faulting on GEM objects as well, but
>> we would need to take very strict precautions that we currently don't
>> have, as far as I know.
>
> We only use this mechanism for very specific allocations: tiler memory
> whose maximum size can't be guessed upfront, because tile binning is
> by nature unpredictable.
>
>> So could you explain how this works in the first place?
> I can explain how this works in Panthor, yes. You get an initial
> amount of memory that the tiler can use; when it runs out of memory,
> it will first ask the system for more memory, and if that allocation
> fails, it will fall back to what they call "incremental rendering",
> where the already binned primitives are flushed to the FB in order to
> free memory, and the rendering starts over from there, with the memory
> that has been freed.
>
> In Panthor, this on-demand allocation scheme is something that allows
> us to speed up rendering when there's memory available, but we can
> still make progress when that's not the case, hence the failable
> allocation scheme I'm proposing here.
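The fallback sequence described above can be sketched as a plain userspace C mock-up. Every name here (`try_grow_heap`, `flush_binned_primitives`, `handle_tiler_oom`) is invented for illustration and none of this is the actual panthor/panfrost/lima code; it only models the decision flow: try a failable allocation first, fall back to incremental rendering if the hardware supports it, otherwise fail the job.

```c
#include <stdbool.h>
#include <string.h>

/* Stand-in for a non-blocking allocation succeeding or failing. In the
 * kernel this would be an allocation that is not allowed to sleep,
 * since the fault handler sits in the fence-signalling path. */
static bool system_has_memory = true;

static bool try_grow_heap(void)
{
	return system_has_memory;
}

/* "Incremental rendering": flush the already-binned primitives to the
 * FB so the tiler memory they occupy can be reused, then resume. */
static bool flush_binned_primitives(void)
{
	return true; /* only consumes memory the job already owns */
}

/* Called when the GPU faults on an address inside the tiler heap. */
static const char *handle_tiler_oom(bool has_incremental_rendering)
{
	if (try_grow_heap())
		return "grown";       /* fast path: heap got more memory */
	if (has_incremental_rendering)
		return flush_binned_primitives() ? "incremental" : "fault";
	return "fault";               /* panfrost/lima case: job fails */
}
```

The `has_incremental_rendering` flag models the difference called out below between Panthor (which has the fallback) and Panfrost/Lima (which fail the job with an unhandled GPU fault).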
Puh, that sounds like it can potentially work, but it is also very, very
fragile. You must have a code comment where you select the GFP flags
explaining how and why that works.

> In Panfrost and Lima, we don't have this concept of "incremental
> rendering", so when we fail the allocation, we just fail the GPU job
> with an unhandled GPU fault.

To be honest, I think that this is enough to mark those two drivers as
broken. It's documented that this approach is a no-go for upstream
drivers.

How widely is that used?

> And that's how it is today, the alloc-on-fault mechanism is being used
> in at least 3 drivers, and all I'm trying to do here is standardize it
> and try to document the constraints that come with this model,
> constraints that are currently being ignored. Like the fact that
> allocations in the fault handler path shouldn't block, so we're
> guaranteed to signal the job fence in finite time, and we don't risk a
> deadlock between the driver shrinker and the job triggering the fault.

Well, on one hand it's good that we document the pitfalls, but I clearly
think that we should *not* spread that into common code.

That would give the impression that this actually works, and to be
honest, I should start charging money for rejecting stuff like that. It
would probably make me rich.

> I'm well aware of the implications of what I'm proposing here, but
> ignoring the fact that some drivers already violate the rules doesn't
> make them disappear. So I'd rather work with you and others to clearly
> state what the alloc-in-fault-path rules are, and enforce them in some
> helper functions, than pretend these ugly things don't exist. :D

Yeah, but this kind of tells others that it's ok to do this, which it
clearly isn't, as far as I can see.

What we should do instead is add this to the documentation as a not-ok
approach and move on.

Regards,
Christian.

>
> Regards,
>
> Boris
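For illustration only, the kind of GFP-flag comment asked for above might look like the following kernel-style pseudocode. This is a sketch, not code from any of the drivers discussed; the flag choice and surrounding function are assumptions made to show the shape of the comment, not a definitive implementation:

```c
/*
 * We are in the GPU fault handler, i.e. in the dma_fence signalling
 * path: we must not enter direct reclaim (it could wait on our own
 * shrinker, or on fences, and deadlock), and we must fail in finite
 * time so the job fence is still guaranteed to signal.
 *
 * GFP_NOWAIT never sleeps; __GFP_NOWARN because failure is expected
 * here and handled by falling back to incremental rendering (or, on
 * hardware without that fallback, by failing the job).
 */
page = alloc_page(GFP_NOWAIT | __GFP_NOWARN);
if (!page)
	return -ENOMEM; /* caller triggers the fallback path */
```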