Hi Iago,

On Mon, 28 Apr 2025 08:55:07 +0200
Iago Toral <ito...@igalia.com> wrote:
> Hi,
>
> Pitching in to describe the situation for v3d:

Thanks for chiming in.

> On Fri, 18-04-2025 at 14:25 +0200, Boris Brezillon wrote:
> > (...)
> > +For all these reasons, the tiler usually allocates memory
> > +dynamically, but DRM has not been designed with this use case in
> > +mind. Drivers address these problems differently depending on the
> > +functionality provided by their hardware, but all of them almost
> > +certainly have to deal with this somehow.
> > +
> > +The easy solution is to statically allocate a huge buffer to pick
> > +from when tiler memory is needed, and fail the rendering when this
> > +buffer is depleted. Some drivers try to be smarter to avoid
> > +reserving a lot of memory upfront. Instead, they start with an
> > +almost empty buffer and progressively populate it when the GPU
> > +faults on an address sitting in the tiler buffer range. This works
> > +okay most of the time, but it falls short when the system is under
> > +memory pressure, because the memory request is not guaranteed to
> > +be satisfied. In that case, the driver either fails the rendering
> > +or, if the hardware allows it, tries to flush the primitives that
> > +have been processed and triggers a fragment job that will consume
> > +those primitives and free up some memory to be recycled, allowing
> > +further progress on the tiling step. This is usually referred to
> > +as partial/incremental rendering (it may have other names).

> In our case, user space allocates some memory up front hoping to
> avoid running out of memory during tiling, but if the tiler does run
> out of memory we get an interrupt, and the tiler hw will stop and
> wait for the kernel driver to write back (via a register write) an
> address where more memory is made available, which we will try to
> allocate at that point.
> This can happen any number of times until the tiler job completes.

Sounds very much like how the new Mali-CSF hardware works, except
Mali-CSF also has a fallback for when the allocation can't be
satisfied.

> I am not sure that we are handling allocation failure on this path
> nicely at the moment, since we don't try to fail and cancel the job.
> That's maybe something we should fix, although I don't personally
> recall any reports of us running into this situation either.

Yeah, I'd say you're pretty much in the same place Panfrost/Panthor
are at the moment: we're not playing by the dma_fence rules, but no
user has complained so far. BTW, that doesn't necessarily mean the
problem doesn't occur, just that it's not been identified as a KMD
issue :-).

> > +
> > +Compute-based emulation of geometry stages
> > +------------------------------------------
> > +
> > +More and more hardware vendors don't bother providing hardware
> > +support for geometry/tessellation/mesh stages, since those can be
> > +emulated with compute shaders. But the same problem we have with
> > +tiler memory exists for those intermediate compute-emulated
> > +stages, because transient data shared between stages needs to be
> > +stored in memory for the next stage to consume, and this bubbles
> > +up until the tiling stage is reached, because ultimately, what the
> > +tiling stage needs to process is a set of vertices it can turn
> > +into primitives, as would happen if the application had emulated
> > +the geometry, tessellation or mesh stages with compute.
> > +
> > +Unlike tiling, where the hardware can provide a fallback to
> > +recycle memory, there is no way the intermediate primitives can be
> > +flushed up to the framebuffer, because this is a purely software
> > +emulation. That being said, the same "start small, grow on-demand"
> > +approach can be applied to avoid over-allocating memory upfront.

> FWIW, v3d has geometry and tessellation hardware.
Yep, Alyssa mentioned that. I'll change this section to specifically
mention Arm/Mali as being the outlier here.

> > +
> > +On-demand memory allocation
> > +---------------------------
> > +
> > +As explained in the previous sections, on-demand allocation is a
> > +central piece of tile-based renderers if we don't want to
> > +over-allocate, which is bad for integrated GPUs that share their
> > +memory with the rest of the system.
> > +
> > +The problem with on-demand allocation is that, suddenly, GPU
> > +accesses can fail on OOM, and the DRM components
> > +(drm_gpu_scheduler and drm_gem mostly) were not designed for that.
> > +Those assume that buffer memory is populated at job submission
> > +time and will stay around for the job's lifetime. If a GPU fault
> > +happens, it's the user's fault, and the context can be flagged
> > +unusable. On-demand allocation is usually implemented as
> > +allocation-on-fault, and the dma_fence contract prevents us from
> > +blocking on allocations in that path (GPU fault handlers are in
> > +the dma-fence signalling path).

> As I described above, v3d is not quite an allocation-on-fault
> mechanism but rather, we get a dedicated interrupt from the hw when
> it needs more memory, which I believe actually happens a bit before
> it completely runs out of memory. Maybe that changes the picture
> since we don't exactly use a fault handler?

Not really. Any mechanism relying on on-demand allocation in the
dma_fence signalling path is problematic. The fact it's based on a
fault handler might add extra problems on top, but both designs
violate the dma_fence contract, which states that no allocation that
isn't allowed to fail should be done in the dma_fence signalling path
(that is, between the moment the job is queued to the
drm_sched_entity and the moment the job fence is signalled). Given
the description you made, I think we can add v3d to the list of
problematic drivers :-(.

Regards,

Boris