Hi Iago,

On Mon, 28 Apr 2025 08:55:07 +0200
Iago Toral <ito...@igalia.com> wrote:

> Hi,
> 
> Pitching in to describe the situation for v3d:

Thanks for chiming in.

> 
> El vie, 18-04-2025 a las 14:25 +0200, Boris Brezillon escribió:
> 
> (...)
> > +For all these reasons, the tiler usually allocates memory
> > dynamically, but
> > +DRM has not been designed with this use case in mind. Drivers will
> > address
> > +these problems differently based on the functionality provided by
> > their
> > +hardware, but all of them almost certainly have to deal with this
> > somehow.
> > +
> > +The easy solution is to statically allocate a huge buffer to pick
> > from when
> > +tiler memory is needed, and fail the rendering when this buffer is
> > depleted.
> > +Some drivers try to be smarter to avoid reserving a lot of memory
> > upfront.
> > +Instead, they start with an almost empty buffer and progressively
> > populate it
> > +when the GPU faults on an address sitting in the tiler buffer range.
> > This
> > +works okay most of the time but it falls short when the system is
> > under
> > +memory pressure, because the memory request is not guaranteed to be
> > satisfied.
> > +In that case, the driver either fails the rendering, or, if the
> > hardware
> > +allows it, it tries to flush the primitives that have been processed
> > and
> > +triggers a fragment job that will consume those primitives and free
> > up some
> > +memory to be recycled and make further progress on the tiling step.
> > This is
> > +usually referred to as partial/incremental rendering (it might
> > have other names).  
> 
> In our case, user space allocates some memory up front hoping to avoid
> running out of memory during tiling, but if the tiler does run out of
> memory we get an interrupt and the tiler hw will stop and wait for the
> kernel driver to write back an address where more memory is made
> available (via register write), which we will try to allocate at that
> point. This can happen any number of times until the tiler job
> completes.

Sounds very much like how the new Mali-CSF hardware works, except
Mali-CSF also has a fallback for when the allocation can't be satisfied.
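For illustration, the interrupt-driven flow you describe could be sketched roughly like this. All names, the chunk size, and the heap cap are invented for the example; this is not the actual v3d code, and a real driver would hand out GPU pages via a fallible (GFP_NOWAIT-style) allocation rather than a counter:

```c
#include <stdbool.h>
#include <stddef.h>

#define CHUNK_SIZE (512 * 1024)

/* Hypothetical per-job tiler heap state. */
struct tiler_heap {
	size_t total;	/* memory handed to the tiler so far */
	size_t limit;	/* cap to avoid unbounded growth */
};

/* Stand-in for a fallible page allocation: in a real driver this
 * would be a GFP_NOWAIT allocation that may fail under memory
 * pressure; here we just simulate failure past a cap. */
static bool alloc_chunk(struct tiler_heap *heap, size_t *out_addr)
{
	if (heap->total + CHUNK_SIZE > heap->limit)
		return false;
	*out_addr = heap->total;	/* fake GPU address */
	heap->total += CHUNK_SIZE;
	return true;
}

/* Called from the "tiler needs memory" interrupt. Returns true if
 * the hardware can be resumed by writing the new chunk's address to
 * the register the tiler is waiting on; false means the driver must
 * cancel the job cleanly (or, on hardware that supports it, trigger
 * incremental rendering) instead of leaving the tiler stalled. */
static bool handle_tiler_oom(struct tiler_heap *heap)
{
	size_t addr;

	if (alloc_chunk(heap, &addr))
		return true;	/* write 'addr' back, tiler resumes */
	return false;
}
```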

> 
> I am not sure that we are handling allocation failure on this path
> nicely at the moment, since we don't try to fail and cancel the job.
> That's maybe something we should fix, although I don't personally
> recall any reports of us running into this situation either.

Yeah, I'd say you're pretty much in the same place Panfrost/Panthor are
at the moment: we're not playing by the dma_fence rules, but no user
has complained so far. BTW, that doesn't necessarily mean the problem
doesn't occur, just that it hasn't been identified as a KMD issue
:-).

> 
> 
> > +
> > +Compute based emulation of geometry stages
> > +------------------------------------------
> > +
> > +More and more hardware vendors don't bother providing hardware
> > support for
> > geometry/tessellation/mesh stages, since those can be emulated with
> > compute
> > +shaders. But the same problem we have with tiler memory exists with
> > those
> > +intermediate compute-emulated stages, because transient data shared
> > between
> > +stages need to be stored in memory for the next stage to consume,
> > and this
> > +bubbles up until the tiling stage is reached, because ultimately,
> > what the
> > +tiling stage will need to process is a set of vertices it can turn
> > into
> > +primitives, as would happen if the application had emulated the
> > geometry,
> > +tessellation or mesh stages with compute.
> > +
> > +Unlike tiling, where the hardware can provide a fallback to recycle
> > memory,
> > +there is no way the intermediate primitives can be flushed up to the
> > framebuffer,
> > +because it's a purely software emulation here. This being said, the
> > same
> > +"start small, grow on-demand" approach can be applied to avoid
> > over-allocating memory
> > +upfront.  
> 
> FWIW, v3d has geometry and tessellation hardware.

Yep, Alyssa mentioned that. I'll change this section to specifically
mention Arm/Mali as being the outlier here.

> 
> 
> > +
> > +On-demand memory allocation
> > +---------------------------
> > +
> > +As explained in previous sections, on-demand allocation is a central
> > piece
> > +of tile-based rendering if we don't want to over-allocate, which is
> > bad for
> > +integrated GPUs, which share their memory with the rest of the
> > system.
> > +
> > +The problem with on-demand allocation is that suddenly, GPU accesses
> > can
> > +fail on OOM, and the DRM components (drm_gpu_scheduler and drm_gem
> > mostly)
> > +were not designed for that. They assume that buffer memory is
> > +populated at job submission time, and will stay around for the job
> > lifetime.
> > +If a GPU fault happens, it's the user's fault, and the context can be
> > flagged
> > +unusable. On-demand allocation is usually implemented as allocation-
> > on-fault,
> > +and the dma_fence contract prevents us from blocking on allocations
> > in that
> > +path (GPU fault handlers are in the dma-fence signalling path).  
> 
> As I described above, v3d is not quite an allocation-on-fault mechanism
> but rather, we get a dedicated interrupt from the hw when it needs more
> memory, which I believe happens a bit before it completely runs out of
> memory actually. Maybe that changes the picture since we don't exactly
> use a fault handler?

Not really. Any mechanism relying on on-demand allocation in the
dma_fence signalling path is problematic. The fact it's based on a
fault handler might add extra problems on top, but both designs violate
the dma_fence contract, which states that only fallible allocations
(ones that can't block on memory reclaim) may happen in the dma_fence
signalling path, that is, between the moment the job was queued to the
drm_sched_entity and the moment the job fence is signalled.
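To make the constraint concrete, here is a minimal sketch of the decision tree the contract leaves an OOM/fault handler. The enum and function names are invented for illustration; the key point is that whatever feeds 'alloc_ok' must be a fallible allocation (GFP_NOWAIT-style), since blocking on reclaim in this path can deadlock:

```c
#include <stdbool.h>

/* Possible outcomes of an OOM event hit in the fence signalling path. */
enum oom_action {
	OOM_RESUME,			/* new chunk handed to the hw */
	OOM_INCREMENTAL_RENDER,		/* flush primitives, recycle memory */
	OOM_FAIL_JOB,			/* signal the fence with an error */
};

/* alloc_ok: did the fallible (non-reclaiming) allocation succeed?
 * hw_can_flush: can the hardware do partial/incremental rendering? */
static enum oom_action signalling_path_oom(bool alloc_ok, bool hw_can_flush)
{
	if (alloc_ok)
		return OOM_RESUME;
	if (hw_can_flush)
		return OOM_INCREMENTAL_RENDER;
	return OOM_FAIL_JOB;
}
```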

Given the description you made, I think we can add v3d to the list of
problematic drivers :-(.

Regards,

Boris
