Hi Steve,

On Wed, 23 Apr 2025 10:41:53 +0100
Steven Price <steven.pr...@arm.com> wrote:

> On 18/04/2025 13:25, Boris Brezillon wrote:
> > Tile-based GPUs come with a set of constraints that are not present
> > when immediate rendering is used. This new document tries to explain
> > the differences between tile/immediate rendering, the problems that
> > come with tilers, and how we plan to address them.
> > 
> > This is just a started point, this document will be updated with new
> > materials as we refine the libraries we add to help deal with
> > tilers, and have more drivers converted to follow the rules listed
> > here.
> > 
> > Signed-off-by: Boris Brezillon <boris.brezil...@collabora.com>  
> 
> Seems like a good starting point, a few minor comments below. We really
> need some non-Mali input too though.

Totally agree with that, my view on this problem is certainly biased.

> 
> > ---
> >  Documentation/gpu/drm-tile-based-renderer.rst | 201 ++++++++++++++++++
> >  Documentation/gpu/index.rst                   |   1 +
> >  2 files changed, 202 insertions(+)
> >  create mode 100644 Documentation/gpu/drm-tile-based-renderer.rst
> > 
> > diff --git a/Documentation/gpu/drm-tile-based-renderer.rst 
> > b/Documentation/gpu/drm-tile-based-renderer.rst
> > new file mode 100644
> > index 000000000000..19b56b9476fc
> > --- /dev/null
> > +++ b/Documentation/gpu/drm-tile-based-renderer.rst
> > @@ -0,0 +1,201 @@
> > +==================================================
> > +Infrastructure and tricks for tile-based renderers
> > +==================================================
> > +
> > +All lot of embedded GPUs are using tile-based rendering instead of 
> > immediate
> > +rendering. This mode of rendering has various implications that we try to
> > +document here along with some hints about how to deal with some of the
> > +problems that surface with tile-based renderers.
> > +
> > +The main idea behind tile-based rendering is to batch processing of nearby
> > +pixels during the fragment shading phase to limit the traffic on the memory
> > +bus by making optimal use of the various caches present in the GPU. Unlike
> > +immediate rendering, where primitives generated by the geometry stages of
> > +the pipeline are directly consumed by the fragment stage, tilers have to
> > +record primitives in bins that are somehow attached to tiles (the
> > +granularity of the tile being GPU-specific). This data is usually stored
> > +in memory, and pulled back when the fragment stage is executed.
> > +
> > +This approach has several issues that most drivers need to handle somehow,
> > +sometimes with a bit of help from the hardware.
> > +
> > +Issues at hand
> > +==============
> > +
> > +Tiler memory
> > +------------
> > +
> > +The amount of memory needed to store primitives data and metadata is hard
> > +to guess ahead of time, because it depends on various parameters that are
> > +not in control of the UMD (UserMode Driver). Here is a non-exhaustive list
> > +of things that may complicate the calculation of the memory needed to store
> > +primitive information:
> > +
> > +- Primitives distribution across tiles is hard to guess: the binning 
> > process
> > +  is about assigning each primitive to the set tiles it covers. The more 
> > tiles
> > +  being covered the more memory is needed to record those. We can estimate
> > +  the worst case scenario by assuming all primitives will cover all tiles 
> > but
> > +  this will lead to over-allocation most of the time, which is not good
> > +- Indirect draws: the number of vertices comes from a GPU buffer that might
> > +  be filled by previous GPU compute jobs. This means we only know the 
> > number
> > +  of vertices when the GPU executes the draw, and thus can't guess how much
> > +  memory will be needed for those and allocate a GPU buffer that's big 
> > enough
> > +  to hold those
> > +- Complex geometry pipelines: if you throw geometry/tesselation/mesh 
> > shaders
> > +  it gets even trickier to guess the number of primitives from the number
> > +  of vertices passed to the vertex shader.
> > +
> > +For all these reasons, the tiler usually allocates memory dynamically, but
> > +DRM has not been designed with this use case in mind. Drivers will address
> > +these problems differently based on the functionality provided by their
> > +hardware, but all of them almost certainly have to deal with this somehow.
> > +
> > +The easy solution is to statically allocate a huge buffer to pick from when
> > +tiler memory is needed, and fail the rendering when this buffer is 
> > depleted.
> > +Some drivers try to be smarter to avoid reserving a lot of memory upfront.
> > +Instead, they start with an almost empty buffer and progressively populate 
> > it
> > +when the GPU faults on an address sitting in the tiler buffer range. This
> > +works okay most of the time but it falls short when the system is under
> > +memory pressure, because the memory request is not guaranteed to be 
> > satisfied.
> > +In that case, the driver either fails the rendering, or, if the hardware
> > +allows it, it tries to flush the primitives that have been processed and
> > +triggers a fragment job that will consume those primitives and free up some
> > +memory to be recycled and make further progress on the tiling step. This is
> > +usually referred as partial/incremental rendering (it might have other 
> > names).
> > +
> > +Compute based emulation of geometry stages
> > +------------------------------------------
> > +
> > +More and more hardware vendors don't bother providing hardware support for
> > +geometry/tesselation/mesh stages, since those can be emulated with compute
> > +shaders. But the same problem we have with tiler memory exists with those
> > +intermediate compute-emulated stages, because transient data shared between
> > +stages need to be stored in memory for the next stage to consume, and this
> > +bubbles up until the tiling stage is reached, because ultimately, what the
> > +tiling stage will need to process is a set of vertices it can turn into
> > +primitives, like would happen if the application had emulated the geometry,
> > +tesselation or mesh stages with compute.
> > +
> > +Unlike tiling, where the hardware can provide a fallback to recycle memory,
> > +there is no way the intermediate primitives can be flushed up to the 
> > framebuffer,
> > +because it's a purely software emulation here. This being said, the same
> > +"start small, grow on-demand" can be applied to avoid over-allocating 
> > memory
> > +upfront.
> > +
> > +On-demand memory allocation
> > +---------------------------
> > +
> > +As explained in previous sections, on-demand allocation is a central piece
> > +of tile-based renderer if we don't want to over-allocate, which is bad for
> > +integrated GPUs who share their memory with the rest of the system.
> > +
> > +The problem with on-demand allocation is that suddenly, GPU accesses can
> > +fail on OOM, and the DRM components (drm_gpu_scheduler and drm_gem mostly)
> > +were not designed for that. Those are assuming that buffers memory is  
> 
> NIT: s/buffers/buffer's/
> 
> > +populated at job submission time, and will stay around for the job 
> > lifetime.
> > +If a GPU fault happens, it's the user fault, and the context can be 
> > flagged  
> 
> NIT: s/user/user's/
> 
> > +unusable. On-demand allocation is usually implemented as 
> > allocation-on-fault,
> > +and the dma_fence contract prevents us from blocking on allocations in that
> > +path (GPU fault handlers are in the dma-fence signalling path). So now we
> > +have GPU allocations that will be satisfied most of the time, but can fail
> > +occasionally. And this is not great, because an allocation failure might
> > +kill the user GPU context (VK_DEVICE_LOST in Vulkan terms), without the
> > +application having dong anything wrong. So, we need something that makes 
> > those
> > +allocation failures rare enough that most users won't experience them, and
> > +we need a fallback for when this happens to try to avoid them on the next
> > +user attempt to submit a graphics job.
> > +
> > +The plan
> > +========
> > +
> > +On-demand allocation rules
> > +--------------------------
> > +
> > +First of all, all allocations happening in the fault handler path must
> > +be using GFP_NOWAIT. With this flag, low-hanging fruit can be picked
> > +(clean FS cache will be flushed for instance), but an error will be
> > +returned if no memory is readily available. GFP_NOWAIT will also trigger
> > +background reclaim to hopefully free-up some memory for our future
> > +requests.
> > +
> > +How to deal with allocation failures
> > +------------------------------------
> > +
> > +The first trick here is to try to guess approximately how much memory
> > +will be needed, and force-populate on-demand buffers with that amount
> > +of memory when the job is started. It's not about guessing the worst
> > +case scenario here, but more the most likely case, probably with a
> > +reasonable margin, so that the job is likely to succeed when this amount
> > +of memory is provided by the KMD.
> > +
> > +The second trick to try to avoid over-allocation, even with this
> > +sub-optimistic estimate, is to have a shared pool of memory that can be
> > +used by all GPU contexts when they need tiler/geometry memory. This
> > +implies returning chunks to this pool at some point, so other contexts
> > +can re-use those. Details about what this global memory pool implementation
> > +would look like is currently undefined, but it needs to be filled to
> > +guarantee that pre-allocation requests for on-demand buffers used by a
> > +GPU job can be satisfied in the fault handler path.  
> 
> Note one thing I haven't seen discussed is that across multiple contexts
> it's possible to prioritise jobs that free memory. E.g. a fragment job
> can be run to free up memory from a tiler heap, allowing pages to be
> returned to the global pool. This might imply a uAPI extension allowing
> a fragment job to automatically drop memory from a BO so that the kernel
> can have confidence that it will actually free up memory.
> 
> Sadly I don't think it's plausible to wait in the fault handler for a
> fragment job to complete to free up memory - so the best we can do here
> is postpone *starting* a vertex+tiler job if we're short on memory and
> have fragment jobs to run.

Right, we'll have to do with an internal dma_fence (returned
through drm_sched_ops::prepare_job()) that's controlling access to this
memory pool, so we're sure all currently queued tiler jobs (those
passed to ::run_job()) can have their estimated memory allocation
satisfied. But because it's just an estimate, there's still no guarantee
that the job won't try to allocate more, and thus no guarantee that
the job will always succeed.

> 
> > +
> > +As a last resort, we can try to allocate with GFP_ATOMIC if everything
> > +else fails, but this is a dangerous game, because we would be stealing
> > +memory from the atomic reserve, so it's not entirely clear if this is
> > +better than failing the job at this point.
> > +
> > +Ideas on how to make allocation failures decrease over time
> > +-----------------------------------------------------------
> > +
> > +When an on-demand allocation fails and the hardware doesn't have a
> > +flush-primitives fallback, we usually can't do much apart from failing the
> > +whole job. But it's important to try to avoid future allocation failures
> > +when the application creates a new context. There's no clear path for
> > +how to guess the actual size to force-populate on the next attempt. One
> > +option is to have a simple heuristics, like double the current resident 
> > size,
> > +but this has the downside of potentially taking a few attempts before 
> > reaching
> > +the stability point. Another option is to repeatedly map a dummy page at 
> > the
> > +fault addresses, so we can get a sense of how much memory was needed for 
> > this
> > +particular job.  
> 
> We'd have to double check that we don't cause extra problems with an
> aliasing heap like that. The tiler might attempt to read back data which
> could cause 'interesting' errors if it's getting clobbered.

Yeah, I thought about that too :-(.

> Given this
> is just a heuristic it might be ok, but it definitely needs more research.

I should probably make it clear that these options are based on
speculations about how the HW works, and they might prove impossible to
implement in practice. The reason I have them listed here is so Sima's
suggestions don't get lost in the original thread.

Thanks for reviewing the piece of doc. I'll leave a bit more time for
others to chime in, and post of v2 addressing your comments.

Boris

Reply via email to