Re: [Mesa-dev] [RFC] Linux Graphics Next: Explicit fences everywhere and no BO fences - initial proposal

Christian König Tue, 20 Apr 2021 03:15:27 -0700

On 19.04.21 at 17:48, Jason Ekstrand wrote:

Not going to comment on everything on the first pass...


On Mon, Apr 19, 2021 at 5:48 AM Marek Olšák <mar...@gmail.com> wrote:

Hi,

This is our initial proposal for explicit fences everywhere and new memory 
management that doesn't use BO fences. It's a redesign of how Linux graphics 
drivers work, and it can coexist with what we have now.


1. Introduction
(skip this if you are already sold on explicit fences)

The current Linux graphics architecture was initially designed for GPUs with 
only one graphics queue where everything was executed in the submission order 
and per-BO fences were used for memory management and CPU-GPU synchronization, 
not GPU-GPU synchronization. Later, multiple queues were added on top, which 
required the introduction of implicit GPU-GPU synchronization between queues of 
different processes using per-BO fences. Recently, even parallel execution 
within one queue was enabled where a command buffer starts draws and compute 
shaders, but doesn't wait for them, enabling parallelism between back-to-back 
command buffers. Modesetting also uses per-BO fences for scheduling flips. Our 
GPU scheduler was created to enable all those use cases, and it's the only 
reason why the scheduler exists.

The GPU scheduler, implicit synchronization, BO-fence-based memory management, 
and the tracking of per-BO fences increase CPU overhead and latency, and reduce 
parallelism. There is a desire to replace all of them with something much 
simpler. Below is how we could do it.


2. Explicit synchronization for window systems and modesetting

The producer is an application and the consumer is a compositor or a 
modesetting driver.

2.1. The Present request

As part of the Present request, the producer will pass 2 fences (sync objects) 
to the consumer alongside the presented DMABUF BO:
- The submit fence: Initially unsignalled, it will be signalled when the 
producer has finished drawing into the presented buffer.
- The return fence: Initially unsignalled, it will be signalled when the 
consumer has finished using the presented buffer.
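
To make the two-fence handshake concrete, here is a minimal sketch in C. It is only a model: struct fence, struct present_request and the helper names are hypothetical and are not an existing kernel or Wayland/X11 interface.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of a fence: starts unsignalled, is signalled once. */
struct fence {
    bool signalled;
};

/* Hypothetical Present request: the presented buffer plus the two fences. */
struct present_request {
    int dmabuf_fd;              /* the presented DMABUF BO */
    struct fence *submit_fence; /* signalled when the producer is done drawing */
    struct fence *return_fence; /* signalled when the consumer is done reading */
};

void fence_signal(struct fence *f)
{
    assert(f != NULL);
    f->signalled = true;
}

/* The consumer may read the buffer only after the submit fence signals. */
bool consumer_may_read(const struct present_request *req)
{
    return req->submit_fence->signalled;
}

/* The producer may reuse the buffer only after the return fence signals. */
bool producer_may_reuse(const struct present_request *req)
{
    return req->return_fence->signalled;
}
```

Both fences travel with the request, so neither side needs per-BO state to know when the buffer changes hands.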

I'm not sure syncobj is what we want.  In the Intel world we're trying
to go even further to something we're calling "userspace fences" which
are a timeline implemented as a single 64-bit value in some
CPU-mappable BO.  The client writes a higher value into the BO to
signal the timeline.

Well, that is exactly what our Windows guys have suggested as well, but
it strongly looks like this isn't sufficient.

First of all, you run into security problems when any application can
just write any value to that memory location. Just imagine an
application sets the counter to zero and X waits forever for some
rendering to finish.

In addition to that, in such a model you can't determine which queue is
the guilty one in case of a hang, and you can't reset the
synchronization primitives in case of an error.

Apart from that, this is rather inefficient, e.g. we don't have any way
to prevent priority inversion when it is used as a synchronization
mechanism between different GPU queues.
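
A rough sketch of the userspace timeline described above, and of the rewind problem, could look like this in C. The struct and function names are made up for illustration, and a real implementation on shared memory would need atomic accesses and barriers, which are left out here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical userspace timeline: a single 64-bit counter that the client
 * can write directly (in practice it would live in a CPU-mappable BO). */
struct user_timeline {
    uint64_t value;
};

/* A wait point is reached once the counter is at or past the target. */
bool timeline_reached(const struct user_timeline *t, uint64_t target)
{
    assert(t != NULL);
    return t->value >= target;
}

/* Signalling is a plain store. Nothing here stops a client from storing a
 * lower value than before, which is the security concern raised above. */
void timeline_write(struct user_timeline *t, uint64_t v)
{
    assert(t != NULL);
    t->value = v;
}
```

If a client writes 5 and then writes 0, a waiter on point 5 that had not yet checked the counter never sees it reached.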


Christian.

The kernel then provides some helpers for
waiting on them reliably and without spinning.  I don't expect
everyone to support these right away, but if we're going to re-plumb
userspace for explicit synchronization, I'd like to make sure we take
this into account so we only have to do it once.

Deadlock mitigation to recover from segfaults:
- The kernel knows which process is obliged to signal which fence. This 
information is part of the Present request and supplied by userspace.

This isn't clear to me.  Yes, if we're using anything dma-fence based
like syncobj, this is true.  But it doesn't seem totally true as a
general statement.

- If the producer crashes, the kernel signals the submit fence, so that the 
consumer can make forward progress.
- If the consumer crashes, the kernel signals the return fence, so that the 
producer can reclaim the buffer.
- A GPU hang signals all fences. Other deadlocks will be handled like GPU hangs.
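
The crash rules above could be modelled as follows. This is a sketch under the assumption that each fence records the process obliged to signal it; struct owned_fence and the function name are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical fence that remembers which process must signal it. */
struct owned_fence {
    int signaller_pid; /* supplied by userspace with the Present request */
    bool signalled;
};

/* Force-signal every pending fence owed by the process that died.
 * Returns how many fences were force-signalled. */
size_t signal_fences_of_dead_process(struct owned_fence *fences, size_t n,
                                     int dead_pid)
{
    size_t forced = 0;

    assert(fences != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        if (fences[i].signaller_pid == dead_pid && !fences[i].signalled) {
            fences[i].signalled = true;
            forced++;
        }
    }
    return forced;
}
```

With a producer at pid 100 and a consumer at pid 200, a producer crash signals only the submit fence; the return fence stays pending until the consumer is done or dies.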

What do you mean by "all"?  All fences that were supposed to be
signaled by the hung context?

Other window system requests can follow the same idea.

Merged fences where one fence object contains multiple fences will be 
supported. A merged fence is signalled only when all of its fences are 
signalled. The consumer will have the option to redefine the unsignalled 
return fence to a merged fence.
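
The merged-fence rule is simple enough to sketch in a few lines of C. The types are hypothetical, and treating an empty merged fence as signalled is an assumption of this sketch, not something the proposal states.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct fence {
    bool signalled;
};

/* Hypothetical merged fence: a list of fences treated as one. */
struct merged_fence {
    struct fence **fences;
    size_t count;
};

/* Signalled only when every contained fence is signalled.
 * Assumption: an empty merged fence counts as signalled. */
bool merged_fence_signalled(const struct merged_fence *m)
{
    assert(m != NULL);
    for (size_t i = 0; i < m->count; i++) {
        if (!m->fences[i]->signalled)
            return false;
    }
    return true;
}
```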

2.2. Modesetting

Since a modesetting driver can also be the consumer, the present ioctl will 
contain a submit fence and a return fence too. One small problem with this is 
that userspace can hang the modesetting driver, but in theory, any later 
present ioctl can override the previous one, so the unsignalled presentation is 
never used.


3. New memory management

The per-BO fences will be removed and the kernel will not know which buffers 
are busy. This will reduce CPU overhead and latency. The kernel will not need 
per-BO fences with explicit synchronization, so we just need to remove their 
last user: buffer evictions. It also resolves the current OOM deadlock.

Is this even really possible?  I'm no kernel MM expert (trying to
learn some) but my understanding is that the use of per-BO dma-fence
runs deep.  I would like to stop using it for implicit synchronization
to be sure, but I'm not sure I believe the claim that we can get rid
of it entirely.  Happy to see someone try, though.

3.1. Evictions

If the kernel wants to move a buffer, it will have to wait for everything to go 
idle, halt all userspace command submissions, move the buffer, and resume 
everything. This is not expected to happen when memory is not exhausted. Other 
more efficient ways of synchronization are also possible (e.g. sync only one 
process), but are not discussed here.

3.2. Per-process VRAM usage quota

Each process can optionally and periodically query its VRAM usage quota and 
change domains of its buffers to obey that quota. For example, a process 
allocated 2 GB of buffers in VRAM, but the kernel decreased the quota to 1 GB. 
The process can change the domains of the least important buffers to GTT to get 
the best outcome for itself. If the process doesn't do it, the kernel will 
choose which buffers to evict at random. (thanks to Christian Koenig for this 
idea)
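
One way a process might react to a reduced quota is sketched below in C. The importance field, the domain enum and obey_quota() are all hypothetical; the proposal does not say how a process ranks its buffers, so the greedy least-important-first policy is just one possible choice.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

enum domain { DOMAIN_VRAM, DOMAIN_GTT };

/* Hypothetical per-buffer bookkeeping kept by the process. */
struct buffer {
    uint64_t size;   /* in any consistent unit, e.g. MiB */
    int importance;  /* higher means more important to keep in VRAM */
    enum domain domain;
};

static int by_importance(const void *a, const void *b)
{
    const struct buffer *x = a, *y = b;
    return (x->importance > y->importance) - (x->importance < y->importance);
}

uint64_t vram_usage(const struct buffer *bufs, size_t n)
{
    uint64_t used = 0;
    for (size_t i = 0; i < n; i++)
        if (bufs[i].domain == DOMAIN_VRAM)
            used += bufs[i].size;
    return used;
}

/* Sort by ascending importance, then move the least important VRAM
 * buffers to GTT until usage fits within the quota. */
void obey_quota(struct buffer *bufs, size_t n, uint64_t quota)
{
    assert(bufs != NULL || n == 0);
    qsort(bufs, n, sizeof(*bufs), by_importance);

    uint64_t used = vram_usage(bufs, n);
    for (size_t i = 0; i < n && used > quota; i++) {
        if (bufs[i].domain == DOMAIN_VRAM) {
            bufs[i].domain = DOMAIN_GTT;
            used -= bufs[i].size;
        }
    }
}
```

The point of doing this in userspace is that the process, unlike the kernel, knows which buffers it can best afford to lose.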

This is going to be difficult.  On Intel, we have some resources that
have to be pinned to VRAM and can't be dynamically swapped out by the
kernel.  In GL, we probably can deal with it somewhat dynamically.  In
Vulkan, we'll be entirely dependent on the application to use the
appropriate Vulkan memory budget APIs.

--Jason

3.3. Buffer destruction without per-BO fences

When the buffer destroy ioctl is called, an optional fence list can be passed 
to the kernel to indicate when it's safe to deallocate the buffer. If the fence 
list is empty, the buffer will be deallocated immediately. Shared buffers will 
be handled by merging fence lists from all processes that destroy them. 
Mitigation of malicious behavior:
- If userspace destroys a busy buffer, it will get a GPU page fault.
- If userspace sends fences that never signal, the kernel will have a timeout 
period and then will proceed to deallocate the buffer anyway.
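
The destroy rules above could be checked along these lines. This is a sketch with hypothetical names; the deadline field stands in for the kernel timeout, whose length and clock the proposal does not specify.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

struct fence {
    bool signalled;
};

/* Hypothetical record for a buffer whose destroy ioctl has been called. */
struct pending_destroy {
    struct fence **fences; /* optional fence list passed by userspace */
    size_t count;          /* 0 means: deallocate immediately */
    uint64_t deadline;     /* after this time the kernel frees anyway */
};

/* True when the buffer may be deallocated at time `now`. */
bool can_deallocate(const struct pending_destroy *d, uint64_t now)
{
    assert(d != NULL);
    if (now >= d->deadline)
        return true; /* fences never signalled: timeout path */

    for (size_t i = 0; i < d->count; i++) {
        if (!d->fences[i]->signalled)
            return false;
    }
    return true; /* empty list or all fences signalled */
}
```

For a shared buffer, the fence lists from all destroying processes would be concatenated into one such record before the check is made.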

3.4. Other notes on MM

Overcommitment of GPU-accessible memory will cause an allocation failure or 
invoke the OOM killer. Evictions to GPU-inaccessible memory might not be 
supported.

Kernel drivers could move to this new memory management today. Only buffer 
residency and evictions would stop using per-BO fences.


4. Deprecating implicit synchronization

It can be phased out by introducing a new generation of hardware where the 
driver doesn't add support for it (like a driver fork would do), assuming 
userspace has all the changes for explicit synchronization. This could 
potentially create an isolated part of the kernel DRM where all drivers only 
support explicit synchronization.

Marek
_______________________________________________
dri-devel mailing list
dri-de...@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/dri-devel

_______________________________________________
mesa-dev mailing list
mesa-dev@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/mesa-dev

