On Sun, 23 Jul 2023 at 01:12, Faith Ekstrand <fa...@gfxstrand.net> wrote:
>
> On Wed, Jul 19, 2023 at 7:15 PM Danilo Krummrich <d...@redhat.com> wrote:
>>
>> This commit provides the implementation for the new uapi motivated by the
>> Vulkan API. It allows user mode drivers (UMDs) to:
>>
>> 1) Initialize a GPU virtual address (VA) space via the new
>>    DRM_IOCTL_NOUVEAU_VM_INIT ioctl for UMDs to specify the portion of VA
>>    space managed by the kernel and userspace, respectively.
>>
>> 2) Allocate and free a VA space region as well as bind and unbind memory
>>    to the GPU's VA space via the new DRM_IOCTL_NOUVEAU_VM_BIND ioctl.
>>    UMDs can request the named operations to be processed either
>>    synchronously or asynchronously. It supports DRM syncobjs
>>    (incl. timelines) as a synchronization mechanism. The management of
>>    the GPU VA mappings is implemented with the DRM GPU VA manager.
>>
>> 3) Execute push buffers with the new DRM_IOCTL_NOUVEAU_EXEC ioctl. The
>>    execution happens asynchronously. It supports DRM syncobjs (incl.
>>    timelines) as a synchronization mechanism. DRM GEM object locking is
>>    handled with drm_exec.
>>
>> Both DRM_IOCTL_NOUVEAU_VM_BIND and DRM_IOCTL_NOUVEAU_EXEC use the DRM
>> GPU scheduler for the asynchronous paths.
>
> IDK where the best place to talk about this is, but this seems as good as
> any.
>
> I've been looking into why the Vulkan CTS runs about 2x slower for me on
> the new uAPI, and I created a little benchmark to facilitate testing:
>
> https://gitlab.freedesktop.org/mesa/crucible/-/merge_requests/141
>
> The test, roughly, does the following:
>
>  1. Allocates and binds 1000 BOs
>  2. Constructs a pushbuf that executes a no-op compute shader
>  3. Does a single EXEC/wait combo to warm up the kernel
>  4. Loops 10,000 times, doing SYNCOBJ_RESET (fast), EXEC, and then
>     SYNCOBJ_WAIT, and times the loop
>
> Of course, there's a bit of userspace driver overhead, but that's
> negligible.
>
> If you drop the top patch, which allocates 1k buffers, the submit time on
> the old uAPI is 54 us/exec vs. 66 us/exec on the new uAPI. This includes
> the time to do a SYNCOBJ_RESET (fast), EXEC, and SYNCOBJ_WAIT. The Intel
> driver, by comparison, is 33 us/exec, so it's not syncobj overhead. This
> is a bit concerning (you'd think the new thing would be faster), but what
> really has me concerned is the 1k buffer case.
>
> If you include the top patch in the crucible MR, it allocates 1000 BOs and
> VM_BINDs them. All the binding is done before the warmup EXEC. Suddenly,
> the submit time jumps to 257 us/exec with the new uAPI. The old uAPI is
> much worse (1134 us/exec), but that's not the point. Once we've done the
> first EXEC and created our VM bindings, the cost per EXEC shouldn't change
> at all based on the number of BOs bound. Part of the point of VM_BIND is
> to get all that binding logic and BO walking off the EXEC path.
>
> Normally, I wouldn't be too worried about a little performance problem
> like this. This is the first implementation and we can improve it later.
> I get that. However, I suspect the solution to this problem involves more
> uAPI, and I want to make sure we have it all before we call this all done
> and dusted and land it.
>
> The way AMD, as well as the new Xe driver for Intel, solves this problem
> is to have a concept of internal vs. external BOs. Basically, there's an
> INTERNAL bit specified somewhere in BO creation that has a few userspace
> implications:
>
>  1. In the Xe world where VMs are objects, INTERNAL BOs are assigned a VM
>     on creation and can never be bound to any other VM.
>  2. Any attempt to export an INTERNAL BO via prime or a similar mechanism
>     will fail with -EINVAL (I think?).
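On the userspace side, that scheme can be as small as one extra domain flag
at BO creation time. A minimal sketch of the idea, assuming a hypothetical
NOUVEAU_GEM_DOMAIN_NO_SHARE bit on the existing GEM_NEW ioctl (the name and
the value below are made up for illustration, not taken from any posted
patch):

/*
 * Sketch only: an "internal"/no-share hint at BO creation time.
 * NOUVEAU_GEM_DOMAIN_NO_SHARE is a hypothetical placeholder here.
 */
#include <stdint.h>
#include <string.h>
#include <xf86drm.h>
#include "nouveau_drm.h"	/* kernel uapi header */

#ifndef NOUVEAU_GEM_DOMAIN_NO_SHARE
#define NOUVEAU_GEM_DOMAIN_NO_SHARE (1 << 5)	/* hypothetical bit */
#endif

static int create_internal_bo(int fd, uint64_t size, uint32_t *handle)
{
	struct drm_nouveau_gem_new req;

	memset(&req, 0, sizeof(req));
	req.info.size = size;
	/*
	 * A VRAM BO that is never exported and never bound to another VM;
	 * the kernel can then keep it off the per-EXEC BO walk and reject
	 * any prime export attempt.
	 */
	req.info.domain = NOUVEAU_GEM_DOMAIN_VRAM | NOUVEAU_GEM_DOMAIN_NO_SHARE;

	if (drmIoctl(fd, DRM_IOCTL_NOUVEAU_GEM_NEW, &req))
		return -1;

	*handle = req.info.handle;
	return 0;
}

Everything else (VM_BIND, EXEC) stays as it is; the bit only tells the
kernel that the BO will never leave this VM, so its fences can live on a
VM-wide reservation object instead of the BO being tracked individually on
every EXEC.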
Okay, I've done a first pass at implementing this:

https://gitlab.freedesktop.org/nouvelles/kernel/-/commit/005a0005ff80ec85f3e9a314cae0576b7441076c

Danilo, we should probably pick that up for the next iteration. With the
corresponding Mesa work, it gets the latency for me from 180 us down to
70 us.

Dave.
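For reference, the per-submit path being timed here is roughly the flow
described in the quoted commit message: a one-time VM_INIT, an asynchronous
VM_BIND that signals a syncobj, and an EXEC that waits on it. A sketch under
the assumption that the struct and field names from the proposed uapi header
stay as posted:

#include <stdint.h>
#include <xf86drm.h>
#include "nouveau_drm.h"	/* kernel uapi header */

static int bind_and_exec(int fd, uint32_t channel, uint32_t bo_handle,
			 uint64_t va, uint64_t size, uint32_t syncobj)
{
	/* 1) Split the VA space once per client: the low 4 MiB (example
	 *    value) stays kernel-managed, everything above it is managed
	 *    by the UMD. */
	struct drm_nouveau_vm_init init = {
		.kernel_managed_addr = 0,
		.kernel_managed_size = 4ull << 20,
	};
	if (drmIoctl(fd, DRM_IOCTL_NOUVEAU_VM_INIT, &init))
		return -1;

	/* 2) Asynchronously map the BO at 'va' (in the userspace-managed
	 *    range) and signal 'syncobj' once the mapping is in place. */
	struct drm_nouveau_vm_bind_op op = {
		.op = DRM_NOUVEAU_VM_BIND_OP_MAP,
		.handle = bo_handle,
		.addr = va,
		.bo_offset = 0,
		.range = size,
	};
	struct drm_nouveau_sync sync = {
		.flags = DRM_NOUVEAU_SYNC_SYNCOBJ,
		.handle = syncobj,
	};
	struct drm_nouveau_vm_bind bind = {
		.flags = DRM_NOUVEAU_VM_BIND_RUN_ASYNC,
		.op_count = 1,
		.op_ptr = (uintptr_t)&op,
		.sig_count = 1,
		.sig_ptr = (uintptr_t)&sync,
	};
	if (drmIoctl(fd, DRM_IOCTL_NOUVEAU_VM_BIND, &bind))
		return -1;

	/* 3) Submit a push buffer living at the VA just bound; wait on the
	 *    same syncobj so execution starts only after the bind is done. */
	struct drm_nouveau_exec_push push = {
		.va = va,
		.va_len = 0x1000,
	};
	struct drm_nouveau_exec exec = {
		.channel = channel,
		.wait_count = 1,
		.wait_ptr = (uintptr_t)&sync,
		.push_count = 1,
		.push_ptr = (uintptr_t)&push,
	};
	return drmIoctl(fd, DRM_IOCTL_NOUVEAU_EXEC, &exec);
}

In the crucible benchmark above, steps 1 and 2 happen once up front for all
1000 BOs; the 10,000-iteration loop then measures only the EXEC/wait part,
which is why the per-EXEC cost should not depend on how many BOs are bound.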