On Sun, 23 Jul 2023 at 01:12, Faith Ekstrand <fa...@gfxstrand.net> wrote:
>
> On Wed, Jul 19, 2023 at 7:15 PM Danilo Krummrich <d...@redhat.com> wrote:
>>
>> This commit provides the implementation for the new uapi motivated by the
>> Vulkan API. It allows user mode drivers (UMDs) to:
>>
>> 1) Initialize a GPU virtual address (VA) space via the new
>>    DRM_IOCTL_NOUVEAU_VM_INIT ioctl for UMDs to specify the portion of VA
>>    space managed by the kernel and userspace, respectively.
>>
>> 2) Allocate and free a VA space region as well as bind and unbind memory
>>    to the GPU's VA space via the new DRM_IOCTL_NOUVEAU_VM_BIND ioctl.
>>    UMDs can request the named operations to be processed either
>>    synchronously or asynchronously. It supports DRM syncobjs
>>    (incl. timelines) as a synchronization mechanism. The management of
>>    the GPU VA mappings is implemented with the DRM GPU VA manager.
>>
>> 3) Execute push buffers with the new DRM_IOCTL_NOUVEAU_EXEC ioctl. The
>>    execution happens asynchronously. It supports DRM syncobjs (incl.
>>    timelines) as a synchronization mechanism. DRM GEM object locking is
>>    handled with drm_exec.
>>
>> Both DRM_IOCTL_NOUVEAU_VM_BIND and DRM_IOCTL_NOUVEAU_EXEC use the DRM
>> GPU scheduler for the asynchronous paths.
>
> IDK where the best place to talk about this is, but this seems as good as
> any.
>
> I've been looking into why the Vulkan CTS runs about 2x slower for me on
> the new uAPI, and I created a little benchmark to facilitate testing:
>
> https://gitlab.freedesktop.org/mesa/crucible/-/merge_requests/141
>
> The test, roughly, does the following:
>
>  1. Allocates and binds 1000 BOs
>  2. Constructs a pushbuf that executes a no-op compute shader
>  3. Does a single EXEC/wait combo to warm up the kernel
>  4. Loops 10,000 times, doing SYNCOBJ_RESET (fast), EXEC, and then
>     SYNCOBJ_WAIT, and times the loop
>
> Of course, there's a bit of userspace driver overhead, but that's
> negligible.
>
> If you drop the top patch, which allocates 1k buffers, the submit time on
> the old uAPI is 54 us/exec vs. 66 us/exec on the new uAPI. This includes
> the time to do a SYNCOBJ_RESET (fast), EXEC, and SYNCOBJ_WAIT. The Intel
> driver, by comparison, is 33 us/exec, so it's not syncobj overhead. This
> is a bit concerning (you'd think the new thing would be faster), but what
> really has me concerned is the 1k buffer case.
>
> If you include the top patch in the crucible MR, it allocates 1000 BOs and
> VM_BINDs them. All the binding is done before the warmup EXEC. Suddenly,
> the submit time jumps to 257 us/exec with the new uAPI. The old uAPI is
> much worse (1134 us/exec), but that's not the point. Once we've done the
> first EXEC and created our VM bindings, the cost per EXEC shouldn't change
> at all based on the number of BOs bound. Part of the point of VM_BIND is
> to get all that binding logic and BO walking off the EXEC path.
>
> Normally, I wouldn't be too worried about a little performance problem
> like this. This is the first implementation and we can improve it later.
> I get that. However, I suspect the solution to this problem involves more
> uAPI, and I want to make sure we have it all before we call this all done
> and dusted and land it.
>
> The way AMD, as well as the new Xe driver for Intel, solves this problem
> is to have a concept of internal vs. external BOs. Basically, there's an
> INTERNAL bit specified somewhere in BO creation that has a few userspace
> implications:
>
>  1. In the Xe world where VMs are objects, INTERNAL BOs are assigned a VM
>     on creation and can never be bound to any other VM.
>  2. Any attempt to export an INTERNAL BO via prime or a similar mechanism
>     will fail with -EINVAL (I think?).
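On the userspace side, that scheme can be as small as one extra domain flag
at BO creation time. A minimal sketch of the idea, assuming a hypothetical
NOUVEAU_GEM_DOMAIN_NO_SHARE bit on the existing GEM_NEW ioctl (the name and
the value below are made up for illustration, not taken from any posted
patch):

/*
 * Sketch only: an "internal"/no-share hint at BO creation time.
 * NOUVEAU_GEM_DOMAIN_NO_SHARE is a hypothetical placeholder here.
 */
#include <stdint.h>
#include <string.h>
#include <xf86drm.h>
#include "nouveau_drm.h"	/* kernel uapi header */

#ifndef NOUVEAU_GEM_DOMAIN_NO_SHARE
#define NOUVEAU_GEM_DOMAIN_NO_SHARE (1 << 5)	/* hypothetical bit */
#endif

static int create_internal_bo(int fd, uint64_t size, uint32_t *handle)
{
	struct drm_nouveau_gem_new req;

	memset(&req, 0, sizeof(req));
	req.info.size = size;
	/*
	 * A VRAM BO that is never exported and never bound to another VM;
	 * the kernel can then keep it off the per-EXEC BO walk and reject
	 * any prime export attempt.
	 */
	req.info.domain = NOUVEAU_GEM_DOMAIN_VRAM | NOUVEAU_GEM_DOMAIN_NO_SHARE;

	if (drmIoctl(fd, DRM_IOCTL_NOUVEAU_GEM_NEW, &req))
		return -1;

	*handle = req.info.handle;
	return 0;
}

Everything else (VM_BIND, EXEC) stays as it is; the bit only tells the
kernel that the BO will never leave this VM, so its fences can live on a
VM-wide reservation object instead of the BO being tracked individually on
every EXEC.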
Okay, I've done a first pass at implementing this:

https://gitlab.freedesktop.org/nouvelles/kernel/-/commit/005a0005ff80ec85f3e9a314cae0576b7441076c

Danilo, we should probably pick that up for the next iteration. With the
corresponding Mesa work, it gets the latency for me from 180 us down to
70 us.

Dave.
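For reference, the per-submit path being timed here is roughly the flow
described in the quoted commit message: a one-time VM_INIT, an asynchronous
VM_BIND that signals a syncobj, and an EXEC that waits on it. A sketch under
the assumption that the struct and field names from the proposed uapi header
stay as posted:

#include <stdint.h>
#include <xf86drm.h>
#include "nouveau_drm.h"	/* kernel uapi header */

static int bind_and_exec(int fd, uint32_t channel, uint32_t bo_handle,
			 uint64_t va, uint64_t size, uint32_t syncobj)
{
	/* 1) Split the VA space once per client: the low 4 MiB (example
	 *    value) stays kernel-managed, everything above it is managed
	 *    by the UMD. */
	struct drm_nouveau_vm_init init = {
		.kernel_managed_addr = 0,
		.kernel_managed_size = 4ull << 20,
	};
	if (drmIoctl(fd, DRM_IOCTL_NOUVEAU_VM_INIT, &init))
		return -1;

	/* 2) Asynchronously map the BO at 'va' (in the userspace-managed
	 *    range) and signal 'syncobj' once the mapping is in place. */
	struct drm_nouveau_vm_bind_op op = {
		.op = DRM_NOUVEAU_VM_BIND_OP_MAP,
		.handle = bo_handle,
		.addr = va,
		.bo_offset = 0,
		.range = size,
	};
	struct drm_nouveau_sync sync = {
		.flags = DRM_NOUVEAU_SYNC_SYNCOBJ,
		.handle = syncobj,
	};
	struct drm_nouveau_vm_bind bind = {
		.flags = DRM_NOUVEAU_VM_BIND_RUN_ASYNC,
		.op_count = 1,
		.op_ptr = (uintptr_t)&op,
		.sig_count = 1,
		.sig_ptr = (uintptr_t)&sync,
	};
	if (drmIoctl(fd, DRM_IOCTL_NOUVEAU_VM_BIND, &bind))
		return -1;

	/* 3) Submit a push buffer living at the VA just bound; wait on the
	 *    same syncobj so execution starts only after the bind is done. */
	struct drm_nouveau_exec_push push = {
		.va = va,
		.va_len = 0x1000,
	};
	struct drm_nouveau_exec exec = {
		.channel = channel,
		.wait_count = 1,
		.wait_ptr = (uintptr_t)&sync,
		.push_count = 1,
		.push_ptr = (uintptr_t)&push,
	};
	return drmIoctl(fd, DRM_IOCTL_NOUVEAU_EXEC, &exec);
}

In the crucible benchmark above, steps 1 and 2 happen once up front for all
1000 BOs; the 10,000-iteration loop then measures only the EXEC/wait part,
which is why the per-EXEC cost should not depend on how many BOs are bound.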