On Mon, Jul 24, 2023 at 9:04 PM Danilo Krummrich <d...@redhat.com> wrote:

> On 7/22/23 17:12, Faith Ekstrand wrote:
> > On Wed, Jul 19, 2023 at 7:15 PM Danilo Krummrich <d...@redhat.com> wrote:
> >
> >     This commit provides the implementation for the new uapi motivated by the Vulkan API. It allows user mode drivers (UMDs) to:
> >
> >     1) Initialize a GPU virtual address (VA) space via the new DRM_IOCTL_NOUVEAU_VM_INIT ioctl, for UMDs to specify the portion of VA space managed by the kernel and userspace, respectively.
> >
> >     2) Allocate and free a VA space region, as well as bind and unbind memory to the GPU's VA space, via the new DRM_IOCTL_NOUVEAU_VM_BIND ioctl. UMDs can request the named operations to be processed either synchronously or asynchronously. It supports DRM syncobjs (incl. timelines) as a synchronization mechanism. The management of the GPU VA mappings is implemented with the DRM GPU VA manager.
> >
> >     3) Execute push buffers with the new DRM_IOCTL_NOUVEAU_EXEC ioctl. The execution happens asynchronously. It supports DRM syncobjs (incl. timelines) as a synchronization mechanism. DRM GEM object locking is handled with drm_exec.
> >
> >     Both DRM_IOCTL_NOUVEAU_VM_BIND and DRM_IOCTL_NOUVEAU_EXEC use the DRM GPU scheduler for the asynchronous paths.
> >
> > IDK where the best place to talk about this is but this seems as good as any.
> >
> > I've been looking into why the Vulkan CTS runs about 2x slower for me on the new UAPI and I created a little benchmark to facilitate testing:
> >
> > https://gitlab.freedesktop.org/mesa/crucible/-/merge_requests/141
> >
> > The test, roughly, does the following:
> >
> > 1. Allocates and binds 1000 BOs
> > 2. Constructs a pushbuf that executes a no-op compute shader.
> > 3. Does a single EXEC/wait combo to warm up the kernel
> > 4. Loops 10,000 times, doing SYNCOBJ_RESET (fast), EXEC, and then SYNCOBJ_WAIT, and times the loop (roughly sketched below)
> >
> > Of course, there's a bit of userspace driver overhead but that's negligible.
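
> > For reference, the timed part looks roughly like the following. This is not the actual crucible code; submit_noop_exec() is just a placeholder for the UMD path that ends up in DRM_IOCTL_NOUVEAU_EXEC and attaches a signal fence to the syncobj:
> >
> >     #include <stdint.h>
> >     #include <time.h>
> >     #include <xf86drm.h>
> >
> >     #define ITERATIONS 10000
> >
> >     /* Placeholder: submits the pre-built no-op pushbuf and asks the
> >      * kernel to signal `syncobj` when it completes. */
> >     void submit_noop_exec(int fd, uint32_t syncobj);
> >
> >     static double submit_latency_us(int fd, uint32_t syncobj)
> >     {
> >         struct timespec start, end;
> >
> >         clock_gettime(CLOCK_MONOTONIC, &start);
> >         for (int i = 0; i < ITERATIONS; i++) {
> >             drmSyncobjReset(fd, &syncobj, 1);                    /* fast */
> >             submit_noop_exec(fd, syncobj);                       /* EXEC */
> >             drmSyncobjWait(fd, &syncobj, 1, INT64_MAX, 0, NULL); /* wait */
> >         }
> >         clock_gettime(CLOCK_MONOTONIC, &end);
> >
> >         double ns = (end.tv_sec - start.tv_sec) * 1e9 +
> >                     (end.tv_nsec - start.tv_nsec);
> >         return ns / 1000.0 / ITERATIONS; /* microseconds per EXEC */
> >     }
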
> > If you drop the top patch which allocates 1k buffers, the submit time on the old uAPI is 54 us/exec vs. 66 us/exec on the new uAPI. This includes the time to do a SYNCOBJ_RESET (fast), EXEC, and SYNCOBJ_WAIT. The Intel driver, by comparison, is 33 us/exec, so it's not syncobj overhead. This is a bit concerning (you'd think the new thing would be faster), but what really has me concerned is the 1k buffer case.
> >
> > If you include the top patch in the crucible MR, it allocates 1000 BOs and VM_BINDs them. All the binding is done before the warmup EXEC. Suddenly, the submit time jumps to 257 us/exec with the new uAPI. The old uAPI is much worse (1134 us/exec), but that's not the point. Once we've done the first EXEC and created our VM bindings, the cost per EXEC shouldn't change at all based on the number of BOs bound. Part of the point of VM_BIND is to get all that binding logic and BO walking off the EXEC path.
> >
> > Normally, I wouldn't be too worried about a little performance problem like this. This is the first implementation and we can improve it later. I get that. However, I suspect the solution to this problem involves more UAPI and I want to make sure we have it all before we call this all done and dusted and land it.
> >
> > The way AMD solves this problem, as well as the new Xe driver for Intel, is to have a concept of internal vs. external BOs. Basically, there's an INTERNAL bit specified somewhere in BO creation that has a few userspace implications:
> >
> > 1. In the Xe world where VMs are objects, INTERNAL BOs are assigned a VM on creation and can never be bound to any other VM.
> > 2. Any attempt to export an INTERNAL BO via prime or a similar mechanism will fail with -EINVAL (I think?).
> >
> > Inside the kernel driver, all the internal BOs on a VM (or DRM file in the case of nouveau/AMD, since they don't have VM objects) share a single dma_resv, which allows you to avoid having to walk lists of BOs and take locks on every exec. Instead, you can just look at the fences on the dma_resv for the VM. There's still a BO list associated with the VM for external BOs but, in most Vulkan applications, there are fewer than a half dozen external BOs total. Meanwhile, the hundreds or thousands of BOs used entirely internally to the application basically count as one BO when it comes to locking overhead.
>
> I am aware of that and I have some WIP patches [1] to generalize a common dma-resv within the GPUVA manager, which basically represents a GPU-VM. It also keeps track of external GEM objects and evicted objects, such that on EXEC we only need to validate objects actually needing validation, rather than all of them. Hence, it should be faster than with Dave's patch, which has a common dma-resv only.
>
> In [1] I also picked up Dave's code to allow no-op jobs to be submitted, as well as the NOUVEAU_GEM_DOMAIN_NO_SHARE flag.
>
> This seems to work fine with your and Dave's latest mesa work (670c301a9845a3fc795fd48a1e6714e75b388245).
>
> Your crucible bench.submit-latency test goes down to 51 us on my machine with those patches.
>
> I am unsure, though, whether we should aim for a common solution within the GPUVA manager directly or do it driver-specific in a first shot. I discussed this patch with Matt and I know that Xe is looking for a generalized solution as well. However, it surely needs some more care and polish, and feedback from other drivers' perspective.
>
> [1] https://gitlab.freedesktop.org/nouvelles/kernel/-/tree/new-uapi-drm-next-vm-resv
>
> > I'm not actually trying to dictate kernel driver design here. If one dma_resv doesn't work, fine. I don't care as long as EXEC scales. However, given that the solution in all the other drivers involves a BO create flag nouveau doesn't have, we need to either add that or prove that we can get EXEC to scale without it.
>
> From #nouveau:
>
> <gfxstrand> CTSing now
> <gfxstrand> It looks like it's still going to take 1.5 hours.
>
> I may have an idea what could be the issue, let me explain.
>
> Currently, there is a single drm_gpu_scheduler with a drm_sched_entity per client (for VM_BIND jobs) and a drm_sched_entity per channel (for EXEC jobs).
>
> For VM_BIND jobs the corresponding PT[E]s are allocated before the job is pushed to the corresponding drm_sched_entity. The PT[E]s are freed by the scheduler's free() callback, which pushes work to a single-threaded workqueue doing the actual free. (We can't do it in the free() callback directly, since to free PT[E]s we need to hold a mutex we also need to hold while allocating them.)
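
> The free path roughly has the following shape. This is a simplified sketch rather than the actual code; the struct and function names below are made up purely for illustration:
>
>     #include <linux/container_of.h>
>     #include <linux/mutex.h>
>     #include <linux/workqueue.h>
>     #include <drm/gpu_scheduler.h>
>
>     struct bind_job {
>             struct drm_sched_job base;
>             struct work_struct cleanup_work;
>             struct workqueue_struct *cleanup_wq;    /* single-threaded */
>             struct mutex *pt_mutex;                 /* also held while allocating PT[E]s */
>     };
>
>     static void bind_job_cleanup_work(struct work_struct *work)
>     {
>             struct bind_job *job = container_of(work, struct bind_job, cleanup_work);
>
>             /* The actual PT[E] free runs here, under the same mutex the
>              * allocation path takes. */
>             mutex_lock(job->pt_mutex);
>             /* ... free the PT[E]s backing the unmapped range ... */
>             mutex_unlock(job->pt_mutex);
>     }
>
>     /* Wired up as drm_sched_backend_ops.free_job. */
>     static void bind_job_free(struct drm_sched_job *sched_job)
>     {
>             struct bind_job *job = container_of(sched_job, struct bind_job, base);
>
>             /* Can't take pt_mutex from here, so defer the real cleanup. */
>             queue_work(job->cleanup_wq, &job->cleanup_work);
>     }
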
> Because of how the page table handling in Nouveau is currently implemented, there are some ordering restrictions when it comes to allocating and freeing PT[E]s. For instance, we can't allocate PT[E]s for sparse regions before the PT[E]s of previously removed memory-backed mappings *within the same address range* are freed. The same applies vice versa, and for a sparse mapping replacing a sparse mapping. For memory-backed mappings (also for those within sparse regions) we do *not* have such ordering requirements.
>
> So, let's assume userspace removes a sparse region A[0x0, 0x8000000] and asks for a couple of new memory-backed mappings within or crossing this range; the kernel needs to wait not only for A to be unmapped, but also for the backing PT[E]s to be freed, before it can even allocate the PT[E]s for the new memory-backed mappings.
>
> Now, let's have a look at what the GPU scheduler's main loop does. Before picking the next entity to schedule a job for, it tries to fetch the first job from the pending_list and checks whether its dma-fence is signaled already and whether the job can be cleaned up. Subsequent jobs on the pending_list are not taken into consideration. Hence, it might well be that the first job on the pending_list isn't signaled yet, but subsequent jobs are and hence *could* be cleaned up.
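
> The check is roughly the following; this is a simplified paraphrase of drm_sched_get_cleanup_job() in the scheduler core, not a verbatim copy:
>
>     #include <linux/dma-fence.h>
>     #include <linux/list.h>
>     #include <linux/spinlock.h>
>     #include <drm/gpu_scheduler.h>
>
>     static struct drm_sched_job *
>     get_cleanup_job(struct drm_gpu_scheduler *sched)
>     {
>             struct drm_sched_job *job;
>
>             spin_lock(&sched->job_list_lock);
>
>             /* Only the *first* job on the pending_list is looked at. */
>             job = list_first_entry_or_null(&sched->pending_list,
>                                            struct drm_sched_job, list);
>
>             if (job && dma_fence_is_signaled(&job->s_fence->finished)) {
>                     /* Head job finished: remove it so free_job() runs. */
>                     list_del_init(&job->list);
>             } else {
>                     /* Head job still running: nothing gets cleaned up,
>                      * even if later jobs have finished already. */
>                     job = NULL;
>             }
>
>             spin_unlock(&sched->job_list_lock);
>
>             return job;
>     }
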
> Normally, this shouldn't be a problem, since we wouldn't really care *when* resources are cleaned up, as long as they are eventually. However, with the ordering restrictions the page table handling gives us, we might actually care about the "when".
>
> For instance, it could happen that the first job on the pending_list is a rather long-running EXEC job (1) scheduled from client A on some channel. The next job on the pending_list could be a VM_BIND job (2) from client B removing a sparse region, which is finished already but is blocked from being cleaned up until the EXEC job (1) from client A is finished and cleaned up. Now, a subsequent VM_BIND job (3) from client B creating a new memory-backed mapping in the same address range as the sparse region removed by job (2) would need to wait for (2) to be cleaned up. Ultimately, we can expect client B to submit an EXEC job that needs to wait for the corresponding mappings to be created, namely by the VM_BIND job (3).
>
> Clearly, in order to address this we need to rework the page table handling in Nouveau to get rid of those ordering restrictions.
>
> Temporarily, we could also try to run a second drm_gpu_scheduler instance (one for VM_BINDs and one for EXECs), maybe...
>
> However, I would not expect this to be an issue in real applications, especially if mesa takes a little care not to re-use certain address space areas right away to avoid running into such wait conditions.
>
> For parallel VK CTS runs I could imagine that we run into such cases from time to time, though.

Thanks for the detailed write-up! That would definitely explain it. If I remember, I'll try to do a single-threaded run or two. If your theory is correct, there should be no real perf difference when running single-threaded. Those runs will take a long time, though, so I'll have to run them overnight. I'll let you know in a few days once I have the results.

If this theory holds, then I'm not concerned about the performance of the API itself. It would still be good to see if we can find a way to reduce the cross-process drag in the implementation, but that's a perf optimization we can do later.

Does it actually matter? Yes, it kinda does. No, it probably doesn't matter for games, because you're typically only running one game at a time. From a development PoV, however, if it makes CI take longer, then that slows down development, and that's not good for the users, either.

~Faith

> - Danilo
>
> > ~Faith