On Tue, May 20, 2025 at 12:06 AM Danilo Krummrich <d...@kernel.org> wrote:
>
> On Thu, May 15, 2025 at 12:56:38PM -0700, Rob Clark wrote:
> > On Thu, May 15, 2025 at 11:56 AM Danilo Krummrich <d...@kernel.org> wrote:
> > >
> > > On Thu, May 15, 2025 at 10:40:15AM -0700, Rob Clark wrote:
> > > > On Thu, May 15, 2025 at 10:30 AM Danilo Krummrich <d...@kernel.org> wrote:
> > > > >
> > > > > (Cc: Boris)
> > > > >
> > > > > On Thu, May 15, 2025 at 12:22:18PM -0400, Connor Abbott wrote:
> > > > > > For some context, other drivers have the concept of a "synchronous"
> > > > > > VM_BIND ioctl which completes immediately, and drivers implement it
> > > > > > by waiting for the whole thing to finish before returning.
> > > > >
> > > > > Nouveau implements sync by issuing a normal async VM_BIND and
> > > > > subsequently waiting for the out-fence synchronously.
> > > >
> > > > As Connor mentioned, we'd prefer it to be async rather than blocking
> > > > in normal cases, otherwise with drm native context (using the native
> > > > UMD in a guest VM) you'd be blocking the single host/VMM virglrenderer
> > > > thread.
> > > >
> > > > The key is that we want to keep it async in the normal cases, and not
> > > > have weird edge-case CTS tests blow up from being _too_ async ;-)
> > >
> > > I really wonder why they don't blow up in Nouveau, which also supports
> > > full asynchronous VM_BIND. Mind sharing which tests blow up? :)
> >
> > Maybe it was
> > dEQP-VK.sparse_resources.buffer.ssbo.sparse_residency.buffer_size_2_24,
>
> The test above is part of the smoke testing I do for nouveau, but I haven't
> seen such issues yet for nouveau.

nouveau is probably not using async binds for everything? Or maybe I'm just
pointing to the wrong test.

> > but I might be mixing that up, I'd have to back out this patch and see
> > where things blow up, which would take many hours.
>
> Well, you said that you never had this issue with "real" workloads, but only
> with the VK CTS, so I really think we should know what we are trying to fix
> here.
>
> We can't just add new generic infrastructure without reasonable and *well
> understood* justification.

What is not well understood about this? We need to pre-allocate pagetable
memory that we likely won't end up needing. In the worst case (a large
number of async PAGE_SIZE binds), you end up needing to pre-allocate three
pgtable pages (with a 4-level pagetable) per single page of mapping. Queue
up enough of those and you can explode your memory usage.
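To make that worst case a bit more concrete, here's a rough sketch of the
model (this is not the actual io-pgtable code, just an illustration assuming
a 4KB granule and ignoring binds that straddle table boundaries):

/* Rough illustration, not the real pgtable walker: with a 4-level,
 * 4KB-granule pagetable each level covers 512x the address space of the
 * level below it.  Since we can't know at enqueue time which intermediate
 * tables will already exist by the time the job runs, we have to
 * pre-allocate for the worst case at every level below the top.
 */
#include <linux/math.h>		/* for DIV_ROUND_UP() */

#define PTRS_PER_LVL	512UL	/* 4KB table / 8-byte descriptors */

static unsigned long worst_case_prealloc_pages(unsigned long npages)
{
	unsigned long total = 0, span = npages;
	int lvl;

	/* one table page per span of 512 entries at each of the 3 sub-top levels */
	for (lvl = 0; lvl < 3; lvl++) {
		span = DIV_ROUND_UP(span, PTRS_PER_LVL);
		total += span;
	}

	return total;	/* == 3 for a single-page bind */
}

So a single 4KB bind ties up 12KB of pre-allocated pgtables, while a large
bind amortizes the intermediate levels over many pages; that 3:1 ratio is
why a flood of small async binds is the painful case.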
> > There definitely was one where I was seeing >5k VM_BIND jobs pile up,
> > so absolutely throttling like this is needed.
>
> I still don't understand why the kernel must throttle this? If userspace uses
> async VM_BIND, it obviously can't spam the kernel infinitely without running
> into an OOM case.

It is a valid question whether the kernel or userspace should be the one to
do the throttling. I went for doing it in the kernel because the kernel has
better knowledge of how much it needs to pre-allocate.

(There is also the side point that this pre-allocated memory is not charged
to the calling process from a memory-accounting PoV. With that in mind it
seems like a good idea for the kernel to throttle memory usage.)

> But let's assume we agree that we want to avoid userspace ever being able to
> OOM itself through async VM_BIND, then the proposed solution seems wrong:
>
> Do we really want the driver developer to set an arbitrary boundary of a
> number of jobs that can be submitted before *async* VM_BIND blocks and
> becomes semi-sync?
>
> How do we choose this number of jobs? A very small number to be safe, which
> scales badly on powerful machines? A large number that scales well on
> powerful machines, but OOMs on weaker ones?

The way I am using it in msm, the credit amount and limit are in units of
pre-allocated pages in flight. I set the enqueue_credit_limit to 1024 pages;
once the jobs queued up exceed that limit, further submits start blocking.
The number of _jobs_ is irrelevant, what matters is the number of
pre-alloc'd pages in flight.
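To spell out what the credits actually are, the gist is roughly this
(heavily simplified sketch, not the actual drm_sched changes; the struct and
function names are made up for illustration):

#include <linux/atomic.h>
#include <linux/wait.h>

/*
 * Simplified sketch of the throttling idea (illustrative names, and the
 * check-vs-add race is ignored for brevity).  Each VM_BIND job carries a
 * credit count equal to the number of pgtable pages it pre-allocated, and
 * enqueueing blocks once too many credits are in flight.
 */
struct bind_throttle {
	atomic_t		in_flight;	/* pre-alloc'd pages currently queued */
	unsigned int		limit;		/* e.g. 1024 pages for msm */
	wait_queue_head_t	wq;
};

/* Called on the submit path; only blocks once the limit is exceeded, so
 * below the limit VM_BIND stays fully async.
 */
static int bind_throttle_acquire(struct bind_throttle *t, unsigned int credits)
{
	int ret;

	ret = wait_event_interruptible(t->wq,
			atomic_read(&t->in_flight) < t->limit);
	if (ret)
		return ret;	/* interrupted by a signal */

	atomic_add(credits, &t->in_flight);
	return 0;
}

/* Called once the job has run and its pre-allocated pages have either been
 * consumed by the pagetables or returned.
 */
static void bind_throttle_release(struct bind_throttle *t, unsigned int credits)
{
	atomic_sub(credits, &t->in_flight);
	wake_up_all(&t->wq);
}

A single huge bind and a thousand single-page binds need very different
amounts of pre-allocated pgtable memory, so counting jobs wouldn't be a
meaningful knob; counting pages is.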
> I really think this isn't the correct solution, but more a workaround.
>
> > Part of the VM_BIND for msm series adds some tracepoints for the amount of
> > memory preallocated vs used for each job. That plus the scheduler
> > tracepoints should let you see how much memory is tied up in
> > prealloc'd pgtables. You might not be noticing only because you are
> > running on a big desktop with lots of RAM ;-)
> >
> > > > > > But this
> > > > > > doesn't work for native context, where everything has to be
> > > > > > asynchronous, so we're trying a new approach where we instead submit
> > > > > > an asynchronous bind for "normal" (non-sparse/driver-internal)
> > > > > > allocations and only attach its out-fence to the in-fence of
> > > > > > subsequent submits to other queues.
> > > > >
> > > > > This is what nouveau does, and I think other drivers like Xe and
> > > > > panthor do this as well.
> > > >
> > > > No one has added native context support for these drivers yet.
> > >
> > > Huh? What exactly do you mean with "native context" then?
> >
> > It is a way to use the native usermode driver in a guest VM, by remoting
> > at the UAPI level, as opposed to the vk or gl API level. You can
> > generally get equal-to-native performance, but the guest/host boundary
> > strongly encourages being asynchronous to hide the guest->host latency.
>
> For the context we're discussing, this isn't different from other drivers
> supporting async VM_BIND utilizing it from the host rather than from a guest.
>
> So, my original statement about nouveau, Xe and panthor doing the same thing
> without running into trouble should be valid.

Probably the difference is that we don't do any _synchronous_ binds, and
that is partially motivated by the virtual machine case.

BR,
-R