On Sun, Jul 30, 2023 at 10:30 PM Faith Ekstrand <fa...@gfxstrand.net> wrote:
>
> On Tue, Jul 25, 2023 at 5:48 PM Danilo Krummrich <d...@redhat.com> wrote:
>
>> On 7/25/23 18:43, Danilo Krummrich wrote:
>> > On 7/25/23 18:16, Faith Ekstrand wrote:
>> >> Thanks for the detailed write-up! That would definitely explain it.
>> >> If I remember, I'll try to do a single-threaded run or two. If your
>> >> theory is correct, there should be no real perf difference when
>> >> running single-threaded. Those runs will take a long time, though, so
>> >> I'll have to run them overnight. I'll let you know in a few days once
>> >> I have the results.
>> >
>> > I can also push a separate branch where I just print out a warning
>> > whenever we run into such a condition, including the time we were
>> > waiting for things to complete. I can probably push something later
>> > today.
>>
>> https://gitlab.freedesktop.org/nouvelles/kernel/-/tree/new-uapi-drm-next-track-stalls
>>
>> It prints out the duration of every wait as well as the total wait time
>> since boot.
>>
>> - Danilo
>>
>> >
>> >>
>> >> If this theory holds, then I'm not concerned about the performance of
>> >> the API itself. It would still be good to see if we can find a way to
>> >> reduce the cross-process drag in the implementation, but that's a perf
>> >> optimization we can do later.
>> >
>> > From the kernel side I think the only thing we could really do is to
>> > temporarily run a secondary drm_gpu_scheduler instance, one for
>> > VM_BINDs and one for EXECs, until we get the new page table handling
>> > in place.
>> >
>> > However, the UMD could avoid such conditions more effectively, since
>> > it controls the address space. Namely, avoid re-using the same region
>> > of the address space right away in certain cases. For instance, instead
>> > of replacing a sparse region A[0x0, 0x4000000] with a larger sparse
>> > region B[0x0, 0x8000000], replace it with B'[0x4000000, 0xC000000] if
>> > possible.
>> >
>> > However, I'm just mentioning this for completeness. The UMD surely
>> > shouldn't have to work around such a kernel limitation, even
>> > temporarily.
>> >
>> > Anyway, before doing any of those, let's see if the theory holds and
>> > we're actually running into such cases.
>> >
>> >>
>> >> Does it actually matter? Yes, it kinda does. No, it probably doesn't
>> >> matter for games because you're typically only running one game at a
>> >> time. From a development PoV, however, if it makes CI take longer,
>> >> then that slows down development and that's not good for the users,
>> >> either.
>
> I've run a bunch of tests over the weekend and it's starting to look like
> the added CTS time might be from the newly enabled synchronization tests
> themselves. We enabled timeline semaphores as well as semaphore and fence
> sharing along with the new uAPI, and I did not expect those to be nearly
> that heavy-weight, so I didn't think of that as a likely factor. I'm doing
> a couple more runs to confirm, but what I'm seeing right now seems to
> indicate that the new API with the old feature set has about the same run
> time now that the submit overhead issue has been fixed.

My last two runs have come back and confirmed this theory. With the BO
fixes, all remaining slow-downs are coming from newly added tests, which
turn out to be very slow tests.

~Faith

> I think this means that we can proceed under the assumption that there
> are no more serious perf issues with the new API. I'm happy to RB/ACK the
> next version of the API and implementation patches, as long as it has the
> new NO_SHARE BO create flag and properly rejects exports of NO_SHARE BOs,
> even if not all of the dma_resv plumbing is fully baked.
>
> ~Faith
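
As an aside, here is a minimal sketch of the kind of export rejection Faith
is asking for: a driver could hook the GEM object's ->export callback and
refuse dma-buf export for BOs created with a NO_SHARE-style flag. Everything
driver-specific below (the example_bo struct, the no_share field, the
function names) is assumed for illustration and is not the actual nouveau
patch; only struct drm_gem_object_funcs and drm_gem_prime_export() are
existing DRM interfaces.

#include <linux/container_of.h>
#include <linux/dma-buf.h>
#include <linux/err.h>

#include <drm/drm_gem.h>
#include <drm/drm_prime.h>

/* Illustrative BO wrapper; a real driver has its own BO struct. */
struct example_bo {
	struct drm_gem_object base;
	bool no_share;	/* assumed: set at BO-create time from the NO_SHARE uAPI flag */
};

static struct dma_buf *
example_gem_prime_export(struct drm_gem_object *obj, int flags)
{
	struct example_bo *bo = container_of(obj, struct example_bo, base);

	/*
	 * The uAPI contract for NO_SHARE BOs is that they never leave the
	 * owning process, so refuse to create a dma-buf for them.
	 */
	if (bo->no_share)
		return ERR_PTR(-EPERM);

	/* Otherwise fall back to the standard DRM prime export helper. */
	return drm_gem_prime_export(obj, flags);
}

/* Wired up through the GEM object's function table at BO creation. */
static const struct drm_gem_object_funcs example_gem_funcs = {
	.export = example_gem_prime_export,
	/* .free, .get_sg_table, .vmap, ... elided */
};

Where exactly the check lives (the export hook, the handle-to-fd path, or
BO validation) is a driver choice; the point is just that exporting a
NO_SHARE BO fails cleanly with an error instead of being silently allowed.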