On Mon, Aug 19, 2024 at 10:00 AM Bas Nieuwenhuizen <b...@basnieuwenhuizen.nl> wrote:
>
> On Mon, Aug 19, 2024 at 4:51 PM Christian König <ckoenig.leichtzumer...@gmail.com> wrote:
>
>> On 07.08.24 at 22:33, Bas Nieuwenhuizen wrote:
>>
>> On Wed, Aug 7, 2024 at 10:25 PM Faith Ekstrand <fa...@gfxstrand.net> wrote:
>>
>>> On Wed, Aug 7, 2024 at 2:23 PM Joshua Ashton <jos...@froggi.es> wrote:
>>>
>>>> I was thinking about this more recently. I was initially considering "maybe this should be a per-BO import," but I couldn't think of anything in the GL model that would actually benefit, given it's not "true" bindless and there's no update-after-bind there.
>>>
>>> That's also an option and it's the way it works on i915. However, then you have to pass lists of things to the kernel and that's kinda gross. If we need it, we can do that. Otherwise, I think a more global thing is fine. I think Bas is currently investigating a per-submit approach which is a tad different from either but should also work okay.
>>
>> Yeah, I'm working on a per-submit thing (also using BOOKKEEP fences instead of the EXPLICIT wait mode to ensure we also don't add implicit fences).
>>
>> Yeah, agree. Your implementation with the per-submission flag and using BOOKKEEP actually sounds like the right thing to do to me as well.
>>
>> We need to keep in mind that synchronization goes both ways, e.g. explicit->implicit as well as implicit->explicit.
>>
>> I would rather like to keep the implicit->explicit handling (which this patch here completely disables) and only allow the explicit->implicit handling (which works by using BOOKKEEP fences).
>>
>> This way it is possible that we still over-synchronize, for example when a double-buffered BO is re-used by an explicit client and an implicit display server, but that's probably not something we should optimize in the first place.
>
> This oversynchronization actually happens easily, as in bindless Vulkan we have to mark all buffers as "used". We have some hacks to avoid the worst of it but it can be pretty meh.

Yeah, this case is actually really important. When I initially did the dma-buf fence import/export work on Intel, it was a massive speed-up in DOOM 2016, precisely from removing this bit of over-sync.

~Faith

> In my series on the ML [1] I think I actually got both sides by waiting on KERNEL fences only and setting BOOKKEEP fences. (Yeah, it actually ends up kinda orthogonal to the sync mode, but it is what it is ...)
>
> - Bas
>
> [1] https://patchwork.freedesktop.org/series/137014/
>
>> Regards,
>> Christian.
>>
>> We do have a per-BO list on submission already, so we could add things there; it is just very annoying to implement, as currently at the point where we do the fence wait/signal we have lost the association with the BO list. Given that I don't see a use case anytime soon (there are some theoreticals, like radeonsi might start doing READ usage instead of RW usage with extra tracking), I feel it isn't worth the added implementation complexity.
>
>>>> Worth others more familiar with GL asking that question of themselves also. I am definitely not totally up on what's possible there.
>>>>
>>>> Overall, I think I am OK with this approach, even though I think mixing implicit and explicit sync is gross, and I want the pain that is implicit sync to just go away forever. :-)
>>>
>>> So say we all...
>>>
>>> ~Faith
>>>
>>>> - Joshie 🐸✨
>>>>
>>>> On August 7, 2024 4:39:32 PM GMT+01:00, Faith Ekstrand <fa...@gfxstrand.net> wrote:
>>>> >Previously, AMDGPU_GEM_CREATE_EXPLICIT_SYNC was used to disable implicit synchronization on BOs when explicit synchronization can be used. The problem is that this flag is per-BO and affects all amdgpu users in the system, not just the usermode driver which sets it. This can lead to some unintended consequences for userspace if not used carefully.
>>>> >
>>>> >Since the introduction of DMA_BUF_IOCTL_EXPORT_SYNC_FILE and DMA_BUF_IOCTL_IMPORT_SYNC_FILE, many userspace window system components have grown the ability to convert between the Vulkan explicit sync model and the legacy implicit sync model used by X11 and Wayland in the past. This allows both old and new components to exist simultaneously and talk to each other. In particular, XWayland is able to convert between the two to let Vulkan apps work seamlessly with older X11 compositors that aren't aware of explicit synchronization. This is rapidly becoming the backbone of synchronization in the Linux window system space.
>>>> >
>>>> >Unfortunately, AMDGPU_GEM_CREATE_EXPLICIT_SYNC breaks this because it disables all implicit synchronization on the given BO, regardless of which userspace driver is rendering with it and regardless of the assumptions made by the client application. In particular, this is causing issues with RADV and radeonsi. RADV sets the flag to disable implicit sync on window system buffers so that it can safely have them resident at all times without causing internal over-synchronization. The BO is then handed off to a compositor which composites using radeonsi. If the compositor uses the EGL_ANDROID_native_fence_sync extension to pass explicit sync files through to radeonsi, then everything is fine. However, if that buffer ever gets handed to an instance of radeonsi with any assumption of implicit synchronization, radeonsi won't be able to sync on the BO because RADV disabled implicit sync on that BO system-wide. It doesn't matter whether some window system component used DMA_BUF_IOCTL_IMPORT_SYNC_FILE to set the appropriate fence on the BO; amdgpu will ignore it.
>>>> >
>>>> >This new flag disables implicit sync on the context rather than the BO. This way RADV can disable implicit sync (which is what RADV has always wanted) without affecting other components in the system. If RADV (or some other driver) wants implicit sync on some BO, it can use DMA_BUF_IOCTL_EXPORT_SYNC_FILE and DMA_BUF_IOCTL_IMPORT_SYNC_FILE to manually synchronize with other implicit-sync components. This is the approach we've taken with NVK/nouveau and ANV/xe, and it is similar to the approach taken by ANV/i915; it works well for those drivers.
>>>> >
>>>> >Ideally, I would like to see something like this back-ported to at least the kernel where DMA_BUF_IOCTL_IMPORT/EXPORT_SYNC_FILE were introduced so that we don't have to wait another year for the fix to reach users. However, I understand that back-porting UAPI is problematic and I'll leave that decision up to the amdgpu maintainers. Michel suggested that a new CTX_OP would make more sense if we want to back-port it, but the create flag made more sense to me from an API design PoV.
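A minimal userspace sketch of the sync-file bridge described in the commit message above (assuming uapi headers that expose DMA_BUF_IOCTL_EXPORT_SYNC_FILE and DMA_BUF_IOCTL_IMPORT_SYNC_FILE; the helper names are illustrative, not from the patch):

    #include <linux/dma-buf.h>
    #include <sys/ioctl.h>

    /* Before an explicit-sync driver accesses a shared BO: pull the implicit
     * fences out of the dma-buf as a sync_file fd, which can then be waited
     * on explicitly (e.g. imported into a Vulkan semaphore). Returns the
     * sync_file fd, or -1 on error. */
    static int export_implicit_fences(int dmabuf_fd)
    {
            struct dma_buf_export_sync_file export = {
                    .flags = DMA_BUF_SYNC_RW, /* fences a read/write access must wait on */
                    .fd = -1,
            };

            if (ioctl(dmabuf_fd, DMA_BUF_IOCTL_EXPORT_SYNC_FILE, &export) < 0)
                    return -1;
            return export.fd;
    }

    /* After submitting explicit-sync work that writes the BO: push the
     * resulting sync_file back into the dma-buf so implicit-sync consumers
     * (e.g. a radeonsi compositor) will wait on it. Returns 0 on success. */
    static int import_write_fence(int dmabuf_fd, int sync_file_fd)
    {
            struct dma_buf_import_sync_file import = {
                    .flags = DMA_BUF_SYNC_WRITE,
                    .fd = sync_file_fd,
            };

            return ioctl(dmabuf_fd, DMA_BUF_IOCTL_IMPORT_SYNC_FILE, &import);
    }

A driver would do the export before reading or writing an implicit-sync window-system BO and the import after its last write to it; purely explicit-sync paths need neither.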
>>>> >
>>>> >Signed-off-by: Faith Ekstrand <faith.ekstr...@collabora.com>
>>>> >Cc: Alex Deucher <alexander.deuc...@amd.com>
>>>> >Cc: Christian König <christian.koe...@amd.com>
>>>> >Cc: David Airlie <airl...@gmail.com>
>>>> >Cc: Michel Dänzer <mdaen...@redhat.com>
>>>> >Cc: Bas Nieuwenhuizen <b...@basnieuwenhuizen.nl>
>>>> >---
>>>> > drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c  |  3 ++-
>>>> > drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c | 12 ++++++++----
>>>> > drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.h |  7 +++++++
>>>> > include/uapi/drm/amdgpu_drm.h           | 12 +++++++++++-
>>>> > 4 files changed, 28 insertions(+), 6 deletions(-)
>>>> >
>>>> >diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
>>>> >index ec888fc6ead8..8410b4426541 100644
>>>> >--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
>>>> >+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
>>>> >@@ -1196,7 +1196,8 @@ static int amdgpu_cs_sync_rings(struct amdgpu_cs_parser *p)
>>>> > 		struct dma_resv *resv = bo->tbo.base.resv;
>>>> > 		enum amdgpu_sync_mode sync_mode;
>>>> > 
>>>> >-		sync_mode = amdgpu_bo_explicit_sync(bo) ?
>>>> >+		sync_mode = (amdgpu_ctx_explicit_sync(p->ctx) ||
>>>> >+			     amdgpu_bo_explicit_sync(bo)) ?
>>>> > 			AMDGPU_SYNC_EXPLICIT : AMDGPU_SYNC_NE_OWNER;
>>>> > 		r = amdgpu_sync_resv(p->adev, &p->sync, resv, sync_mode,
>>>> > 				     &fpriv->vm);
>>>> >diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c
>>>> >index 5cb33ac99f70..a304740ccedf 100644
>>>> >--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c
>>>> >+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c
>>>> >@@ -318,7 +318,8 @@ static int amdgpu_ctx_get_stable_pstate(struct amdgpu_ctx *ctx,
>>>> > }
>>>> > 
>>>> > static int amdgpu_ctx_init(struct amdgpu_ctx_mgr *mgr, int32_t priority,
>>>> >-			   struct drm_file *filp, struct amdgpu_ctx *ctx)
>>>> >+			   uint32_t flags, struct drm_file *filp,
>>>> >+			   struct amdgpu_ctx *ctx)
>>>> > {
>>>> > 	struct amdgpu_fpriv *fpriv = filp->driver_priv;
>>>> > 	u32 current_stable_pstate;
>>>> >@@ -334,6 +335,7 @@ static int amdgpu_ctx_init(struct amdgpu_ctx_mgr *mgr, int32_t priority,
>>>> > 	ctx->mgr = mgr;
>>>> > 	spin_lock_init(&ctx->ring_lock);
>>>> > 
>>>> >+	ctx->flags = flags;
>>>> > 	ctx->reset_counter = atomic_read(&mgr->adev->gpu_reset_counter);
>>>> > 	ctx->reset_counter_query = ctx->reset_counter;
>>>> > 	ctx->generation = amdgpu_vm_generation(mgr->adev, &fpriv->vm);
>>>> >@@ -474,6 +476,7 @@ static int amdgpu_ctx_alloc(struct amdgpu_device *adev,
>>>> > 			    struct amdgpu_fpriv *fpriv,
>>>> > 			    struct drm_file *filp,
>>>> > 			    int32_t priority,
>>>> >+			    uint32_t flags,
>>>> > 			    uint32_t *id)
>>>> > {
>>>> > 	struct amdgpu_ctx_mgr *mgr = &fpriv->ctx_mgr;
>>>> >@@ -493,7 +496,7 @@ static int amdgpu_ctx_alloc(struct amdgpu_device *adev,
>>>> > 	}
>>>> > 
>>>> > 	*id = (uint32_t)r;
>>>> >-	r = amdgpu_ctx_init(mgr, priority, filp, ctx);
>>>> >+	r = amdgpu_ctx_init(mgr, priority, flags, filp, ctx);
>>>> > 	if (r) {
>>>> > 		idr_remove(&mgr->ctx_handles, *id);
>>>> > 		*id = 0;
>>>> >@@ -666,7 +669,7 @@ int amdgpu_ctx_ioctl(struct drm_device *dev, void *data,
>>>> > 		     struct drm_file *filp)
>>>> > {
>>>> > 	int r;
>>>> >-	uint32_t id, stable_pstate;
>>>> >+	uint32_t id, stable_pstate, flags;
>>>> > 	int32_t priority;
>>>> > 
>>>> > 	union drm_amdgpu_ctx *args = data;
>>>> >@@ -675,6 +678,7 @@ int amdgpu_ctx_ioctl(struct drm_device *dev, void *data,
>>>> > 
>>>> > 	id = args->in.ctx_id;
>>>> > 	priority = args->in.priority;
>>>> >+	flags = args->in.flags;
>>>> > 
>>>> > 	/* For backwards compatibility, we need to accept ioctls with garbage
>>>> > 	 * in the priority field. Garbage values in the priority field, result
>>>> >@@ -685,7 +689,7 @@ int amdgpu_ctx_ioctl(struct drm_device *dev, void *data,
>>>> > 
>>>> > 	switch (args->in.op) {
>>>> > 	case AMDGPU_CTX_OP_ALLOC_CTX:
>>>> >-		r = amdgpu_ctx_alloc(adev, fpriv, filp, priority, &id);
>>>> >+		r = amdgpu_ctx_alloc(adev, fpriv, filp, priority, flags, &id);
>>>> > 		args->out.alloc.ctx_id = id;
>>>> > 		break;
>>>> > 	case AMDGPU_CTX_OP_FREE_CTX:
>>>> >diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.h
>>>> >index 85376baaa92f..9431c8d2ea59 100644
>>>> >--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.h
>>>> >+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.h
>>>> >@@ -45,6 +45,7 @@ struct amdgpu_ctx_entity {
>>>> > struct amdgpu_ctx {
>>>> > 	struct kref		refcount;
>>>> > 	struct amdgpu_ctx_mgr	*mgr;
>>>> >+	uint32_t		flags;
>>>> > 	unsigned		reset_counter;
>>>> > 	unsigned		reset_counter_query;
>>>> > 	uint64_t		generation;
>>>> >@@ -84,6 +85,12 @@ struct dma_fence *amdgpu_ctx_get_fence(struct amdgpu_ctx *ctx,
>>>> > bool amdgpu_ctx_priority_is_valid(int32_t ctx_prio);
>>>> > void amdgpu_ctx_priority_override(struct amdgpu_ctx *ctx, int32_t ctx_prio);
>>>> > 
>>>> >+static inline bool
>>>> >+amdgpu_ctx_explicit_sync(struct amdgpu_ctx *ctx)
>>>> >+{
>>>> >+	return ctx->flags & AMDGPU_CTX_ALLOC_FLAGS_EXPLICIT_SYNC;
>>>> >+}
>>>> >+
>>>> > int amdgpu_ctx_ioctl(struct drm_device *dev, void *data,
>>>> > 		     struct drm_file *filp);
>>>> > 
>>>> >diff --git a/include/uapi/drm/amdgpu_drm.h b/include/uapi/drm/amdgpu_drm.h
>>>> >index 96e32dafd4f0..e9d87a6e3d86 100644
>>>> >--- a/include/uapi/drm/amdgpu_drm.h
>>>> >+++ b/include/uapi/drm/amdgpu_drm.h
>>>> >@@ -125,7 +125,14 @@ extern "C" {
>>>> > #define AMDGPU_GEM_CREATE_VRAM_CONTIGUOUS	(1 << 5)
>>>> > /* Flag that BO is always valid in this VM */
>>>> > #define AMDGPU_GEM_CREATE_VM_ALWAYS_VALID	(1 << 6)
>>>> >-/* Flag that BO sharing will be explicitly synchronized */
>>>> >+/* Flag that BO sharing will be explicitly synchronized
>>>> >+ *
>>>> >+ * This flag should not be used unless the client can guarantee that no
>>>> >+ * other driver which ever touches this BO will ever want to use implicit
>>>> >+ * synchronization as it disables implicit sync on this BO system-wide.
>>>> >+ * Instead, drivers which use an explicit synchronization model should
>>>> >+ * prefer AMDGPU_CTX_ALLOC_FLAGS_EXPLICIT_SYNC.
>>>> >+ */
>>>> > #define AMDGPU_GEM_CREATE_EXPLICIT_SYNC		(1 << 7)
>>>> > /* Flag that indicates allocating MQD gart on GFX9, where the mtype
>>>> >  * for the second page onward should be set to NC. It should never
>>>> >@@ -240,6 +247,9 @@ union drm_amdgpu_bo_list {
>>>> > #define AMDGPU_CTX_OP_GET_STABLE_PSTATE	5
>>>> > #define AMDGPU_CTX_OP_SET_STABLE_PSTATE	6
>>>> > 
>>>> >+/* indicate that all synchronization will be explicit */
>>>> >+#define AMDGPU_CTX_ALLOC_FLAGS_EXPLICIT_SYNC	(1<<0)
>>>> >+
>>>> > /* GPU reset status */
>>>> > #define AMDGPU_CTX_NO_RESET		0
>>>> > /* this the context caused it */
>>>
>>
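For completeness, a rough sketch of how userspace (e.g. RADV) might allocate a context with the proposed flag via libdrm, assuming the UAPI above lands in this form; the helper name and the include paths are illustrative only:

    #include <string.h>
    #include <stdint.h>
    #include <xf86drm.h>
    #include <amdgpu_drm.h> /* uapi header; would provide AMDGPU_CTX_ALLOC_FLAGS_EXPLICIT_SYNC once merged */

    /* Allocate an amdgpu context whose submissions do not wait on implicit
     * dma-resv fences; other contexts and drivers sharing the same BOs keep
     * their implicit-sync behaviour. Returns 0 and stores the context id in
     * *ctx_id on success. */
    static int alloc_explicit_sync_ctx(int drm_fd, uint32_t *ctx_id)
    {
            union drm_amdgpu_ctx args;
            int r;

            memset(&args, 0, sizeof(args));
            args.in.op = AMDGPU_CTX_OP_ALLOC_CTX;
            args.in.flags = AMDGPU_CTX_ALLOC_FLAGS_EXPLICIT_SYNC;
            args.in.priority = AMDGPU_CTX_PRIORITY_NORMAL;

            r = drmCommandWriteRead(drm_fd, DRM_AMDGPU_CTX, &args, sizeof(args));
            if (r)
                    return r;

            *ctx_id = args.out.alloc.ctx_id;
            return 0;
    }

On kernels without the flag, args.in.flags would previously have been ignored, so a userspace driver would still need a way (e.g. a version or feature query) to know whether the context-level opt-out actually took effect.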