[Public] You try to add one new interface to inform mes about the context flush after driver side finish process termination , from my understanding, mes already know the process context need to be purged after all the related queues been removed even without this notification. What do you expect mes to do about this context flush flag ? Mes should block this process context for next set_sched command? Mes can achieve this by ignore the set_sched command with trap disable parameter on an invalid process context .
Shaoyun.liu Get Outlook for iOS<https://aka.ms/o0ukef> ________________________________ From: Kim, Jonathan <jonathan....@amd.com> Sent: Tuesday, December 12, 2023 8:19:09 PM To: Liu, Shaoyun <shaoyun....@amd.com>; Huang, JinHuiEric <jinhuieric.hu...@amd.com>; amd-gfx@lists.freedesktop.org <amd-gfx@lists.freedesktop.org> Cc: Wong, Alice <shiwei.w...@amd.com>; Kuehling, Felix <felix.kuehl...@amd.com>; Kasiviswanathan, Harish <harish.kasiviswanat...@amd.com> Subject: RE: [PATCH] drm/amdkfd: fix mes set shader debugger process management [Public] > -----Original Message----- > From: Liu, Shaoyun <shaoyun....@amd.com> > Sent: Tuesday, December 12, 2023 7:08 PM > To: Kim, Jonathan <jonathan....@amd.com>; Huang, JinHuiEric > <jinhuieric.hu...@amd.com>; amd-gfx@lists.freedesktop.org > Cc: Wong, Alice <shiwei.w...@amd.com>; Kuehling, Felix > <felix.kuehl...@amd.com>; Kasiviswanathan, Harish > <harish.kasiviswanat...@amd.com> > Subject: RE: [PATCH] drm/amdkfd: fix mes set shader debugger process > management > > [Public] > > I see, so the problem is after process context , set_shader been called > with > disable parameter, do you know the reason why MES re-added the > process context into the list ? Because MES has no idea what disable means. All it knows is that without the flush flag, set_shader should update the necessary per-VMID (process) registers as requested by the driver, which requires persistent per-process HW settings so that potential future waves can inherit those settings i.e. ADD_QUEUE.skip_process_ctx_clear is set (why ADD_QUEUE auto clears the process context otherwise is another long story, basically an unsolvable MES cache bug problem). Common use case example: add_queue -> set_shader call either transiently stalls the SPI per-VMID or transiently dequeues the HWS per-VMID depending on the request settings -> fulfils the per-VMID register write updates -> resumes process queues so that potential waves on those queues inherit new debug settings. You can't do this kind of operation at the queue level alone. The problem that this patch solves (along with the MES FW upgrade) is an unfortunate quirk of having to operate between process (debug requests) and queue space (non-debug requests). Old HWS used to operate at the per-process level via MAP_PROCESS so it was a lot easier to balance debug versus non-debug requests back then (but it was also lot less efficient performance wise). Jon > > Shaoyun.liu > > -----Original Message----- > From: Kim, Jonathan <jonathan....@amd.com> > Sent: Tuesday, December 12, 2023 6:07 PM > To: Liu, Shaoyun <shaoyun....@amd.com>; Huang, JinHuiEric > <jinhuieric.hu...@amd.com>; amd-gfx@lists.freedesktop.org > Cc: Wong, Alice <shiwei.w...@amd.com>; Kuehling, Felix > <felix.kuehl...@amd.com>; Kasiviswanathan, Harish > <harish.kasiviswanat...@amd.com> > Subject: RE: [PATCH] drm/amdkfd: fix mes set shader debugger process > management > > [Public] > > > -----Original Message----- > > From: Liu, Shaoyun <shaoyun....@amd.com> > > Sent: Tuesday, December 12, 2023 5:44 PM > > To: Kim, Jonathan <jonathan....@amd.com>; Huang, JinHuiEric > > <jinhuieric.hu...@amd.com>; amd-gfx@lists.freedesktop.org > > Cc: Wong, Alice <shiwei.w...@amd.com>; Kuehling, Felix > > <felix.kuehl...@amd.com>; Kasiviswanathan, Harish > > <harish.kasiviswanat...@amd.com> > > Subject: RE: [PATCH] drm/amdkfd: fix mes set shader debugger process > > management > > > > [Public] > > > > Do you mean SET_SHADER_DEBUGER can be called before ADD_QUEUE ? > I > > think even in that situation MES should still be able to handle it > > as long as MES already remove the process context from its list , MES > > will treat the process context as a new item. I still don't understand why > MES haven't > > purged the process context from the list after process termination . Will > > debug queue itself also use the add/remove queue interface and is > > it possible the debug queue itself from the old process still not be > > removed ? > > SET_SHADER_DEBUGGER can be called independently from ADD_QUEUE. > The process list is updated on either on SET_SHADER_DEBUGGER or > ADD_QUEUE. > e.g. runtime_enable (set_shader) -> add_queue -> remove_queue (list > purged) -> runtime_disable (set_shader process re-added) -> process > termination (stale list) or debug attach (set_shader) -> add_queue -> > remove_queue (list purged) -> debug detach (set_shader process re-added) - > >process termination (stale list) > > MES has no idea what process termination means. The new flag is a proxy > for this. > There are reasons for process settings to take place prior to queue add > (debugger, gfx11 cwsr workaround, core dump etc need this). > > I'm not sure what kernel/debug queues have to do with this. > By that argument, the list should be purged. > > Jon > > > > > Shaoyun.liu > > > > -----Original Message----- > > From: Kim, Jonathan <jonathan....@amd.com> > > Sent: Tuesday, December 12, 2023 4:48 PM > > To: Liu, Shaoyun <shaoyun....@amd.com>; Huang, JinHuiEric > > <jinhuieric.hu...@amd.com>; amd-gfx@lists.freedesktop.org > > Cc: Wong, Alice <shiwei.w...@amd.com>; Kuehling, Felix > > <felix.kuehl...@amd.com>; Kasiviswanathan, Harish > > <harish.kasiviswanat...@amd.com> > > Subject: RE: [PATCH] drm/amdkfd: fix mes set shader debugger process > > management > > > > [Public] > > > > > -----Original Message----- > > > From: Liu, Shaoyun <shaoyun....@amd.com> > > > Sent: Tuesday, December 12, 2023 4:45 PM > > > To: Kim, Jonathan <jonathan....@amd.com>; Huang, JinHuiEric > > > <jinhuieric.hu...@amd.com>; amd-gfx@lists.freedesktop.org > > > Cc: Wong, Alice <shiwei.w...@amd.com>; Kuehling, Felix > > > <felix.kuehl...@amd.com>; Kasiviswanathan, Harish > > > <harish.kasiviswanat...@amd.com> > > > Subject: RE: [PATCH] drm/amdkfd: fix mes set shader debugger process > > > management > > > > > > [Public] > > > > > > Shouldn't the driver side remove all the remaining queues for the > > > process during process termination ? If all the queues been > > > removed for the process , MES should purge the process context > > > automatically , otherwise it's bug inside MES . > > > > That's only if there were queues added to begin with. > > > > Jon > > > > > > > > Regard > > > Sshaoyun.liu > > > > > > -----Original Message----- > > > From: Kim, Jonathan <jonathan....@amd.com> > > > Sent: Tuesday, December 12, 2023 4:33 PM > > > To: Liu, Shaoyun <shaoyun....@amd.com>; Huang, JinHuiEric > > > <jinhuieric.hu...@amd.com>; amd-gfx@lists.freedesktop.org > > > Cc: Wong, Alice <shiwei.w...@amd.com>; Kuehling, Felix > > > <felix.kuehl...@amd.com>; Kasiviswanathan, Harish > > > <harish.kasiviswanat...@amd.com> > > > Subject: RE: [PATCH] drm/amdkfd: fix mes set shader debugger process > > > management > > > > > > [Public] > > > > > > > -----Original Message----- > > > > From: Liu, Shaoyun <shaoyun....@amd.com> > > > > Sent: Tuesday, December 12, 2023 4:00 PM > > > > To: Huang, JinHuiEric <jinhuieric.hu...@amd.com>; Kim, Jonathan > > > > <jonathan....@amd.com>; amd-gfx@lists.freedesktop.org > > > > Cc: Wong, Alice <shiwei.w...@amd.com>; Kuehling, Felix > > > > <felix.kuehl...@amd.com>; Kasiviswanathan, Harish > > > > <harish.kasiviswanat...@amd.com> > > > > Subject: RE: [PATCH] drm/amdkfd: fix mes set shader debugger > > > > process management > > > > > > > > [AMD Official Use Only - General] > > > > > > > > Does this requires the new MES FW for this process_ctx_flush > > > > requirement ? Can driver side add logic to guaranty when call > > > > SET_SHADER_DEBUGGER, the process address is always valid ? > > > > > > Call to flush on old fw is a NOP so it's harmless in that case. > > > Full solution will still require a new MES version as this is a > > > workaround on corner cases and not a new feature i.e. we can't stop > > > ROCm from running on old fw. > > > The process address is always valid from the driver side. It's the > > > MES side of things that gets stale as mentioned in the description > > > (passed value to MES is reused with new BO but MES doesn't refresh). > > > i.e. MES auto refreshes it's process list assuming process queues > > > were all drained but driver can't guarantee that > SET_SHADER_DEBUGGER > > (which > > > adds to MES's process list) will get called after queues get added > > > (in fact it's a requirements that it can be called at any time). > > > We can attempt to defer calls these calls in the KFD, considering all > cases. > > > But that would be a large shift in debugger/runtime_enable/KFD code, > > > which is already complicated and could get buggy plus it would not > > > be intuitive at all as to why we're doing this. > > > I think a single flag set to flush MES on process termination is a > > > simpler compromise that shows the limitation in a more obvious way. > > > > > > Thanks, > > > > > > Jon > > > > > > > > > > > > > > Regards > > > > Shaoyun.liu > > > > > > > > > > > > -----Original Message----- > > > > From: amd-gfx <amd-gfx-boun...@lists.freedesktop.org> On Behalf Of > > > > Eric Huang > > > > Sent: Tuesday, December 12, 2023 12:49 PM > > > > To: Kim, Jonathan <jonathan....@amd.com>; amd- > > > > g...@lists.freedesktop.org > > > > Cc: Wong, Alice <shiwei.w...@amd.com>; Kuehling, Felix > > > > <felix.kuehl...@amd.com>; Kasiviswanathan, Harish > > > > <harish.kasiviswanat...@amd.com> > > > > Subject: Re: [PATCH] drm/amdkfd: fix mes set shader debugger > > > > process management > > > > > > > > > > > > On 2023-12-11 16:16, Jonathan Kim wrote: > > > > > MES provides the driver a call to explicitly flush stale process > > > > > memory within the MES to avoid a race condition that results in > > > > > a fatal memory violation. > > > > > > > > > > When SET_SHADER_DEBUGGER is called, the driver passes a memory > > > > address > > > > > that represents a process context address MES uses to keep track > > > > > of future per-process calls. > > > > > > > > > > Normally, MES will purge its process context list when the last > > > > > queue has been removed. The driver, however, can call > > > > > SET_SHADER_DEBUGGER regardless of whether a queue has been > > added > > > or not. > > > > > > > > > > If SET_SHADER_DEBUGGER has been called with no queues as the > > > > > last call prior to process termination, the passed process > > > > > context address will still reside within MES. > > > > > > > > > > On a new process call to SET_SHADER_DEBUGGER, the driver may > end > > > up > > > > > passing an identical process context address value (based on > > > > > per-process gpu memory address) to MES but is now pointing to a > > > > > new allocated buffer object during KFD process creation. Since > > > > > the MES is unaware of this, access of the passed address points > > > > > to the stale object within MES and triggers a fatal memory violation. > > > > > > > > > > The solution is for KFD to explicitly flush the process context > > > > > address from MES on process termination. > > > > > > > > > > Note that the flush call and the MES debugger calls use the same > > > > > MES interface but are separated as KFD calls to avoid > > > > > conflicting with each other. > > > > > > > > > > Signed-off-by: Jonathan Kim <jonathan....@amd.com> > > > > > Tested-by: Alice Wong <shiwei.w...@amd.com> > > > > Reviewed-by: Eric Huang <jinhuieric.hu...@amd.com> > > > > > --- > > > > > drivers/gpu/drm/amd/amdgpu/amdgpu_mes.c | 31 > > > > +++++++++++++++++++ > > > > > drivers/gpu/drm/amd/amdgpu/amdgpu_mes.h | 10 +++--- > > > > > .../amd/amdkfd/kfd_process_queue_manager.c | 1 + > > > > > drivers/gpu/drm/amd/include/mes_v11_api_def.h | 3 +- > > > > > 4 files changed, 40 insertions(+), 5 deletions(-) > > > > > > > > > > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_mes.c > > > > > b/drivers/gpu/drm/amd/amdgpu/amdgpu_mes.c > > > > > index e544b823abf6..e98de23250dc 100644 > > > > > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_mes.c > > > > > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_mes.c > > > > > @@ -916,6 +916,11 @@ int > > amdgpu_mes_set_shader_debugger(struct > > > > amdgpu_device *adev, > > > > > op_input.op = MES_MISC_OP_SET_SHADER_DEBUGGER; > > > > > op_input.set_shader_debugger.process_context_addr = > > > > process_context_addr; > > > > > op_input.set_shader_debugger.flags.u32all = flags; > > > > > + > > > > > + /* use amdgpu mes_flush_shader_debugger instead */ > > > > > + if (op_input.set_shader_debugger.flags.process_ctx_flush) > > > > > + return -EINVAL; > > > > > + > > > > > op_input.set_shader_debugger.spi_gdbg_per_vmid_cntl = > > > > spi_gdbg_per_vmid_cntl; > > > > > memcpy(op_input.set_shader_debugger.tcp_watch_cntl, > > > > tcp_watch_cntl, > > > > > > > > > > sizeof(op_input.set_shader_debugger.tcp_watch_cntl)); > > > > > @@ -935,6 +940,32 @@ int > > amdgpu_mes_set_shader_debugger(struct > > > > amdgpu_device *adev, > > > > > return r; > > > > > } > > > > > > > > > > +int amdgpu_mes_flush_shader_debugger(struct amdgpu_device > > *adev, > > > > > + uint64_t process_context_addr) { > > > > > + struct mes_misc_op_input op_input = {0}; > > > > > + int r; > > > > > + > > > > > + if (!adev->mes.funcs->misc_op) { > > > > > + DRM_ERROR("mes flush shader debugger is not > > supported!\n"); > > > > > + return -EINVAL; > > > > > + } > > > > > + > > > > > + op_input.op = MES_MISC_OP_SET_SHADER_DEBUGGER; > > > > > + op_input.set_shader_debugger.process_context_addr = > > > > process_context_addr; > > > > > + op_input.set_shader_debugger.flags.process_ctx_flush = > > > > > + true; > > > > > + > > > > > + amdgpu_mes_lock(&adev->mes); > > > > > + > > > > > + r = adev->mes.funcs->misc_op(&adev->mes, &op_input); > > > > > + if (r) > > > > > + DRM_ERROR("failed to set_shader_debugger\n"); > > > > > + > > > > > + amdgpu_mes_unlock(&adev->mes); > > > > > + > > > > > + return r; > > > > > +} > > > > > + > > > > > static void > > > > > amdgpu_mes_ring_to_queue_props(struct amdgpu_device *adev, > > > > > struct amdgpu_ring *ring, diff > > > > > --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_mes.h > > > > > b/drivers/gpu/drm/amd/amdgpu/amdgpu_mes.h > > > > > index 894b9b133000..7d4f93fea937 100644 > > > > > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_mes.h > > > > > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_mes.h > > > > > @@ -296,9 +296,10 @@ struct mes_misc_op_input { > > > > > uint64_t process_context_addr; > > > > > union { > > > > > struct { > > > > > - uint64_t single_memop : 1; > > > > > - uint64_t single_alu_op : 1; > > > > > - uint64_t reserved: 30; > > > > > + uint32_t single_memop : 1; > > > > > + uint32_t single_alu_op : 1; > > > > > + uint32_t reserved: 29; > > > > > + uint32_t process_ctx_flush: > > > > > + 1; > > > > > }; > > > > > uint32_t u32all; > > > > > } flags; > > > > > @@ -374,7 +375,8 @@ int > amdgpu_mes_set_shader_debugger(struct > > > > amdgpu_device *adev, > > > > > const uint32_t *tcp_watch_cntl, > > > > > uint32_t flags, > > > > > bool trap_en); > > > > > - > > > > > +int amdgpu_mes_flush_shader_debugger(struct amdgpu_device > > *adev, > > > > > + uint64_t process_context_addr); > > > > > int amdgpu_mes_add_ring(struct amdgpu_device *adev, int > gang_id, > > > > > int queue_type, int idx, > > > > > struct amdgpu_mes_ctx_data *ctx_data, diff > > > > > --git a/drivers/gpu/drm/amd/amdkfd/kfd_process_queue_manager.c > > > > > b/drivers/gpu/drm/amd/amdkfd/kfd_process_queue_manager.c > > > > > index 77f493262e05..8e55e78fce4e 100644 > > > > > --- a/drivers/gpu/drm/amd/amdkfd/kfd_process_queue_manager.c > > > > > +++ b/drivers/gpu/drm/amd/amdkfd/kfd_process_queue_manager.c > > > > > @@ -87,6 +87,7 @@ void kfd_process_dequeue_from_device(struct > > > > kfd_process_device *pdd) > > > > > return; > > > > > > > > > > dev->dqm->ops.process_termination(dev->dqm, &pdd->qpd); > > > > > + amdgpu_mes_flush_shader_debugger(dev->adev, pdd- > > > > >proc_ctx_gpu_addr); > > > > > pdd->already_dequeued = true; > > > > > } > > > > > > > > > > diff --git a/drivers/gpu/drm/amd/include/mes_v11_api_def.h > > > > > b/drivers/gpu/drm/amd/include/mes_v11_api_def.h > > > > > index 1fbfd1aa987e..ec5b9ab67c5e 100644 > > > > > --- a/drivers/gpu/drm/amd/include/mes_v11_api_def.h > > > > > +++ b/drivers/gpu/drm/amd/include/mes_v11_api_def.h > > > > > @@ -572,7 +572,8 @@ struct SET_SHADER_DEBUGGER { > > > > > struct { > > > > > uint32_t single_memop : 1; /* > > > > > SQ_DEBUG.single_memop > > > */ > > > > > uint32_t single_alu_op : 1; /* > > > > > SQ_DEBUG.single_alu_op > */ > > > > > - uint32_t reserved : 30; > > > > > + uint32_t reserved : 29; > > > > > + uint32_t process_ctx_flush : 1; > > > > > }; > > > > > uint32_t u32all; > > > > > } flags; > > > > > > > > > > > > > > > >