On Wed, Apr 17, 2024 at 3:17 PM Liu, Shaoyun <shaoyun....@amd.com> wrote:
>
> [AMD Official Use Only - General]
>
> I had a discussion with Christian about this before. The conclusion is
> that the driver should prevent multiple processes from using the MES ring
> at the same time. Also, for the current MES ring usage, the driver doesn't
> have the logic to prevent the ring from being overflowed, and we don't hit
> that issue because MES waits polling for each MES submission. If we want
> to change the MES to work asynchronously, we need to consider a way to
> avoid this (similar to the limit added in the fence handling we use for
> KIQ and HMM paging).
>
I think we need a separate fence (different GPU address and seq number)
per request. Then each caller can wait independently.

Alex

> Regards
> Shaoyun.liu
>
> -----Original Message-----
> From: amd-gfx <amd-gfx-boun...@lists.freedesktop.org> On Behalf Of Christian König
> Sent: Wednesday, April 17, 2024 8:49 AM
> To: Chen, Horace <horace.c...@amd.com>; amd-gfx@lists.freedesktop.org
> Cc: Andrey Grodzovsky <andrey.grodzov...@amd.com>; Kuehling, Felix <felix.kuehl...@amd.com>;
> Deucher, Alexander <alexander.deuc...@amd.com>; Xiao, Jack <jack.x...@amd.com>;
> Zhang, Hawking <hawking.zh...@amd.com>; Liu, Monk <monk....@amd.com>;
> Xu, Feifei <feifei...@amd.com>; Chang, HaiJun <haijun.ch...@amd.com>;
> Leo Liu <leo.l...@amd.com>; Liu, Jenny (Jing) <jenny-jing....@amd.com>
> Subject: Re: [PATCH 3/3] drm/amdgpu/mes11: make fence waits synchronous
>
> On 17.04.24 at 13:30, Horace Chen wrote:
> > The MES firmware expects synchronous operation with the driver. For
> > this to work asynchronously, each caller would need to provide its own
> > fence location and sequence number.
>
> Well, that's certainly not correct. The seqno takes care that we can
> wait asynchronously for the submission to complete.
>
> So a clear NAK for this patch here.
>
> Regards,
> Christian.
>
> >
> > For now, add a mutex lock to serialize the MES submission.
> > For the SR-IOV long-wait case, break the long wait into separate parts
> > to prevent this wait from impacting the reset sequence.
> >
> > Signed-off-by: Horace Chen <horace.c...@amd.com>
> > ---
> >  drivers/gpu/drm/amd/amdgpu/amdgpu_mes.c |  3 +++
> >  drivers/gpu/drm/amd/amdgpu/amdgpu_mes.h |  1 +
> >  drivers/gpu/drm/amd/amdgpu/mes_v11_0.c  | 18 ++++++++++++++----
> >  3 files changed, 18 insertions(+), 4 deletions(-)
> >
> > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_mes.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_mes.c
> > index 78e4f88f5134..8896be95b2c8 100644
> > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_mes.c
> > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_mes.c
> > @@ -137,6 +137,7 @@ int amdgpu_mes_init(struct amdgpu_device *adev)
> >  	spin_lock_init(&adev->mes.queue_id_lock);
> >  	spin_lock_init(&adev->mes.ring_lock);
> >  	mutex_init(&adev->mes.mutex_hidden);
> > +	mutex_init(&adev->mes.submission_lock);
> >
> >  	adev->mes.total_max_queue = AMDGPU_FENCE_MES_QUEUE_ID_MASK;
> >  	adev->mes.vmid_mask_mmhub = 0xffffff00;
> > @@ -221,6 +222,7 @@ int amdgpu_mes_init(struct amdgpu_device *adev)
> >  	idr_destroy(&adev->mes.queue_id_idr);
> >  	ida_destroy(&adev->mes.doorbell_ida);
> >  	mutex_destroy(&adev->mes.mutex_hidden);
> > +	mutex_destroy(&adev->mes.submission_lock);
> >  	return r;
> >  }
> >
> > @@ -240,6 +242,7 @@ void amdgpu_mes_fini(struct amdgpu_device *adev)
> >  	idr_destroy(&adev->mes.queue_id_idr);
> >  	ida_destroy(&adev->mes.doorbell_ida);
> >  	mutex_destroy(&adev->mes.mutex_hidden);
> > +	mutex_destroy(&adev->mes.submission_lock);
> >  }
> >
> >  static void amdgpu_mes_queue_free_mqd(struct amdgpu_mes_queue *q)
> > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_mes.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_mes.h
> > index 6b3e1844eac5..90af935cc889 100644
> > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_mes.h
> > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_mes.h
> > @@ -85,6 +85,7 @@ struct amdgpu_mes {
> >
> >  	struct amdgpu_ring              ring;
> >  	spinlock_t                      ring_lock;
> > +	struct mutex                    submission_lock;
> >
> >  	const struct firmware           *fw[AMDGPU_MAX_MES_PIPES];
> >
> > diff --git a/drivers/gpu/drm/amd/amdgpu/mes_v11_0.c b/drivers/gpu/drm/amd/amdgpu/mes_v11_0.c
> > index e40d00afd4f5..0a609a5b8835 100644
> > --- a/drivers/gpu/drm/amd/amdgpu/mes_v11_0.c
> > +++ b/drivers/gpu/drm/amd/amdgpu/mes_v11_0.c
> > @@ -162,6 +162,7 @@ static int mes_v11_0_submit_pkt_and_poll_completion(struct amdgpu_mes *mes,
> >  	struct amdgpu_ring *ring = &mes->ring;
> >  	unsigned long flags;
> >  	signed long timeout = adev->usec_timeout;
> > +	signed long retry_count = 1;
> >  	const char *op_str, *misc_op_str;
> >
> >  	if (x_pkt->header.opcode >= MES_SCH_API_MAX)
> > @@ -169,15 +170,19 @@ static int mes_v11_0_submit_pkt_and_poll_completion(struct amdgpu_mes *mes,
> >
> >  	if (amdgpu_emu_mode) {
> >  		timeout *= 100;
> > -	} else if (amdgpu_sriov_vf(adev)) {
> > +	}
> > +
> > +	if (amdgpu_sriov_vf(adev) && timeout > 0) {
> >  		/* Worst case in sriov where all other 15 VF timeout, each VF needs about 600ms */
> > -		timeout = 15 * 600 * 1000;
> > +		retry_count = (15 * 600 * 1000) / timeout;
> >  	}
> >  	BUG_ON(size % 4 != 0);
> >
> > +	mutex_lock(&mes->submission_lock);
> >  	spin_lock_irqsave(&mes->ring_lock, flags);
> >  	if (amdgpu_ring_alloc(ring, ndw)) {
> >  		spin_unlock_irqrestore(&mes->ring_lock, flags);
> > +		mutex_unlock(&mes->submission_lock);
> >  		return -ENOMEM;
> >  	}
> >
> > @@ -199,8 +204,13 @@ static int mes_v11_0_submit_pkt_and_poll_completion(struct amdgpu_mes *mes,
> >  	else
> >  		dev_dbg(adev->dev, "MES msg=%d was emitted\n", x_pkt->header.opcode);
> >
> > -	r = amdgpu_fence_wait_polling(ring, ring->fence_drv.sync_seq,
> > -				      timeout);
> > +	do {
> > +		r = amdgpu_fence_wait_polling(ring, ring->fence_drv.sync_seq,
> > +					      timeout);
> > +		retry_count--;
> > +	} while (retry_count > 0 && !amdgpu_in_reset(adev));
> > +
> > +	mutex_unlock(&mes->submission_lock);
> >  	if (r < 1) {
> >
> >  	if (misc_op_str)
>