Hi Alan, Inline:
On 2023-03-30 06:48, Christian König wrote: > Am 30.03.23 um 11:15 schrieb Liu, HaoPing (Alan): >> >> [AMD Official Use Only - General] >> >> >> >> Hi Christian, >> >> >> >> Thanks for the review. Please see inline. >> >> >> >> Best Regards, >> >> Alan >> >> >> >> -----Original Message----- >> From: Christian König <ckoenig.leichtzumer...@gmail.com> >> Sent: Tuesday, March 28, 2023 7:16 PM >> To: Liu, HaoPing (Alan) <haoping....@amd.com>; amd-gfx@lists.freedesktop.org >> Cc: Lakha, Bhawanpreet <bhawanpreet.la...@amd.com> >> Subject: Re: [PATCH] drm/amdgpu: Fix desktop freezed after gpu-reset >> >> >> >> Am 27.03.23 um 17:20 schrieb Alan Liu: >> >> > [Why] >> >> > After gpu-reset, sometimes the driver would fail to enable vblank irq, >> >> > causing flip_done timed out and the desktop freezed. >> >> > >> >> > During gpu-reset, we will disable and enable vblank irq in >> >> > dm_suspend() and dm_resume(). Later on in >> >> > amdgpu_irq_gpu_reset_resume_helper(), we will check irqs' refcount and >> > decide to enable or disable the irqs again. >> >> > >> >> > However, we have 2 sets of API for controling vblank irq, one is >> >> > dm_vblank_get/put() and another is amdgpu_irq_get/put(). Each API has >> >> > its own refcount and flag to store the state of vblank irq, and they >> >> > are not synchronized. >> >> >> >> This is the source of the problem and you should address this instead. >> >> The change you suggested below would break in some use cases. >> >> >> >> In struct drm_vblank_crtc we have a vblank irq refcount controlled by drm >> layer, and in struct amdgpu_irq_src we have enabled_types as refcount for >> each irq controlled by the dm. >> >> I think the best solution will be to get rid of the refcount in drm and only >> maintain the dm one, and add a crtc function for the drm layer to get the >> refcount/state of vblank. >> >> But this may be dangerous since it’s a change in drm layer. Do you have any >> comments? >> > > You don't necessarily need to remove it completely, what you can do as well > is properly chaining of them. > > E.g. when the DRM counter goes from 0->1 or 1->0 it calls some function to > enable/disable the hw irq. In this situation you call > amdgpu_irq_get()/amdgpu_irq_put() as appropriate. > > The the code in this patch already looks like it goes into the right > direction regarding that. It just seems to be that you have some race issues > when you need to add checks that the IRQ counter doesn't goes below 0. Changing the DRM layer is generally not a good idea, unless there is a compelling reason to do so, like fixing a bug, or adding a new feature benefiting all drivers. As there are many drivers using DRM, any changes in DRM are vetted thoroughly and need a good reason to take place. I suggest you follow Christian's advice. Note that there's already a callback from drm_vblank_get() down to amdgpu_enable_vblank_kms() which calls amdgpu_irq_get(). Perhaps, you can leverage that. Similarly for the drm_vblank_put() to the amdgpu_vblank_put() path. > >> >> >> > >> >> > In drm we use the first API to control vblank irq but in >> >> > amdgpu_irq_gpu_reset_resume_helper() we use the second set of API. >> >> > >> >> > The failure happens when vblank irq was enabled by dm_vblank_get() >> >> > before gpu-reset, we have vblank->enabled true. However, during >> >> > gpu-reset, in amdgpu_irq_gpu_reset_resume_helper(), vblank irq's state >> >> > checked from >> >> > amdgpu_irq_update() is DISABLED. So finally it will disable vblank irq >> >> > again. After gpu-reset, if there is a cursor plane commit, the driver >> >> > will try to enable vblank irq by calling drm_vblank_enable(), but the >> >> > vblank->enabled is still true, so it fails to turn on vblank irq and >> >> > causes flip_done can't be completed in vblank irq handler and desktop >> >> > become freezed. >> >> > >> >> > [How] >> >> > Combining the 2 vblank control APIs by letting drm's API finally calls >> >> > amdgpu_irq's API, so the irq's refcount and state of both APIs can be >> >> > synchronized. Also add a check to prevent refcount from being less >> >> > then >> >> > 0 in amdgpu_irq_put(). >> >> > >> >> > Signed-off-by: Alan Liu <haoping....@amd.com <mailto:haoping....@amd.com>> >> >> > --- >> >> > drivers/gpu/drm/amd/amdgpu/amdgpu_irq.c | 3 +++ >> >> > .../gpu/drm/amd/display/amdgpu_dm/amdgpu_dm_crtc.c | 14 ++++++++++---- >> >> > 2 files changed, 13 insertions(+), 4 deletions(-) >> >> > >> >> > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_irq.c >> >> > b/drivers/gpu/drm/amd/amdgpu/amdgpu_irq.c >> >> > index a6aef488a822..1b66003657e2 100644 >> >> > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_irq.c >> >> > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_irq.c >> >> > @@ -597,6 +597,9 @@ int amdgpu_irq_put(struct amdgpu_device *adev, struct >> > amdgpu_irq_src *src, >> >> > if (!src->enabled_types || !src->funcs->set) >> >> > return -EINVAL; >> >> > >> >> > + if (!amdgpu_irq_enabled(adev, src, type)) >> >> > + return 0; >> >> > + >> >> >> >> That is racy and won't work. The intention of amdgpu_irq_update() is to >> always update the irq state, no matter what the status is. >> >> >> >> This is a change to amdgpu_irq_put() to prevent the refcount from being cut >> to less than 0. Does it break the intention of amdgpu_irq_update()? >> > > Yes, exactly that. The reference count can *never* be below 0 or you have a > bug in the caller. > > What you could do is to add a WARN_ON(!amdgpu_irq_enabled(adev, src, type)), > but just ignoring the call is an absolute no-go. > In addition to adding the WARN_ON() as Christian suggested, and noting that you cannot ignore the amdgpu_irq_put() call, perhaps investigate if you can allow the decrement to take place and then check for negative--atomic_t is an "int". If it is negative, then complain, dump the stack, set to 0 and return. -- Regards, Luben > Regards, > Christian. > > PS: Please don't use HTML formating when discussing on public mailing lists. > >> >> >> Regards, >> >> Christian. >> >> >> >> > if (atomic_dec_and_test(&src->enabled_types[type])) >> >> > return amdgpu_irq_update(adev, src, type); >> >> > >> >> > diff --git a/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm_crtc.c >> >> > b/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm_crtc.c >> >> > index dc4f37240beb..e04f846b0b19 100644 >> >> > --- a/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm_crtc.c >> >> > +++ b/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm_crtc.c >> >> > @@ -146,7 +146,7 @@ static void vblank_control_worker(struct >> >> > work_struct *work) >> >> > >> >> > static inline int dm_set_vblank(struct drm_crtc *crtc, bool enable) >> >> > { >> >> > - enum dc_irq_source irq_source; >> >> > + int irq_type; >> >> > struct amdgpu_crtc *acrtc = to_amdgpu_crtc(crtc); >> >> > struct amdgpu_device *adev = drm_to_adev(crtc->dev); >> >> > struct dm_crtc_state *acrtc_state = >> >to_dm_crtc_state(crtc->state); >> >> > @@ -169,10 +169,16 @@ static inline int dm_set_vblank(struct drm_crtc >> > *crtc, bool enable) >> >> > if (rc) >> >> > return rc; >> >> > >> >> > - irq_source = IRQ_TYPE_VBLANK + acrtc->otg_inst; >> >> > + irq_type = amdgpu_display_crtc_idx_to_irq_type(adev, >> >> > +acrtc->crtc_id); >> >> > + >> >> > + if (enable) >> >> > + rc = amdgpu_irq_get(adev, &adev->crtc_irq, >> > irq_type); >> >> > + >> >> > + else >> >> > + rc = amdgpu_irq_put(adev, &adev->crtc_irq, >> > irq_type); >> >> > >> >> > - if (!dc_interrupt_set(adev->dm.dc, irq_source, enable)) >> >> > - return -EBUSY; >> >> > + if (rc) >> >> > + return rc; >> >> > >> >> > skip: >> >> > if (amdgpu_in_reset(adev)) >> >> >> >