[AMD Official Use Only - General]
Hi Christian, Thanks for the review. Please see inline. Best Regards, Alan -----Original Message----- From: Christian König <ckoenig.leichtzumer...@gmail.com> Sent: Tuesday, March 28, 2023 7:16 PM To: Liu, HaoPing (Alan) <haoping....@amd.com>; amd-gfx@lists.freedesktop.org Cc: Lakha, Bhawanpreet <bhawanpreet.la...@amd.com> Subject: Re: [PATCH] drm/amdgpu: Fix desktop freezed after gpu-reset Am 27.03.23 um 17:20 schrieb Alan Liu: > [Why] > After gpu-reset, sometimes the driver would fail to enable vblank irq, > causing flip_done timed out and the desktop freezed. > > During gpu-reset, we will disable and enable vblank irq in > dm_suspend() and dm_resume(). Later on in > amdgpu_irq_gpu_reset_resume_helper(), we will check irqs' refcount and decide > to enable or disable the irqs again. > > However, we have 2 sets of API for controling vblank irq, one is > dm_vblank_get/put() and another is amdgpu_irq_get/put(). Each API has > its own refcount and flag to store the state of vblank irq, and they > are not synchronized. This is the source of the problem and you should address this instead. The change you suggested below would break in some use cases. In struct drm_vblank_crtc we have a vblank irq refcount controlled by drm layer, and in struct amdgpu_irq_src we have enabled_types as refcount for each irq controlled by the dm. I think the best solution will be to get rid of the refcount in drm and only maintain the dm one, and add a crtc function for the drm layer to get the refcount/state of vblank. But this may be dangerous since it's a change in drm layer. Do you have any comments? > > In drm we use the first API to control vblank irq but in > amdgpu_irq_gpu_reset_resume_helper() we use the second set of API. > > The failure happens when vblank irq was enabled by dm_vblank_get() > before gpu-reset, we have vblank->enabled true. However, during > gpu-reset, in amdgpu_irq_gpu_reset_resume_helper(), vblank irq's state > checked from > amdgpu_irq_update() is DISABLED. So finally it will disable vblank irq > again. After gpu-reset, if there is a cursor plane commit, the driver > will try to enable vblank irq by calling drm_vblank_enable(), but the > vblank->enabled is still true, so it fails to turn on vblank irq and > causes flip_done can't be completed in vblank irq handler and desktop > become freezed. > > [How] > Combining the 2 vblank control APIs by letting drm's API finally calls > amdgpu_irq's API, so the irq's refcount and state of both APIs can be > synchronized. Also add a check to prevent refcount from being less > then > 0 in amdgpu_irq_put(). > > Signed-off-by: Alan Liu <haoping....@amd.com<mailto:haoping....@amd.com>> > --- > drivers/gpu/drm/amd/amdgpu/amdgpu_irq.c | 3 +++ > .../gpu/drm/amd/display/amdgpu_dm/amdgpu_dm_crtc.c | 14 ++++++++++---- > 2 files changed, 13 insertions(+), 4 deletions(-) > > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_irq.c > b/drivers/gpu/drm/amd/amdgpu/amdgpu_irq.c > index a6aef488a822..1b66003657e2 100644 > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_irq.c > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_irq.c > @@ -597,6 +597,9 @@ int amdgpu_irq_put(struct amdgpu_device *adev, struct > amdgpu_irq_src *src, > if (!src->enabled_types || !src->funcs->set) > return -EINVAL; > > + if (!amdgpu_irq_enabled(adev, src, type)) > + return 0; > + That is racy and won't work. The intention of amdgpu_irq_update() is to always update the irq state, no matter what the status is. This is a change to amdgpu_irq_put() to prevent the refcount from being cut to less than 0. Does it break the intention of amdgpu_irq_update()? Regards, Christian. > if (atomic_dec_and_test(&src->enabled_types[type])) > return amdgpu_irq_update(adev, src, type); > > diff --git a/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm_crtc.c > b/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm_crtc.c > index dc4f37240beb..e04f846b0b19 100644 > --- a/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm_crtc.c > +++ b/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm_crtc.c > @@ -146,7 +146,7 @@ static void vblank_control_worker(struct > work_struct *work) > > static inline int dm_set_vblank(struct drm_crtc *crtc, bool enable) > { > - enum dc_irq_source irq_source; > + int irq_type; > struct amdgpu_crtc *acrtc = to_amdgpu_crtc(crtc); > struct amdgpu_device *adev = drm_to_adev(crtc->dev); > struct dm_crtc_state *acrtc_state = to_dm_crtc_state(crtc->state); > @@ -169,10 +169,16 @@ static inline int dm_set_vblank(struct drm_crtc *crtc, > bool enable) > if (rc) > return rc; > > - irq_source = IRQ_TYPE_VBLANK + acrtc->otg_inst; > + irq_type = amdgpu_display_crtc_idx_to_irq_type(adev, > +acrtc->crtc_id); > + > + if (enable) > + rc = amdgpu_irq_get(adev, &adev->crtc_irq, irq_type); > + > + else > + rc = amdgpu_irq_put(adev, &adev->crtc_irq, irq_type); > > - if (!dc_interrupt_set(adev->dm.dc, irq_source, enable)) > - return -EBUSY; > + if (rc) > + return rc; > > skip: > if (amdgpu_in_reset(adev))