Hi Alan,

Inline:

On 2023-03-30 06:48, Christian König wrote:
> Am 30.03.23 um 11:15 schrieb Liu, HaoPing (Alan):
>>
>> [AMD Official Use Only - General]
>>
>>  
>>
>> Hi Christian,
>>
>>  
>>
>> Thanks for the review. Please see inline.
>>
>>  
>>
>> Best Regards,
>>
>> Alan
>>
>>  
>>
>> -----Original Message-----
>> From: Christian König <ckoenig.leichtzumer...@gmail.com>
>> Sent: Tuesday, March 28, 2023 7:16 PM
>> To: Liu, HaoPing (Alan) <haoping....@amd.com>; amd-gfx@lists.freedesktop.org
>> Cc: Lakha, Bhawanpreet <bhawanpreet.la...@amd.com>
>> Subject: Re: [PATCH] drm/amdgpu: Fix desktop freezed after gpu-reset
>>
>>  
>>
>> Am 27.03.23 um 17:20 schrieb Alan Liu:
>>
>> > [Why]
>>
>> > After gpu-reset, sometimes the driver would fail to enable vblank irq,
>>
>> > causing flip_done timed out and the desktop freezed.
>>
>> > 
>>
>> > During gpu-reset, we will disable and enable vblank irq in
>>
>> > dm_suspend() and dm_resume(). Later on in
>>
>> > amdgpu_irq_gpu_reset_resume_helper(), we will check irqs' refcount and 
>> > decide to enable or disable the irqs again.
>>
>> > 
>>
>> > However, we have 2 sets of API for controling vblank irq, one is
>>
>> > dm_vblank_get/put() and another is amdgpu_irq_get/put(). Each API has
>>
>> > its own refcount and flag to store the state of vblank irq, and they
>>
>> > are not synchronized.
>>
>>  
>>
>> This is the source of the problem and you should address this instead.
>>
>> The change you suggested below would break in some use cases.
>>
>>  
>>
>> In struct drm_vblank_crtc we have a vblank irq refcount controlled by drm 
>> layer, and in struct amdgpu_irq_src we have enabled_types as refcount for 
>> each irq controlled by the dm.
>>
>> I think the best solution will be to get rid of the refcount in drm and only 
>> maintain the dm one, and add a crtc function for the drm layer to get the 
>> refcount/state of vblank.
>>
>> But this may be dangerous since it’s a change in drm layer. Do you have any 
>> comments?
>>
> 
> You don't necessarily need to remove it completely, what you can do as well 
> is properly chaining of them.
> 
> E.g. when the DRM counter goes from 0->1 or 1->0 it calls some function to 
> enable/disable the hw irq. In this situation you call 
> amdgpu_irq_get()/amdgpu_irq_put() as appropriate.
> 
> The the code in this patch already looks like it goes into the right 
> direction regarding that. It just seems to be that you have some race issues 
> when you need to add checks that the IRQ counter doesn't goes below 0.

Changing the DRM layer is generally not a good idea, unless
there is a compelling reason to do so, like fixing a bug, or adding
a new feature benefiting all drivers. As there are many drivers using
DRM, any changes in DRM are vetted thoroughly and need a good reason to
take place.

I suggest you follow Christian's advice.

Note that there's already a callback from drm_vblank_get() down
to amdgpu_enable_vblank_kms() which calls amdgpu_irq_get(). Perhaps,
you can leverage that. Similarly for the drm_vblank_put() to
the amdgpu_vblank_put() path.

> 
>>  
>>
>> > 
>>
>> > In drm we use the first API to control vblank irq but in
>>
>> > amdgpu_irq_gpu_reset_resume_helper() we use the second set of API.
>>
>> > 
>>
>> > The failure happens when vblank irq was enabled by dm_vblank_get()
>>
>> > before gpu-reset, we have vblank->enabled true. However, during
>>
>> > gpu-reset, in amdgpu_irq_gpu_reset_resume_helper(), vblank irq's state
>>
>> > checked from
>>
>> > amdgpu_irq_update() is DISABLED. So finally it will disable vblank irq
>>
>> > again. After gpu-reset, if there is a cursor plane commit, the driver
>>
>> > will try to enable vblank irq by calling drm_vblank_enable(), but the
>>
>> > vblank->enabled is still true, so it fails to turn on vblank irq and
>>
>> > causes flip_done can't be completed in vblank irq handler and desktop
>>
>> > become freezed.
>>
>> > 
>>
>> > [How]
>>
>> > Combining the 2 vblank control APIs by letting drm's API finally calls
>>
>> > amdgpu_irq's API, so the irq's refcount and state of both APIs can be
>>
>> > synchronized. Also add a check to prevent refcount from being less
>>
>> > then
>>
>> > 0 in amdgpu_irq_put().
>>
>> > 
>>
>> > Signed-off-by: Alan Liu <haoping....@amd.com <mailto:haoping....@amd.com>>
>>
>> > ---
>>
>> >   drivers/gpu/drm/amd/amdgpu/amdgpu_irq.c            |  3 +++
>>
>> >   .../gpu/drm/amd/display/amdgpu_dm/amdgpu_dm_crtc.c | 14 ++++++++++----
>>
>> >   2 files changed, 13 insertions(+), 4 deletions(-)
>>
>> > 
>>
>> > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_irq.c
>>
>> > b/drivers/gpu/drm/amd/amdgpu/amdgpu_irq.c
>>
>> > index a6aef488a822..1b66003657e2 100644
>>
>> > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_irq.c
>>
>> > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_irq.c
>>
>> > @@ -597,6 +597,9 @@ int amdgpu_irq_put(struct amdgpu_device *adev, struct 
>> > amdgpu_irq_src *src,
>>
>> >            if (!src->enabled_types || !src->funcs->set)
>>
>> >                           return -EINVAL;
>>
>> >  
>>
>> > +         if (!amdgpu_irq_enabled(adev, src, type))
>>
>> > +                       return 0;
>>
>> > +
>>
>>  
>>
>> That is racy and won't work. The intention of amdgpu_irq_update() is to 
>> always update the irq state, no matter what the status is.
>>
>>  
>>
>> This is a change to amdgpu_irq_put() to prevent the refcount from being cut 
>> to less than 0. Does it break the intention of amdgpu_irq_update()?
>>
> 
> Yes, exactly that. The reference count can *never* be below 0 or you have a 
> bug in the caller.
> 
> What you could do is to add a WARN_ON(!amdgpu_irq_enabled(adev, src, type)), 
> but just ignoring the call is an absolute no-go.
> 

In addition to adding the WARN_ON() as Christian suggested, and noting that
you cannot ignore the amdgpu_irq_put() call, perhaps investigate if you can
allow the decrement to take place and then check for negative--atomic_t is
an "int". If it is negative, then complain, dump the stack, set to 0 and return.
-- 
Regards,
Luben


> Regards,
> Christian.
> 
> PS: Please don't use HTML formating when discussing on public mailing lists.
> 
>>  
>>
>> Regards,
>>
>> Christian.
>>
>>  
>>
>> >            if (atomic_dec_and_test(&src->enabled_types[type]))
>>
>> >                           return amdgpu_irq_update(adev, src, type);
>>
>> >  
>>
>> > diff --git a/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm_crtc.c
>>
>> > b/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm_crtc.c
>>
>> > index dc4f37240beb..e04f846b0b19 100644
>>
>> > --- a/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm_crtc.c
>>
>> > +++ b/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm_crtc.c
>>
>> > @@ -146,7 +146,7 @@ static void vblank_control_worker(struct
>>
>> > work_struct *work)
>>
>> >  
>>
>> >   static inline int dm_set_vblank(struct drm_crtc *crtc, bool enable)
>>
>> >   {
>>
>> > -          enum dc_irq_source irq_source;
>>
>> > +         int irq_type;
>>
>> >            struct amdgpu_crtc *acrtc = to_amdgpu_crtc(crtc);
>>
>> >            struct amdgpu_device *adev = drm_to_adev(crtc->dev);
>>
>> >            struct dm_crtc_state *acrtc_state = 
>> >to_dm_crtc_state(crtc->state);
>>
>> > @@ -169,10 +169,16 @@ static inline int dm_set_vblank(struct drm_crtc 
>> > *crtc, bool enable)
>>
>> >            if (rc)
>>
>> >                           return rc;
>>
>> >  
>>
>> > -          irq_source = IRQ_TYPE_VBLANK + acrtc->otg_inst;
>>
>> > +         irq_type = amdgpu_display_crtc_idx_to_irq_type(adev,
>>
>> > +acrtc->crtc_id);
>>
>> > +
>>
>> > +         if (enable)
>>
>> > +                       rc = amdgpu_irq_get(adev, &adev->crtc_irq, 
>> > irq_type);
>>
>> > +
>>
>> > +         else
>>
>> > +                       rc = amdgpu_irq_put(adev, &adev->crtc_irq, 
>> > irq_type);
>>
>> >  
>>
>> > -          if (!dc_interrupt_set(adev->dm.dc, irq_source, enable))
>>
>> > -                        return -EBUSY;
>>
>> > +         if (rc)
>>
>> > +                       return rc;
>>
>> >  
>>
>> >   skip:
>>
>> >            if (amdgpu_in_reset(adev))
>>
>>  
>>
> 

Reply via email to