On 3/30/23 09:01, Luben Tuikov wrote:
> Hi Alan,
> 
> Inline:
> 
> On 2023-03-30 06:48, Christian König wrote:
>> Am 30.03.23 um 11:15 schrieb Liu, HaoPing (Alan):
>>>
>>> [AMD Official Use Only - General]
>>>
>>>  
>>>
>>> Hi Christian,
>>>
>>>  
>>>
>>> Thanks for the review. Please see inline.
>>>
>>>  
>>>
>>> Best Regards,
>>>
>>> Alan
>>>
>>>  
>>>
>>> -----Original Message-----
>>> From: Christian König <ckoenig.leichtzumer...@gmail.com>
>>> Sent: Tuesday, March 28, 2023 7:16 PM
>>> To: Liu, HaoPing (Alan) <haoping....@amd.com>; amd-gfx@lists.freedesktop.org
>>> Cc: Lakha, Bhawanpreet <bhawanpreet.la...@amd.com>
>>> Subject: Re: [PATCH] drm/amdgpu: Fix desktop freezed after gpu-reset
>>>
>>>  
>>>
>>> Am 27.03.23 um 17:20 schrieb Alan Liu:
>>>
>>>> [Why]
>>>
>>>> After gpu-reset, sometimes the driver would fail to enable vblank irq,
>>>
>>>> causing flip_done timed out and the desktop freezed.
>>>
>>>>  
>>>
>>>> During gpu-reset, we will disable and enable vblank irq in
>>>
>>>> dm_suspend() and dm_resume(). Later on in
>>>
>>>> amdgpu_irq_gpu_reset_resume_helper(), we will check irqs' refcount and 
>>>> decide to enable or disable the irqs again.
>>>
>>>>  
>>>
>>>> However, we have 2 sets of API for controling vblank irq, one is
>>>
>>>> dm_vblank_get/put() and another is amdgpu_irq_get/put(). Each API has
>>>
>>>> its own refcount and flag to store the state of vblank irq, and they
>>>
>>>> are not synchronized.
>>>
>>>  
>>>
>>> This is the source of the problem and you should address this instead.
>>>
>>> The change you suggested below would break in some use cases.
>>>
>>>  
>>>
>>> In struct drm_vblank_crtc we have a vblank irq refcount controlled by drm 
>>> layer, and in struct amdgpu_irq_src we have enabled_types as refcount for 
>>> each irq controlled by the dm.
>>>
>>> I think the best solution will be to get rid of the refcount in drm and 
>>> only maintain the dm one, and add a crtc function for the drm layer to get 
>>> the refcount/state of vblank.
>>>
>>> But this may be dangerous since it’s a change in drm layer. Do you have any 
>>> comments?
>>>
>>
>> You don't necessarily need to remove it completely, what you can do as well 
>> is properly chaining of them.
>>
>> E.g. when the DRM counter goes from 0->1 or 1->0 it calls some function to 
>> enable/disable the hw irq. In this situation you call 
>> amdgpu_irq_get()/amdgpu_irq_put() as appropriate.
>>
>> The the code in this patch already looks like it goes into the right 
>> direction regarding that. It just seems to be that you have some race issues 
>> when you need to add checks that the IRQ counter doesn't goes below 0.
> 
> Changing the DRM layer is generally not a good idea, unless
> there is a compelling reason to do so, like fixing a bug, or adding
> a new feature benefiting all drivers. As there are many drivers using
> DRM, any changes in DRM are vetted thoroughly and need a good reason to
> take place.
> 

While I agree in this case I also observe that we sometimes shy away from
DRM changes when they make sense and instead do weird workarounds in
our driver. I would encourage DRM changes when they make sense but when
one does so one needs to make sure none of the other DRM drivers break.

Harry

> I suggest you follow Christian's advice.
> 
> Note that there's already a callback from drm_vblank_get() down
> to amdgpu_enable_vblank_kms() which calls amdgpu_irq_get(). Perhaps,
> you can leverage that. Similarly for the drm_vblank_put() to
> the amdgpu_vblank_put() path.
> 
>>
>>>  
>>>
>>>>  
>>>
>>>> In drm we use the first API to control vblank irq but in
>>>
>>>> amdgpu_irq_gpu_reset_resume_helper() we use the second set of API.
>>>
>>>>  
>>>
>>>> The failure happens when vblank irq was enabled by dm_vblank_get()
>>>
>>>> before gpu-reset, we have vblank->enabled true. However, during
>>>
>>>> gpu-reset, in amdgpu_irq_gpu_reset_resume_helper(), vblank irq's state
>>>
>>>> checked from
>>>
>>>> amdgpu_irq_update() is DISABLED. So finally it will disable vblank irq
>>>
>>>> again. After gpu-reset, if there is a cursor plane commit, the driver
>>>
>>>> will try to enable vblank irq by calling drm_vblank_enable(), but the
>>>
>>>> vblank->enabled is still true, so it fails to turn on vblank irq and
>>>
>>>> causes flip_done can't be completed in vblank irq handler and desktop
>>>
>>>> become freezed.
>>>
>>>>  
>>>
>>>> [How]
>>>
>>>> Combining the 2 vblank control APIs by letting drm's API finally calls
>>>
>>>> amdgpu_irq's API, so the irq's refcount and state of both APIs can be
>>>
>>>> synchronized. Also add a check to prevent refcount from being less
>>>
>>>> then
>>>
>>>> 0 in amdgpu_irq_put().
>>>
>>>>  
>>>
>>>> Signed-off-by: Alan Liu <haoping....@amd.com <mailto:haoping....@amd.com>>
>>>
>>>> ---
>>>
>>>>    drivers/gpu/drm/amd/amdgpu/amdgpu_irq.c            |  3 +++
>>>
>>>>    .../gpu/drm/amd/display/amdgpu_dm/amdgpu_dm_crtc.c | 14 ++++++++++----
>>>
>>>>    2 files changed, 13 insertions(+), 4 deletions(-)
>>>
>>>>  
>>>
>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_irq.c
>>>
>>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_irq.c
>>>
>>>> index a6aef488a822..1b66003657e2 100644
>>>
>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_irq.c
>>>
>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_irq.c
>>>
>>>> @@ -597,6 +597,9 @@ int amdgpu_irq_put(struct amdgpu_device *adev, struct 
>>>> amdgpu_irq_src *src,
>>>
>>>>             if (!src->enabled_types || !src->funcs->set)
>>>
>>>>                            return -EINVAL;
>>>
>>>>   
>>>
>>>> +         if (!amdgpu_irq_enabled(adev, src, type))
>>>
>>>> +                       return 0;
>>>
>>>> +
>>>
>>>  
>>>
>>> That is racy and won't work. The intention of amdgpu_irq_update() is to 
>>> always update the irq state, no matter what the status is.
>>>
>>>  
>>>
>>> This is a change to amdgpu_irq_put() to prevent the refcount from being cut 
>>> to less than 0. Does it break the intention of amdgpu_irq_update()?
>>>
>>
>> Yes, exactly that. The reference count can *never* be below 0 or you have a 
>> bug in the caller.
>>
>> What you could do is to add a WARN_ON(!amdgpu_irq_enabled(adev, src, type)), 
>> but just ignoring the call is an absolute no-go.
>>
> 
> In addition to adding the WARN_ON() as Christian suggested, and noting that
> you cannot ignore the amdgpu_irq_put() call, perhaps investigate if you can
> allow the decrement to take place and then check for negative--atomic_t is
> an "int". If it is negative, then complain, dump the stack, set to 0 and 
> return.

Reply via email to