On 2021-08-16 6:13 a.m., Lazar, Lijo wrote:
> On 8/13/2021 9:30 PM, Michel Dänzer wrote:
>> On 2021-08-13 5:07 p.m., Lazar, Lijo wrote:
>>> On 8/13/2021 8:10 PM, Michel Dänzer wrote:
>>>> On 2021-08-13 4:14 p.m., Lazar, Lijo wrote:
>>>>> On 8/13/2021 7:04 PM, Michel Dänzer wrote:
>>>>>> On 2021-08-13 1:50 p.m., Lazar, Lijo wrote:
>>>>>>> On 8/13/2021 3:59 PM, Michel Dänzer wrote:
>>>>>>>> From: Michel Dänzer <mdaen...@redhat.com>
>>>>>>>>
>>>>>>>> schedule_delayed_work does not push back the work if it was already
>>>>>>>> scheduled before, so amdgpu_device_delay_enable_gfx_off ran ~100 ms
>>>>>>>> after the first time GFXOFF was disabled and re-enabled, even if GFXOFF
>>>>>>>> was disabled and re-enabled again during those 100 ms.
>>>>>>>>
>>>>>>>> This resulted in frame drops / stutter with the upcoming mutter 41
>>>>>>>> release on Navi 14, due to constantly enabling GFXOFF in the HW and
>>>>>>>> disabling it again (for getting the GPU clock counter).
>>>>>>>>
>>>>>>>> To fix this, call cancel_delayed_work_sync when GFXOFF transitions from
>>>>>>>> enabled to disabled. This makes sure the delayed work will be scheduled
>>>>>>>> as intended in the reverse case.
>>>>>>>>
>>>>>>>> In order to avoid a deadlock, amdgpu_device_delay_enable_gfx_off needs
>>>>>>>> to use mutex_trylock instead of mutex_lock.
>>>>>>>>
>>>>>>>> v2:
>>>>>>>> * Use cancel_delayed_work_sync & mutex_trylock instead of
>>>>>>>>       mod_delayed_work.
>>>>>>>>
>>>>>>>> Signed-off-by: Michel Dänzer <mdaen...@redhat.com>
>>>>>>>> ---
>>>>>>>>      drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 11 ++++++++++-
>>>>>>>>      drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c    | 13 +++++++------
>>>>>>>>      drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.h    |  3 +++
>>>>>>>>      3 files changed, 20 insertions(+), 7 deletions(-)
>>>>>>>>
>>>>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c 
>>>>>>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>>>>>> index f3fd5ec710b6..8b025f70706c 100644
>>>>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>>>>>> @@ -2777,7 +2777,16 @@ static void 
>>>>>>>> amdgpu_device_delay_enable_gfx_off(struct work_struct *work)
>>>>>>>>          struct amdgpu_device *adev =
>>>>>>>>              container_of(work, struct amdgpu_device, 
>>>>>>>> gfx.gfx_off_delay_work.work);
>>>>>>>>      -    mutex_lock(&adev->gfx.gfx_off_mutex);
>>>>>>>> +    /* mutex_lock could deadlock with cancel_delayed_work_sync in 
>>>>>>>> amdgpu_gfx_off_ctrl. */
>>>>>>>> +    if (!mutex_trylock(&adev->gfx.gfx_off_mutex)) {
>>>>>>>> +        /* If there's a bug which causes amdgpu_gfx_off_ctrl to be 
>>>>>>>> called with enable=true
>>>>>>>> +         * when adev->gfx.gfx_off_req_count is already 0, we might 
>>>>>>>> race with that.
>>>>>>>> +         * Re-schedule to make sure gfx off will be re-enabled in the 
>>>>>>>> HW eventually.
>>>>>>>> +         */
>>>>>>>> +        schedule_delayed_work(&adev->gfx.gfx_off_delay_work, 
>>>>>>>> AMDGPU_GFX_OFF_DELAY_ENABLE);
>>>>>>>> +        return;
>>>>>>>
>>>>>>> This is not needed and is just creating another thread to contend for 
>>>>>>> mutex.
>>>>>>
>>>>>> Still not sure what you mean by that. What other thread?
>>>>>
>>>>> Sorry, I meant it schedules another workitem and delays GFXOFF enablement 
>>>>> further. For ex: if it was another function like gfx_off_status holding 
>>>>> the lock at the time of check.
>>>>>
>>>>>>
>>>>>>> The checks below take care of enabling gfxoff correctly. If it's 
>>>>>>> already in gfx_off state, it doesn't do anything. So I don't see why 
>>>>>>> this change is needed.
>>>>>>
>>>>>> mutex_trylock is needed to prevent the deadlock discussed before and 
>>>>>> below.
>>>>>>
>>>>>> schedule_delayed_work is needed due to this scenario hinted at by the 
>>>>>> comment:
>>>>>>
>>>>>> 1. amdgpu_gfx_off_ctrl locks mutex, calls schedule_delayed_work
>>>>>> 2. amdgpu_device_delay_enable_gfx_off runs, calls mutex_trylock, which 
>>>>>> fails
>>>>>>
>>>>>> GFXOFF would never get re-enabled in HW in this case (until 
>>>>>> amdgpu_gfx_off_ctrl calls schedule_delayed_work again).
>>>>>>
>>>>>> (cancel_delayed_work_sync guarantees there's no pending delayed work 
>>>>>> when it returns, even if amdgpu_device_delay_enable_gfx_off calls 
>>>>>> schedule_delayed_work)
>>>>>>
>>>>>
>>>>> I think we need to explain based on the original code before. There is an 
>>>>> asssumption here that the only other contention of this mutex is with the 
>>>>> gfx_off_ctrl function.
>>>>
>>>> Not really.
>>>>
>>>>
>>>>> As far as I understand if the work has already started running when 
>>>>> schedule_delayed_work is called, it will insert another in the work queue 
>>>>> after delay. Based on that understanding I didn't find a problem with the 
>>>>> original code.
>>>>
>>>> Original code as in without this patch or the mod_delayed_work patch? If 
>>>> so, the problem is not when the work has already started running. It's 
>>>> that when it hasn't started running yet, schedule_delayed_work doesn't 
>>>> change the timeout for the already scheduled work, so it ends up enabling 
>>>> GFXOFF earlier than intended (and thus at all in scenarios when it's not 
>>>> supposed to).
>>>>
>>>
>>> I meant the original implementation of amdgpu_device_delay_enable_gfx_off().
>>>
>>>
>>> If you indeed want to use _sync, there is a small problem with this 
>>> implementation also which is roughly equivalent to the original problem you 
>>> faced.
>>>
>>> amdgpu_gfx_off_ctrl(disable) locks mutex
>>> calls cancel_delayed_work_sync
>>> amdgpu_device_delay_enable_gfx_off already started running
>>>      mutex_trylock fails and schedules another one
>>> amdgpu_gfx_off_ctrl(enable)
>>>      schedules_delayed_work() - Delay is not extended, it's the same as 
>>> when it's rearmed from work item.
>>
>>
>> This cannot happen. When cancel_delayed_work_sync returns, it guarantees 
>> that the delayed work is not scheduled
>> , even if amdgpu_device_delay_enable_gfx_off called schedule_delayed_work. 
>> In other words, it cancels that as well.
>>
> 
> Ah, thanks! Didn't know that it will cancel out re-queued work also. In that 
> case, may be reduce the delay for re-queuing it - say 50% or 25% of 
> AMDGPU_GFX_OFF_DELAY_ENABLE. Instead of delaying GFXOFF further, it's better 
> to enable it faster as it's losing out to another enable or some other 
> function.
> 
>>> Probably, overthinking about the solution. Looking back, mod_ version is 
>>> simpler :). May be just delay it further everytime there is a call with 
>>> enable instead of doing it only for req_cnt==0?
>>
>> That has some issues as well:
>>
>> * Still prone to the "amdgpu_device_delay_enable_gfx_off re-enables GFXOFF 
>> immediately after amdgpu_gfx_off_ctrl dropped req_count to 0" race if the 
>> former starts running between when the latter locks the mutex and calls 
>> mod_delayed_work.
>> * If the work is not scheduled yet, mod_delayed_work would schedule it, even 
>> if req_count > 0, in which case it couldn't actually enable GFXOFF.
>>
>> Conceptually, making sure the work is never scheduled while req_count > 0 
>> seems cleaner to me. It's the same principle as in the JPEG/UVD/VCE/VCN ring 
>> functions (which are presumably hotter paths than these amdgpu_gfx_off 
>> functions) I needlessly modified in patch 2.
>>
>> (It also means amdgpu_device_delay_enable_gfx_off technically no longer 
>> needs to test req_count or gfx_off_state; I can spin a v3 for that if 
>> desired)
>>
> 
> Would still keep the "gfx_off_state check" to avoid executing the sequence 
> due to buggy enable calls coming when it's already in gfxoff (if at all that 
> happens).

The v3 patch addresses all of these issues.


-- 
Earthling Michel Dänzer               |               https://redhat.com
Libre software enthusiast             |             Mesa and X developer

Reply via email to