On 2021-08-16 6:13 a.m., Lazar, Lijo wrote: > On 8/13/2021 9:30 PM, Michel Dänzer wrote: >> On 2021-08-13 5:07 p.m., Lazar, Lijo wrote: >>> On 8/13/2021 8:10 PM, Michel Dänzer wrote: >>>> On 2021-08-13 4:14 p.m., Lazar, Lijo wrote: >>>>> On 8/13/2021 7:04 PM, Michel Dänzer wrote: >>>>>> On 2021-08-13 1:50 p.m., Lazar, Lijo wrote: >>>>>>> On 8/13/2021 3:59 PM, Michel Dänzer wrote: >>>>>>>> From: Michel Dänzer <mdaen...@redhat.com> >>>>>>>> >>>>>>>> schedule_delayed_work does not push back the work if it was already >>>>>>>> scheduled before, so amdgpu_device_delay_enable_gfx_off ran ~100 ms >>>>>>>> after the first time GFXOFF was disabled and re-enabled, even if GFXOFF >>>>>>>> was disabled and re-enabled again during those 100 ms. >>>>>>>> >>>>>>>> This resulted in frame drops / stutter with the upcoming mutter 41 >>>>>>>> release on Navi 14, due to constantly enabling GFXOFF in the HW and >>>>>>>> disabling it again (for getting the GPU clock counter). >>>>>>>> >>>>>>>> To fix this, call cancel_delayed_work_sync when GFXOFF transitions from >>>>>>>> enabled to disabled. This makes sure the delayed work will be scheduled >>>>>>>> as intended in the reverse case. >>>>>>>> >>>>>>>> In order to avoid a deadlock, amdgpu_device_delay_enable_gfx_off needs >>>>>>>> to use mutex_trylock instead of mutex_lock. >>>>>>>> >>>>>>>> v2: >>>>>>>> * Use cancel_delayed_work_sync & mutex_trylock instead of >>>>>>>> mod_delayed_work. >>>>>>>> >>>>>>>> Signed-off-by: Michel Dänzer <mdaen...@redhat.com> >>>>>>>> --- >>>>>>>> drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 11 ++++++++++- >>>>>>>> drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c | 13 +++++++------ >>>>>>>> drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.h | 3 +++ >>>>>>>> 3 files changed, 20 insertions(+), 7 deletions(-) >>>>>>>> >>>>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c >>>>>>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c >>>>>>>> index f3fd5ec710b6..8b025f70706c 100644 >>>>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c >>>>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c >>>>>>>> @@ -2777,7 +2777,16 @@ static void >>>>>>>> amdgpu_device_delay_enable_gfx_off(struct work_struct *work) >>>>>>>> struct amdgpu_device *adev = >>>>>>>> container_of(work, struct amdgpu_device, >>>>>>>> gfx.gfx_off_delay_work.work); >>>>>>>> - mutex_lock(&adev->gfx.gfx_off_mutex); >>>>>>>> + /* mutex_lock could deadlock with cancel_delayed_work_sync in >>>>>>>> amdgpu_gfx_off_ctrl. */ >>>>>>>> + if (!mutex_trylock(&adev->gfx.gfx_off_mutex)) { >>>>>>>> + /* If there's a bug which causes amdgpu_gfx_off_ctrl to be >>>>>>>> called with enable=true >>>>>>>> + * when adev->gfx.gfx_off_req_count is already 0, we might >>>>>>>> race with that. >>>>>>>> + * Re-schedule to make sure gfx off will be re-enabled in the >>>>>>>> HW eventually. >>>>>>>> + */ >>>>>>>> + schedule_delayed_work(&adev->gfx.gfx_off_delay_work, >>>>>>>> AMDGPU_GFX_OFF_DELAY_ENABLE); >>>>>>>> + return; >>>>>>> >>>>>>> This is not needed and is just creating another thread to contend for >>>>>>> mutex. >>>>>> >>>>>> Still not sure what you mean by that. What other thread? >>>>> >>>>> Sorry, I meant it schedules another workitem and delays GFXOFF enablement >>>>> further. For ex: if it was another function like gfx_off_status holding >>>>> the lock at the time of check. >>>>> >>>>>> >>>>>>> The checks below take care of enabling gfxoff correctly. If it's >>>>>>> already in gfx_off state, it doesn't do anything. So I don't see why >>>>>>> this change is needed. >>>>>> >>>>>> mutex_trylock is needed to prevent the deadlock discussed before and >>>>>> below. >>>>>> >>>>>> schedule_delayed_work is needed due to this scenario hinted at by the >>>>>> comment: >>>>>> >>>>>> 1. amdgpu_gfx_off_ctrl locks mutex, calls schedule_delayed_work >>>>>> 2. amdgpu_device_delay_enable_gfx_off runs, calls mutex_trylock, which >>>>>> fails >>>>>> >>>>>> GFXOFF would never get re-enabled in HW in this case (until >>>>>> amdgpu_gfx_off_ctrl calls schedule_delayed_work again). >>>>>> >>>>>> (cancel_delayed_work_sync guarantees there's no pending delayed work >>>>>> when it returns, even if amdgpu_device_delay_enable_gfx_off calls >>>>>> schedule_delayed_work) >>>>>> >>>>> >>>>> I think we need to explain based on the original code before. There is an >>>>> asssumption here that the only other contention of this mutex is with the >>>>> gfx_off_ctrl function. >>>> >>>> Not really. >>>> >>>> >>>>> As far as I understand if the work has already started running when >>>>> schedule_delayed_work is called, it will insert another in the work queue >>>>> after delay. Based on that understanding I didn't find a problem with the >>>>> original code. >>>> >>>> Original code as in without this patch or the mod_delayed_work patch? If >>>> so, the problem is not when the work has already started running. It's >>>> that when it hasn't started running yet, schedule_delayed_work doesn't >>>> change the timeout for the already scheduled work, so it ends up enabling >>>> GFXOFF earlier than intended (and thus at all in scenarios when it's not >>>> supposed to). >>>> >>> >>> I meant the original implementation of amdgpu_device_delay_enable_gfx_off(). >>> >>> >>> If you indeed want to use _sync, there is a small problem with this >>> implementation also which is roughly equivalent to the original problem you >>> faced. >>> >>> amdgpu_gfx_off_ctrl(disable) locks mutex >>> calls cancel_delayed_work_sync >>> amdgpu_device_delay_enable_gfx_off already started running >>> mutex_trylock fails and schedules another one >>> amdgpu_gfx_off_ctrl(enable) >>> schedules_delayed_work() - Delay is not extended, it's the same as >>> when it's rearmed from work item. >> >> >> This cannot happen. When cancel_delayed_work_sync returns, it guarantees >> that the delayed work is not scheduled >> , even if amdgpu_device_delay_enable_gfx_off called schedule_delayed_work. >> In other words, it cancels that as well. >> > > Ah, thanks! Didn't know that it will cancel out re-queued work also. In that > case, may be reduce the delay for re-queuing it - say 50% or 25% of > AMDGPU_GFX_OFF_DELAY_ENABLE. Instead of delaying GFXOFF further, it's better > to enable it faster as it's losing out to another enable or some other > function. > >>> Probably, overthinking about the solution. Looking back, mod_ version is >>> simpler :). May be just delay it further everytime there is a call with >>> enable instead of doing it only for req_cnt==0? >> >> That has some issues as well: >> >> * Still prone to the "amdgpu_device_delay_enable_gfx_off re-enables GFXOFF >> immediately after amdgpu_gfx_off_ctrl dropped req_count to 0" race if the >> former starts running between when the latter locks the mutex and calls >> mod_delayed_work. >> * If the work is not scheduled yet, mod_delayed_work would schedule it, even >> if req_count > 0, in which case it couldn't actually enable GFXOFF. >> >> Conceptually, making sure the work is never scheduled while req_count > 0 >> seems cleaner to me. It's the same principle as in the JPEG/UVD/VCE/VCN ring >> functions (which are presumably hotter paths than these amdgpu_gfx_off >> functions) I needlessly modified in patch 2. >> >> (It also means amdgpu_device_delay_enable_gfx_off technically no longer >> needs to test req_count or gfx_off_state; I can spin a v3 for that if >> desired) >> > > Would still keep the "gfx_off_state check" to avoid executing the sequence > due to buggy enable calls coming when it's already in gfxoff (if at all that > happens).
The v3 patch addresses all of these issues. -- Earthling Michel Dänzer | https://redhat.com Libre software enthusiast | Mesa and X developer