[Public]

There is no need for a nested lock. It only needs to follow the order:

  set instance power_state
  set profile (this takes a global lock, and hence the instance power state will be visible to whichever thread gets the workload profile lock).
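A minimal sketch of the ordering described above, using placeholder types and names rather than the actual amdgpu structures (illustrative only):

/* Illustrative sketch only; placeholder types, not the actual amdgpu code. */
#include <linux/mutex.h>
#include <linux/types.h>

struct inst_sketch {
    struct mutex pg_lock;        /* per-instance lock */
    bool powered_on;             /* per-instance power state */
};

struct dev_sketch {
    struct mutex workload_profile_mutex;   /* global lock */
    bool workload_profile_active;
    struct inst_sketch inst[4];
};

/* Order described above: update the instance power state first, then take the
 * global lock to update the profile.  The two locks are taken one after the
 * other, not nested, and any thread that later acquires the global lock sees
 * the earlier power-state update. */
static void sketch_power_on_and_set_profile(struct dev_sketch *dev, int i)
{
    mutex_lock(&dev->inst[i].pg_lock);
    dev->inst[i].powered_on = true;
    mutex_unlock(&dev->inst[i].pg_lock);

    mutex_lock(&dev->workload_profile_mutex);
    if (!dev->workload_profile_active)
        dev->workload_profile_active = true;
    mutex_unlock(&dev->workload_profile_mutex);
}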
You are seeing a nested lock just because I added the code right after the power state setting.

Thanks,
Lijo

-----Original Message-----
From: Sundararaju, Sathishkumar <sathishkumar.sundarar...@amd.com>
Sent: Thursday, August 14, 2025 5:23 PM
To: Lazar, Lijo <lijo.la...@amd.com>; Alex Deucher <alexdeuc...@gmail.com>
Cc: Wu, David <david....@amd.com>; Deucher, Alexander <alexander.deuc...@amd.com>; amd-gfx@lists.freedesktop.org
Subject: Re: [PATCH] drm/amdgpu/vcn: fix video profile race condition (v3)

On 8/14/2025 3:16 PM, Lazar, Lijo wrote:
> [Public]
>
> I see your point now. Attached should work, I guess. Is the concern more about having to take the lock for every begin?

This is closer, but the thing is, IMO we shouldn't have to use two locks and go into nested locking; we can do with just one global lock.

The power_state of each instance and the global workload_profile_active are inter-related; they need to be guarded together. Nested locking could work, but why go nested if a single lock is enough? Nesting complicates it.

Regards,
Sathish

> Thanks,
> Lijo
>
> -----Original Message-----
> From: amd-gfx <amd-gfx-boun...@lists.freedesktop.org> On Behalf Of Lazar, Lijo
> Sent: Thursday, August 14, 2025 2:55 PM
> To: Sundararaju, Sathishkumar <sathishkumar.sundarar...@amd.com>; Alex Deucher <alexdeuc...@gmail.com>
> Cc: Wu, David <david....@amd.com>; Deucher, Alexander <alexander.deuc...@amd.com>; amd-gfx@lists.freedesktop.org
> Subject: RE: [PATCH] drm/amdgpu/vcn: fix video profile race condition (v3)
>
> [Public]
>
> That is not required, I think. The power profile is set by an instance *after* setting itself to power on. Also, it's switched back after changing its power state to off. If the idle worker is run by another instance, it won't see inst0 as power gated and won't change the power profile.
>
> Thanks,
> Lijo
>
> -----Original Message-----
> From: Sundararaju, Sathishkumar <sathishkumar.sundarar...@amd.com>
> Sent: Thursday, August 14, 2025 2:41 PM
> To: Lazar, Lijo <lijo.la...@amd.com>; Alex Deucher <alexdeuc...@gmail.com>
> Cc: Wu, David <david....@amd.com>; Deucher, Alexander <alexander.deuc...@amd.com>; amd-gfx@lists.freedesktop.org
> Subject: Re: [PATCH] drm/amdgpu/vcn: fix video profile race condition (v3)
>
> Hi Lijo,
>
> On 8/14/2025 2:11 PM, Lazar, Lijo wrote:
>> [Public]
>>
>> We already have a per instance power state that can be tracked. What about something like attached?
>
> This also has concurrent access of the power state: vcn.inst[i].cur_state is not protected by workload_profile_mutex everywhere, so it can still change while you are holding workload_profile_mutex and checking it.
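For comparison, a minimal sketch of the single-global-lock arrangement argued for above, again with placeholder names rather than the actual driver fields; one lock guards both the per-instance power state and the global profile flag, so they can never be observed out of sync:

/* Illustrative sketch of the single-global-lock idea; placeholder names only. */
#include <linux/mutex.h>
#include <linux/types.h>

#define SKETCH_MAX_INST 4

struct vcn_sketch {
    struct mutex global_lock;              /* single lock for everything below */
    bool workload_profile_active;
    bool inst_powered_on[SKETCH_MAX_INST];
};

static void vcn_sketch_begin_use(struct vcn_sketch *v, int inst)
{
    mutex_lock(&v->global_lock);
    v->inst_powered_on[inst] = true;
    if (!v->workload_profile_active)
        v->workload_profile_active = true;     /* enable video profile */
    mutex_unlock(&v->global_lock);
}

static void vcn_sketch_idle_work(struct vcn_sketch *v, int inst)
{
    bool any_on = false;
    int i;

    mutex_lock(&v->global_lock);
    v->inst_powered_on[inst] = false;          /* gate this instance */
    for (i = 0; i < SKETCH_MAX_INST; i++)
        any_on |= v->inst_powered_on[i];
    if (!any_on && v->workload_profile_active)
        v->workload_profile_active = false;    /* no instance is powered on */
    mutex_unlock(&v->global_lock);
}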
>
> Regards,
>
> Sathish
>
>> Thanks,
>> Lijo
>>
>> -----Original Message-----
>> From: amd-gfx <amd-gfx-boun...@lists.freedesktop.org> On Behalf Of Sundararaju, Sathishkumar
>> Sent: Thursday, August 14, 2025 4:43 AM
>> To: Alex Deucher <alexdeuc...@gmail.com>
>> Cc: Wu, David <david....@amd.com>; Deucher, Alexander <alexander.deuc...@amd.com>; amd-gfx@lists.freedesktop.org
>> Subject: Re: [PATCH] drm/amdgpu/vcn: fix video profile race condition (v3)
>>
>> On 8/14/2025 3:38 AM, Alex Deucher wrote:
>>> On Wed, Aug 13, 2025 at 5:1 PM Sundararaju, Sathishkumar <sathishkumar.sundarar...@amd.com> wrote:
>>>> On 8/14/2025 2:33 AM, Alex Deucher wrote:
>>>>> On Wed, Aug 13, 2025 at 4:58 PM Sundararaju, Sathishkumar <sathishkumar.sundarar...@amd.com> wrote:
>>>>>> On 8/14/2025 1:35 AM, Alex Deucher wrote:
>>>>>>> On Wed, Aug 13, 2025 at 2:23 PM Sundararaju, Sathishkumar <sathishkumar.sundarar...@amd.com> wrote:
>>>>>>>> Hi Alex, Hi David,
>>>>>>>>
>>>>>>>> I see David's concern, but his suggestion won't solve the problem yet, and neither does the current form. Reason:
>>>>>>>>
>>>>>>>> The emitted fence count and total submission count are fast transients which frequently become 0 in between video decodes (between jobs). Even with the atomics and locks there can be a switch of the video power profile; in the current form of the patch that window is minimized, but it can still happen if stress tested. But the power state of any instance becoming zero
>>>>>>> Can you explain how this can happen?  I'm not seeing it.
>>>>>> Consider this situation: inst0 and inst1 are actively decoding, inst0's decode completes, and the delayed idle work starts.
>>>>>> inst0's idle handler can read 0 total fences and 0 total submission count even if inst1 is actively decoding, because that can happen between jobs:
>>>>>>   - begin_use increments vcn.total_submission_cnt and end_use decrements it, so it can be 0.
>>>>>>   - if outstanding fences are cleared and no new fence has been emitted between jobs, it can be 0.
>>>>>>   - both of the above conditions do not mean video decode is complete on inst1; it is actively decoding.
>>>>> How can there be active decoding without an outstanding fence?  In that case, total_fences (fences from both instances) would be non-0.
>>>> I mean that on inst1 the scheduled job is already complete, so there are 0 outstanding fences, and the newer job is yet to be scheduled and committed to the ring (inst1). This doesn't mean decode has stopped on inst1, right (I am saying if the timing of inst0's idle work coincides with this)?
>>>>
>>>> Or am I wrong in assuming this? Can't this ever happen? Please correct my understanding here.
>>> The flow looks like:
>>>
>>> begin_use(inst)
>>> emit_fence(inst)
>>> end_use(inst)
>>>
>>> ...later
>>> fence signals
>>> ...later
>>> work handler
>>>
>>> In begin_use we increment the global and per instance submission. This protects the power gating and profile until end_use.  In end_use we decrement the submissions because we don't need to protect anything any more, as we have the fence that was submitted via the ring.  That fence won't signal until the job is complete.
>> Is a next begin_use always guaranteed to run before the current job's fence signals?
>>
>> If not, then both the total submission count and the total fence count are zero; for example, with delayed job/packet submissions from user space, or when the next job is scheduled only after the current job's fence signals.
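For reference, a minimal sketch of the begin_use/end_use accounting described in the flow above, using placeholder counters rather than the actual driver fields, with a comment marking the window being asked about:

/* Simplified sketch of the accounting discussed above; placeholder names. */
#include <linux/atomic.h>

struct vcn_sketch_counts {
    atomic_t inst_submission_cnt;   /* per instance, held from begin_use to end_use */
    atomic_t total_submission_cnt;  /* global, held from begin_use to end_use */
    atomic_t emitted_fences;        /* emitted but not yet signalled */
};

static void sketch_begin_use(struct vcn_sketch_counts *c)
{
    atomic_inc(&c->inst_submission_cnt);
    atomic_inc(&c->total_submission_cnt);
    /* ... ungate the instance and enable the video profile if needed ... */
}

static void sketch_emit_fence(struct vcn_sketch_counts *c)
{
    atomic_inc(&c->emitted_fences);   /* the job is now covered by an unsignalled fence */
}

static void sketch_end_use(struct vcn_sketch_counts *c)
{
    atomic_dec(&c->inst_submission_cnt);
    atomic_dec(&c->total_submission_cnt);
    /* ... schedule the delayed idle work ... */
}

/* The question above: after the fence signals (emitted_fences drops back to 0)
 * and before the next begin_use, all three counters can read 0 even though
 * user space intends to submit more work.  Whether the idle worker may then
 * drop the video profile is exactly what is being debated in this thread. */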
>>
>> If this is never possible, then (v3) is perfect.
>>
>> Regards,
>>
>> Sathish
>>
>>> For power gating, we only care about the submission count and fences for that instance; for the profile, we care about the submission count and fences for all instances.  Once the fences have signalled, the outstanding fences will be 0 and there won't be any active work.
>>>
>>> Alex
>>>
>>>> Regards,
>>>>
>>>> Sathish
>>>>
>>>>> Alex
>>>>>
>>>>>> Whereas if instances are powered off, we are sure the idle time has passed and the instance is powered off, with no possible way of an active video decode. When all instances are off we can safely assume there is no active decode, and the global lock protects against a new begin_use on any instance.
>>>>>> The only distant concern is global common locks w.r.t. perf, but we already have a global workload profile mutex, so there shouldn't be any drop in perf with just one single global lock for all instances.
>>>>>>
>>>>>> Just sending out a patch with this fix; I will leave it to you to decide the right method. If you think the outstanding total fences can never be 0 during decode, then your previous version (v3) itself is good, and there is no real benefit in splitting the handlers as such.
>>>>>>
>>>>>> Regards,
>>>>>> Sathish
>>>>>>> If it is possible, maybe it would be easier to just split the profile and powergating into separate handlers.  The profile one would be global and the powergating one would be per instance.  See the attached patches.
>>>>>>>
>>>>>>> Alex
>>>>>>>
>>>>>>>> can be a sure-shot indication of a break in a video decode. The mistake in my patch was using a per-instance mutex; I should have used a common global mutex, which then covers the situation David is trying to bring out.
>>>>>>>>
>>>>>>>> Using one global vcn.pg_lock for idle and begin_use, and using flags to track power state, could help us totally avoid this situation.
>>>>>>>>
>>>>>>>> Regards,
>>>>>>>>
>>>>>>>> Sathish
>>>>>>>>
>>>>>>>> On 8/13/2025 11:46 PM, Wu, David wrote:
>>>>>>>>> On 8/13/2025 12:51 PM, Alex Deucher wrote:
>>>>>>>>>> On Wed, Aug 13, 2025 at 12:39 PM Wu, David <david...@amd.com> wrote:
>>>>>>>>>>> Hi Alex,
>>>>>>>>>>>
>>>>>>>>>>> The addition of total_submission_cnt should work, in that it is unlikely to have a context switch right after the begin_use().  The suggestion of moving it inside the lock (which I prefer, in case someone adds more before the lock and it is not reviewed thoroughly) - up to you to decide.
>>>>>>>>>>>
>>>>>>>>>>> Reviewed-by: David (Ming Qiang) Wu <david....@amd.com>
>>>>>>>>>>>
>>>>>>>>>>> Thanks,
>>>>>>>>>>> David
>>>>>>>>>>> On 8/13/2025 9:45 AM, Alex Deucher wrote:
>>>>>>>>>>>> If there are multiple instances of the VCN running, we may end up switching the video profile while another instance is active because we only take into account the current instance's submissions.  Look at all outstanding fences for the video profile.
>>>>>>>>>>>>
>>>>>>>>>>>> v2: drop early exit in begin_use()
>>>>>>>>>>>> v3: handle possible race between begin_use() work handler
>>>>>>>>>>>>
>>>>>>>>>>>> Fixes: 3b669df92c85 ("drm/amdgpu/vcn: adjust workload profile handling")
>>>>>>>>>>>> Reviewed-by: Sathishkumar S <sathishkumar.sundarar...@amd.com> (v1)
>>>>>>>>>>>> Signed-off-by: Alex Deucher <alexander.deuc...@amd.com>
>>>>>>>>>>>> ---
>>>>>>>>>>>>  drivers/gpu/drm/amd/amdgpu/amdgpu_vcn.c | 40 ++++++++++++-------------
>>>>>>>>>>>>  drivers/gpu/drm/amd/amdgpu/amdgpu_vcn.h |  1 +
>>>>>>>>>>>>  2 files changed, 21 insertions(+), 20 deletions(-)
>>>>>>>>>>>>
>>>>>>>>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vcn.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_vcn.c
>>>>>>>>>>>> index 9a76e11d1c184..593c1ddf8819b 100644
>>>>>>>>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vcn.c
>>>>>>>>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vcn.c
>>>>>>>>>>>> @@ -415,19 +415,25 @@ static void amdgpu_vcn_idle_work_handler(struct work_struct *work)
>>>>>>>>>>>>      struct amdgpu_vcn_inst *vcn_inst =
>>>>>>>>>>>>          container_of(work, struct amdgpu_vcn_inst, idle_work.work);
>>>>>>>>>>>>      struct amdgpu_device *adev = vcn_inst->adev;
>>>>>>>>>>>> -    unsigned int fences = 0, fence[AMDGPU_MAX_VCN_INSTANCES] = {0};
>>>>>>>>>>>> -    unsigned int i = vcn_inst->inst, j;
>>>>>>>>>>>> +    unsigned int total_fences = 0, fence[AMDGPU_MAX_VCN_INSTANCES] = {0};
>>>>>>>>>>>> +    unsigned int i, j;
>>>>>>>>>>>>      int r = 0;
>>>>>>>>>>>>
>>>>>>>>>>>> -    if (adev->vcn.harvest_config & (1 << i))
>>>>>>>>>>>> +    if (adev->vcn.harvest_config & (1 << vcn_inst->inst))
>>>>>>>>>>>>          return;
>>>>>>>>>>>>
>>>>>>>>>>>> -    for (j = 0; j < adev->vcn.inst[i].num_enc_rings; ++j)
>>>>>>>>>>>> -        fence[i] += amdgpu_fence_count_emitted(&vcn_inst->ring_enc[j]);
>>>>>>>>>>>> +    for (i = 0; i < adev->vcn.num_vcn_inst; ++i) {
>>>>>>>>>>>> +        struct amdgpu_vcn_inst *v = &adev->vcn.inst[i];
>>>>>>>>>>>> +
>>>>>>>>>>>> +        for (j = 0; j < v->num_enc_rings; ++j)
>>>>>>>>>>>> +            fence[i] += amdgpu_fence_count_emitted(&v->ring_enc[j]);
>>>>>>>>>>>> +        fence[i] += amdgpu_fence_count_emitted(&v->ring_dec);
>>>>>>>>>>>> +        total_fences += fence[i];
>>>>>>>>>>>> +    }
>>>>>>>>>>>>
>>>>>>>>>>>>      /* Only set DPG pause for VCN3 or below, VCN4 and above will be handled by FW */
>>>>>>>>>>>>      if (adev->pg_flags & AMD_PG_SUPPORT_VCN_DPG &&
>>>>>>>>>>>> -        !adev->vcn.inst[i].using_unified_queue) {
>>>>>>>>>>>> +        !vcn_inst->using_unified_queue) {
>>>>>>>>>>>>          struct dpg_pause_state new_state;
>>>>>>>>>>>>
>>>>>>>>>>>>          if (fence[i] ||
>>>>>>>>>>>> @@ -436,18 +442,18 @@ static void amdgpu_vcn_idle_work_handler(struct work_struct *work)
>>>>>>>>>>>>          else
>>>>>>>>>>>>              new_state.fw_based = VCN_DPG_STATE__UNPAUSE;
>>>>>>>>>>>>
>>>>>>>>>>>> -        adev->vcn.inst[i].pause_dpg_mode(vcn_inst, &new_state);
>>>>>>>>>>>> +        vcn_inst->pause_dpg_mode(vcn_inst, &new_state);
>>>>>>>>>>>>      }
>>>>>>>>>>>>
>>>>>>>>>>>> -    fence[i] += amdgpu_fence_count_emitted(&vcn_inst->ring_dec);
>>>>>>>>>>>> -    fences += fence[i];
>>>>>>>>>>>> -
>>>>>>>>>>>> -    if (!fences && !atomic_read(&vcn_inst->total_submission_cnt)) {
>>>>>>>>>>>> +    if (!fence[vcn_inst->inst] && !atomic_read(&vcn_inst->total_submission_cnt)) {
>>>>>>>>>>>> +        /* This is specific to this instance */
>>>>>>>>>>>>          mutex_lock(&vcn_inst->vcn_pg_lock);
>>>>>>>>>>>>          vcn_inst->set_pg_state(vcn_inst, AMD_PG_STATE_GATE);
>>>>>>>>>>>>          mutex_unlock(&vcn_inst->vcn_pg_lock);
>>>>>>>>>>>>          mutex_lock(&adev->vcn.workload_profile_mutex);
>>>>>>>>>>>> -        if (adev->vcn.workload_profile_active) {
>>>>>>>>>>>> +        /* This is global and depends on all VCN instances */
>>>>>>>>>>>> +        if (adev->vcn.workload_profile_active && !total_fences &&
>>>>>>>>>>>> +            !atomic_read(&adev->vcn.total_submission_cnt)) {
>>>>>>>>>>>>              r = amdgpu_dpm_switch_power_profile(adev, PP_SMC_POWER_PROFILE_VIDEO, false);
>>>>>>>>>>>>              if (r)
>>>>>>>>>>>> @@ -467,16 +473,10 @@ void amdgpu_vcn_ring_begin_use(struct amdgpu_ring *ring)
>>>>>>>>>>>>      int r = 0;
>>>>>>>>>>>>
>>>>>>>>>>>>      atomic_inc(&vcn_inst->total_submission_cnt);
>>>>>>>>>>>> +    atomic_inc(&adev->vcn.total_submission_cnt);
>>>>>>>>>>> move this addition down inside the mutex lock
>>>>>>>>>>>>      cancel_delayed_work_sync(&vcn_inst->idle_work);
>>>>>>>>>>>>
>>>>>>>>>>>> -    /* We can safely return early here because we've cancelled the
>>>>>>>>>>>> -     * the delayed work so there is no one else to set it to false
>>>>>>>>>>>> -     * and we don't care if someone else sets it to true.
>>>>>>>>>>>> -     */
>>>>>>>>>>>> -    if (adev->vcn.workload_profile_active)
>>>>>>>>>>>> -        goto pg_lock;
>>>>>>>>>>>> -
>>>>>>>>>>>>      mutex_lock(&adev->vcn.workload_profile_mutex);
>>>>>>>>>>> move to here:
>>>>>>>>>>>      atomic_inc(&adev->vcn.total_submission_cnt);
>>>>>>>>>>> I think this should work for multiple instances.
>>>>>>>>>> Why does this need to be protected by the mutex?
>>>>>>>>> hmm.. OK - no need and it is actually better before the mutex.
>>>>>>>>> David
>>>>>>>>>> Alex
>>>>>>>>>>
>>>>>>>>>>> David
>>>>>>>>>>>>      if (!adev->vcn.workload_profile_active) {
>>>>>>>>>>>>          r = amdgpu_dpm_switch_power_profile(adev,
>>>>>>>>>>>>                                              PP_SMC_POWER_PROFILE_VIDEO,
>>>>>>>>>>>> @@ -487,7 +487,6 @@ void amdgpu_vcn_ring_begin_use(struct amdgpu_ring *ring)
>>>>>>>>>>>>      }
>>>>>>>>>>>>      mutex_unlock(&adev->vcn.workload_profile_mutex);
>>>>>>>>>>>>
>>>>>>>>>>>> -pg_lock:
>>>>>>>>>>>>      mutex_lock(&vcn_inst->vcn_pg_lock);
>>>>>>>>>>>>      vcn_inst->set_pg_state(vcn_inst, AMD_PG_STATE_UNGATE);
>>>>>>>>>>>>
>>>>>>>>>>>> @@ -528,6 +527,7 @@ void amdgpu_vcn_ring_end_use(struct amdgpu_ring *ring)
>>>>>>>>>>>>          atomic_dec(&ring->adev->vcn.inst[ring->me].dpg_enc_submission_cnt);
>>>>>>>>>>>>
>>>>>>>>>>>>      atomic_dec(&ring->adev->vcn.inst[ring->me].total_submission_cnt);
>>>>>>>>>>>> +    atomic_dec(&ring->adev->vcn.total_submission_cnt);
>>>>>>>>>>>>
>>>>>>>>>>>>      schedule_delayed_work(&ring->adev->vcn.inst[ring->me].idle_work,
>>>>>>>>>>>>                            VCN_IDLE_TIMEOUT);
>>>>>>>>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vcn.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_vcn.h
>>>>>>>>>>>> index b3fb1d0e43fc9..febc3ce8641ff 100644
>>>>>>>>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vcn.h
>>>>>>>>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vcn.h
>>>>>>>>>>>> @@ -352,6 +352,7 @@ struct amdgpu_vcn {
>>>>>>>>>>>>
>>>>>>>>>>>>      uint16_t inst_mask;
>>>>>>>>>>>>      uint8_t num_inst_per_aid;
>>>>>>>>>>>> +    atomic_t total_submission_cnt;
>>>>>>>>>>>>
>>>>>>>>>>>>      /* IP reg dump */
>>>>>>>>>>>>      uint32_t *ip_dump;
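A condensed sketch of the v3 accounting in the patch quoted above, with simplified stand-in types (the real code lives in amdgpu_vcn.c, and the fence counts here are passed in rather than read from the rings):

/* Condensed, simplified restatement of the quoted v3 patch; placeholder types. */
#include <linux/atomic.h>
#include <linux/mutex.h>
#include <linux/types.h>

struct sketch_inst {
    atomic_t total_submission_cnt;   /* per instance */
};

struct sketch_vcn {
    atomic_t total_submission_cnt;   /* global counter added by this patch */
    struct mutex workload_profile_mutex;
    bool workload_profile_active;
    struct sketch_inst inst[2];
};

static void sketch_ring_begin_use(struct sketch_vcn *vcn, int i)
{
    atomic_inc(&vcn->inst[i].total_submission_cnt);
    atomic_inc(&vcn->total_submission_cnt);
    /* cancel this instance's delayed idle work, then... */
    mutex_lock(&vcn->workload_profile_mutex);
    if (!vcn->workload_profile_active)
        vcn->workload_profile_active = true;   /* switch the video profile on */
    mutex_unlock(&vcn->workload_profile_mutex);
    /* ...and ungate this instance under its own pg lock */
}

static void sketch_idle_work(struct sketch_vcn *vcn, int i,
                             unsigned int inst_fences, unsigned int total_fences)
{
    if (inst_fences || atomic_read(&vcn->inst[i].total_submission_cnt))
        return;                                /* this instance is still busy */
    /* gate only this instance under its own pg lock, then... */
    mutex_lock(&vcn->workload_profile_mutex);
    if (vcn->workload_profile_active && !total_fences &&
        !atomic_read(&vcn->total_submission_cnt))
        vcn->workload_profile_active = false;  /* no VCN instance has outstanding work */
    mutex_unlock(&vcn->workload_profile_mutex);
}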