[Public]

There is no need for a nested lock. It only needs to follow the order:
        set instance power_state
        set profile (this takes a global lock, and hence the instance power
state will be visible to whichever thread gets the workload profile lock next).

You are seeing a nested lock only because I added the code right after the
power-state setting.
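
A rough sketch of that ordering, reusing the existing amdgpu_vcn names
(illustrative only, not the attached patch; the function name is made up and
error handling is elided):

static void vcn_begin_use_sketch(struct amdgpu_device *adev,
				 struct amdgpu_vcn_inst *vcn_inst)
{
	/* 1) Set the instance power state under its own lock. */
	mutex_lock(&vcn_inst->vcn_pg_lock);
	vcn_inst->set_pg_state(vcn_inst, AMD_PG_STATE_UNGATE);
	mutex_unlock(&vcn_inst->vcn_pg_lock);

	/* 2) Only then take the global profile lock.  Any thread that takes
	 * this lock afterwards already sees the instance as ungated, so the
	 * idle worker will not switch the video profile off underneath us.
	 */
	mutex_lock(&adev->vcn.workload_profile_mutex);
	if (!adev->vcn.workload_profile_active) {
		amdgpu_dpm_switch_power_profile(adev, PP_SMC_POWER_PROFILE_VIDEO, true);
		adev->vcn.workload_profile_active = true;
	}
	mutex_unlock(&adev->vcn.workload_profile_mutex);
}

The two critical sections are strictly sequential, not nested.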

Thanks,
Lijo

-----Original Message-----
From: Sundararaju, Sathishkumar <sathishkumar.sundarar...@amd.com>
Sent: Thursday, August 14, 2025 5:23 PM
To: Lazar, Lijo <lijo.la...@amd.com>; Alex Deucher <alexdeuc...@gmail.com>
Cc: Wu, David <david....@amd.com>; Deucher, Alexander 
<alexander.deuc...@amd.com>; amd-gfx@lists.freedesktop.org
Subject: Re: [PATCH] drm/amdgpu/vcn: fix video profile race condition (v3)


On 8/14/2025 3:16 PM, Lazar, Lijo wrote:
> [Public]
>
> I see your point now. Attached should work, I guess. Is the concern more 
> about having to take the lock for every begin?

This is closer, but IMO we shouldn't have to use two locks and go into nested
locking; we can do with just one global lock.

The power_state of each instance and the global workload_profile_active are
inter-related, so they need to be guarded together.

Nested locking could work, but why nest if a single lock is enough? Nesting
complicates it.
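
To make the single-lock idea concrete, a rough sketch (hypothetical function
name; error handling and the idle-side counterpart elided; assumes
workload_profile_mutex is the one global lock):

static void vcn_begin_use_single_lock_sketch(struct amdgpu_device *adev,
					     struct amdgpu_vcn_inst *vcn_inst)
{
	/* One global lock guards both the per-instance power state and
	 * workload_profile_active, so the two can never be observed out
	 * of sync.
	 */
	mutex_lock(&adev->vcn.workload_profile_mutex);
	vcn_inst->set_pg_state(vcn_inst, AMD_PG_STATE_UNGATE);
	if (!adev->vcn.workload_profile_active) {
		amdgpu_dpm_switch_power_profile(adev, PP_SMC_POWER_PROFILE_VIDEO, true);
		adev->vcn.workload_profile_active = true;
	}
	mutex_unlock(&adev->vcn.workload_profile_mutex);
}

The idle handler would take the same lock before checking the power state of
all instances and dropping the profile.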

Regards,

Sathish

>
> Thanks,
> Lijo
>
> -----Original Message-----
> From: amd-gfx <amd-gfx-boun...@lists.freedesktop.org> On Behalf Of
> Lazar, Lijo
> Sent: Thursday, August 14, 2025 2:55 PM
> To: Sundararaju, Sathishkumar <sathishkumar.sundarar...@amd.com>; Alex
> Deucher <alexdeuc...@gmail.com>
> Cc: Wu, David <david....@amd.com>; Deucher, Alexander
> <alexander.deuc...@amd.com>; amd-gfx@lists.freedesktop.org
> Subject: RE: [PATCH] drm/amdgpu/vcn: fix video profile race condition
> (v3)
>
> [Public]
>
> That is not required, I think. The power profile is set by an instance *after*
> it sets itself to powered on, and it's switched back only after it changes its
> power state to off.  If the idle worker runs for another instance, it won't see
> inst0 as power gated and won't change the power profile.
>
> Thanks,
> Lijo
> -----Original Message-----
> From: Sundararaju, Sathishkumar <sathishkumar.sundarar...@amd.com>
> Sent: Thursday, August 14, 2025 2:41 PM
> To: Lazar, Lijo <lijo.la...@amd.com>; Alex Deucher
> <alexdeuc...@gmail.com>
> Cc: Wu, David <david....@amd.com>; Deucher, Alexander
> <alexander.deuc...@amd.com>; amd-gfx@lists.freedesktop.org
> Subject: Re: [PATCH] drm/amdgpu/vcn: fix video profile race condition
> (v3)
>
> Hi Lijo,
>
> On 8/14/2025 2:11 PM, Lazar, Lijo wrote:
>> [Public]
>>
>> We already have a per instance power state that can be tracked. What about 
>> something like attached?
> This also has concurrent access to the power state: vcn.inst[i].cur_state is
> not protected by workload_profile_mutex everywhere, so it can still change
> while you are holding workload_profile_mutex and checking it.
>
> Regards,
>
> Sathish
>
>> Thanks,
>> Lijo
>> -----Original Message-----
>> From: amd-gfx <amd-gfx-boun...@lists.freedesktop.org> On Behalf Of
>> Sundararaju, Sathishkumar
>> Sent: Thursday, August 14, 2025 4:43 AM
>> To: Alex Deucher <alexdeuc...@gmail.com>
>> Cc: Wu, David <david....@amd.com>; Deucher, Alexander
>> <alexander.deuc...@amd.com>; amd-gfx@lists.freedesktop.org
>> Subject: Re: [PATCH] drm/amdgpu/vcn: fix video profile race condition
>> (v3)
>>
>>
>> On 8/14/2025 3:38 AM, Alex Deucher wrote:
>>> On Wed, Aug 13, 2025 at 5:1 PM Sundararaju, Sathishkumar
>>> <sathishkumar.sundarar...@amd.com> wrote:
>>>> On 8/14/2025 2:33 AM, Alex Deucher wrote:
>>>>> On Wed, Aug 13, 2025 at 4:58 PM Sundararaju, Sathishkumar
>>>>> <sathishkumar.sundarar...@amd.com> wrote:
>>>>>> On 8/14/2025 1:35 AM, Alex Deucher wrote:
>>>>>>> On Wed, Aug 13, 2025 at 2:23 PM Sundararaju, Sathishkumar
>>>>>>> <sathishkumar.sundarar...@amd.com> wrote:
>>>>>>>> Hi Alex, Hi David,
>>>>>>>>
>>>>>>>> I see David's concern, but his suggestion still won't solve the
>>>>>>>> problem, nor does the current form. Reason:
>>>>>>>>
>>>>>>>> The emitted fence count and total submission count are fast transients
>>>>>>>> which frequently drop to 0 in between video decodes (between jobs).
>>>>>>>> Even with the atomics and locks, there can be a switch of the video
>>>>>>>> power profile; the current form of the patch minimizes that window,
>>>>>>>> but it can still happen under stress testing. But the power state of
>>>>>>>> any instance becoming zero
>>>>>>> Can you explain how this can happen?  I'm not seeing it.
>>>>>> Consider this situation: inst0 and inst1 are actively decoding, inst0's
>>>>>> decode completes, and the delayed idle work starts.
>>>>>> The inst0 idle handler can read 0 total fences and 0 total submission
>>>>>> count even while inst1 is actively decoding, because it lands between jobs:
>>>>>>       - begin_use increments vcn.total_submission_cnt and end_use
>>>>>> decrements it, so it can be 0.
>>>>>>       - if the outstanding fences are cleared and no new fence has been
>>>>>> emitted, between jobs it can be 0.
>>>>>>       - neither of the above conditions means video decode is complete
>>>>>> on inst1; it is still actively decoding.
>>>>> How can there be active decoding without an outstanding fence?  In
>>>>> that case, total_fences (fences from both instances) would be non-0.
>>>> I mean the job scheduled on inst1 is already complete, so there are 0
>>>> outstanding fences, and a newer job is yet to be scheduled and committed
>>>> to the ring (inst1). That doesn't mean decode has stopped on inst1, right?
>>>> (I am saying the timing of inst0's idle work coincides with this.)
>>>>
>>>> Or am I wrong in assuming this? Can't this ever happen? Please correct my
>>>> understanding here.
>>> The flow looks like:
>>>
>>> begin_use(inst)
>>> emit_fence(inst)
>>> end_use(inst)
>>>
>>> ...later
>>> fence signals
>>> ...later
>>> work handler
>>>
>>> In begin_use we increment the global and per-instance submission counts.
>>> This protects the power gating and profile until end_use.  In end_use we
>>> decrement the submissions because we don't need to protect anything any
>>> more, as we have the fence that was submitted via the ring.  That fence
>>> won't signal until the job is complete.
>> Is the next begin_use always guaranteed to run before the current job's
>> fence signals?
>>
>> If not, then both the total submission count and the total fence count can
>> be zero, for example with delayed job/packet submissions from user space,
>> or when the next job is scheduled only after the current job's fence signals.
>>
>> If this is never possible, then (v3) is perfect.
>>
>> Regards,
>>
>> Sathish
>>
>>> For power gating, we only care about the submission count and fences for
>>> that instance; for the profile, we care about the submission count and
>>> fences for all instances.  Once the fences have signalled, the outstanding
>>> fences will be 0 and there won't be any active work.
>>>
>>> Alex
>>>
>>>> Regards,
>>>>
>>>> Sathish
>>>>
>>>>> Alex
>>>>>
>>>>>> Whereas if the instances are powered off, we are sure the idle time has
>>>>>> passed and the instance is powered off, with no possible active video
>>>>>> decode. When all instances are off we can safely assume there is no
>>>>>> active decode, and the global lock protects against a new begin_use on
>>>>>> any instance.
>>>>>> The only distant concern is a global common lock w.r.t. perf, but we
>>>>>> already have a global workload profile mutex, so there shouldn't be any
>>>>>> drop in perf with just one single global lock for all instances.
>>>>>>
>>>>>> I am just sending out a patch with this fix and will leave it to you to
>>>>>> decide the right method. If you think the outstanding total fences can
>>>>>> never be 0 during decode, then your previous version (v3) itself is
>>>>>> good; there is no real benefit in splitting the handlers as such.
>>>>>>
>>>>>> Regards,
>>>>>> Sathish
>>>>>>> If it is possible, maybe it would be easier to just split the
>>>>>>> profile and powergating into separate handlers.  The profile one
>>>>>>> would be global and the powergating one would be per instance.
>>>>>>> See the attached patches.
>>>>>>>
>>>>>>> Alex
>>>>>>>
>>>>>>>> can be a sure-shot indication of a break in a video decode. The
>>>>>>>> mistake in my patch was using a per-instance mutex; I should have used
>>>>>>>> a common global mutex, and that would cover the situation David is
>>>>>>>> trying to bring out.
>>>>>>>>
>>>>>>>> Using one global vcn.pg_lock for idle and begin_use, and using flags
>>>>>>>> to track the power state, could help us totally avoid this situation.
>>>>>>>>
>>>>>>>> Regards,
>>>>>>>>
>>>>>>>> Sathish
>>>>>>>>
>>>>>>>> On 8/13/2025 11:46 PM, Wu, David wrote:
>>>>>>>>> On 8/13/2025 12:51 PM, Alex Deucher wrote:
>>>>>>>>>> On Wed, Aug 13, 2025 at 12:39 PM Wu, David <david...@amd.com> wrote:
>>>>>>>>>>> Hi Alex,
>>>>>>>>>>>
>>>>>>>>>>> The addition of total_submission_cnt should work, in that it is
>>>>>>>>>>> unlikely to have a context switch right after begin_use().
>>>>>>>>>>> The suggestion of moving it inside the lock (which I prefer, in
>>>>>>>>>>> case someone adds more code before the lock and it isn't reviewed
>>>>>>>>>>> thoroughly) - up to you to decide.
>>>>>>>>>>>
>>>>>>>>>>> Reviewed-by: David (Ming Qiang) Wu <david....@amd.com>
>>>>>>>>>>>
>>>>>>>>>>> Thanks,
>>>>>>>>>>> David
>>>>>>>>>>> On 8/13/2025 9:45 AM, Alex Deucher wrote:
>>>>>>>>>>>> If there are multiple instances of the VCN running, we may
>>>>>>>>>>>> end up switching the video profile while another instance
>>>>>>>>>>>> is active because we only take into account the current
>>>>>>>>>>>> instance's submissions.  Look at all outstanding fences for
>>>>>>>>>>>> the video profile.
>>>>>>>>>>>>
>>>>>>>>>>>> v2: drop early exit in begin_use()
>>>>>>>>>>>> v3: handle possible race between begin_use() work handler
>>>>>>>>>>>>
>>>>>>>>>>>> Fixes: 3b669df92c85 ("drm/amdgpu/vcn: adjust workload
>>>>>>>>>>>> profile
>>>>>>>>>>>> handling")
>>>>>>>>>>>> Reviewed-by: Sathishkumar S
>>>>>>>>>>>> <sathishkumar.sundarar...@amd.com> (v1)
>>>>>>>>>>>> Signed-off-by: Alex Deucher <alexander.deuc...@amd.com>
>>>>>>>>>>>> ---
>>>>>>>>>>>>         drivers/gpu/drm/amd/amdgpu/amdgpu_vcn.c | 40 ++++++++++++-------------
>>>>>>>>>>>>         drivers/gpu/drm/amd/amdgpu/amdgpu_vcn.h |  1 +
>>>>>>>>>>>>         2 files changed, 21 insertions(+), 20 deletions(-)
>>>>>>>>>>>>
>>>>>>>>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vcn.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_vcn.c
>>>>>>>>>>>> index 9a76e11d1c184..593c1ddf8819b 100644
>>>>>>>>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vcn.c
>>>>>>>>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vcn.c
>>>>>>>>>>>> @@ -415,19 +415,25 @@ static void amdgpu_vcn_idle_work_handler(struct work_struct *work)
>>>>>>>>>>>>             struct amdgpu_vcn_inst *vcn_inst =
>>>>>>>>>>>>                     container_of(work, struct amdgpu_vcn_inst, idle_work.work);
>>>>>>>>>>>>             struct amdgpu_device *adev = vcn_inst->adev;
>>>>>>>>>>>> -     unsigned int fences = 0, fence[AMDGPU_MAX_VCN_INSTANCES] = {0};
>>>>>>>>>>>> -     unsigned int i = vcn_inst->inst, j;
>>>>>>>>>>>> +     unsigned int total_fences = 0, fence[AMDGPU_MAX_VCN_INSTANCES] = {0};
>>>>>>>>>>>> +     unsigned int i, j;
>>>>>>>>>>>>             int r = 0;
>>>>>>>>>>>>
>>>>>>>>>>>> -     if (adev->vcn.harvest_config & (1 << i))
>>>>>>>>>>>> +     if (adev->vcn.harvest_config & (1 << vcn_inst->inst))
>>>>>>>>>>>>                     return;
>>>>>>>>>>>>
>>>>>>>>>>>> -     for (j = 0; j < adev->vcn.inst[i].num_enc_rings; ++j)
>>>>>>>>>>>> -             fence[i] += amdgpu_fence_count_emitted(&vcn_inst->ring_enc[j]);
>>>>>>>>>>>> +     for (i = 0; i < adev->vcn.num_vcn_inst; ++i) {
>>>>>>>>>>>> +             struct amdgpu_vcn_inst *v = &adev->vcn.inst[i];
>>>>>>>>>>>> +
>>>>>>>>>>>> +             for (j = 0; j < v->num_enc_rings; ++j)
>>>>>>>>>>>> +                     fence[i] += amdgpu_fence_count_emitted(&v->ring_enc[j]);
>>>>>>>>>>>> +             fence[i] += amdgpu_fence_count_emitted(&v->ring_dec);
>>>>>>>>>>>> +             total_fences += fence[i];
>>>>>>>>>>>> +     }
>>>>>>>>>>>>
>>>>>>>>>>>>             /* Only set DPG pause for VCN3 or below, VCN4 and above will be handled by FW */
>>>>>>>>>>>>             if (adev->pg_flags & AMD_PG_SUPPORT_VCN_DPG &&
>>>>>>>>>>>> -         !adev->vcn.inst[i].using_unified_queue) {
>>>>>>>>>>>> +         !vcn_inst->using_unified_queue) {
>>>>>>>>>>>>                     struct dpg_pause_state new_state;
>>>>>>>>>>>>
>>>>>>>>>>>>                     if (fence[i] ||
>>>>>>>>>>>> @@ -436,18 +442,18 @@ static void amdgpu_vcn_idle_work_handler(struct work_struct *work)
>>>>>>>>>>>>                     else
>>>>>>>>>>>>                             new_state.fw_based = VCN_DPG_STATE__UNPAUSE;
>>>>>>>>>>>>
>>>>>>>>>>>> -             adev->vcn.inst[i].pause_dpg_mode(vcn_inst, &new_state);
>>>>>>>>>>>> +             vcn_inst->pause_dpg_mode(vcn_inst, &new_state);
>>>>>>>>>>>>             }
>>>>>>>>>>>>
>>>>>>>>>>>> -     fence[i] += amdgpu_fence_count_emitted(&vcn_inst->ring_dec);
>>>>>>>>>>>> -     fences += fence[i];
>>>>>>>>>>>> -
>>>>>>>>>>>> -     if (!fences && !atomic_read(&vcn_inst->total_submission_cnt)) {
>>>>>>>>>>>> +     if (!fence[vcn_inst->inst] && !atomic_read(&vcn_inst->total_submission_cnt)) {
>>>>>>>>>>>> +             /* This is specific to this instance */
>>>>>>>>>>>>                     mutex_lock(&vcn_inst->vcn_pg_lock);
>>>>>>>>>>>>                     vcn_inst->set_pg_state(vcn_inst, AMD_PG_STATE_GATE);
>>>>>>>>>>>>                     mutex_unlock(&vcn_inst->vcn_pg_lock);
>>>>>>>>>>>>                     mutex_lock(&adev->vcn.workload_profile_mutex);
>>>>>>>>>>>> -             if (adev->vcn.workload_profile_active) {
>>>>>>>>>>>> +             /* This is global and depends on all VCN instances */
>>>>>>>>>>>> +             if (adev->vcn.workload_profile_active && !total_fences &&
>>>>>>>>>>>> +                 !atomic_read(&adev->vcn.total_submission_cnt)) {
>>>>>>>>>>>>                             r = amdgpu_dpm_switch_power_profile(adev, PP_SMC_POWER_PROFILE_VIDEO, false);
>>>>>>>>>>>>                             if (r)
>>>>>>>>>>>> @@ -467,16 +473,10 @@ void amdgpu_vcn_ring_begin_use(struct amdgpu_ring *ring)
>>>>>>>>>>>>             int r = 0;
>>>>>>>>>>>>
>>>>>>>>>>>>             atomic_inc(&vcn_inst->total_submission_cnt);
>>>>>>>>>>>> +     atomic_inc(&adev->vcn.total_submission_cnt);
>>>>>>>>>>> move this addition down inside the mutex lock
>>>>>>>>>>>> cancel_delayed_work_sync(&vcn_inst->idle_work);
>>>>>>>>>>>>
>>>>>>>>>>>> -     /* We can safely return early here because we've cancelled the
>>>>>>>>>>>> -      * the delayed work so there is no one else to set it to false
>>>>>>>>>>>> -      * and we don't care if someone else sets it to true.
>>>>>>>>>>>> -      */
>>>>>>>>>>>> -     if (adev->vcn.workload_profile_active)
>>>>>>>>>>>> -             goto pg_lock;
>>>>>>>>>>>> -
>>>>>>>>>>>>             mutex_lock(&adev->vcn.workload_profile_mutex);
>>>>>>>>>>> move to here:
>>>>>>>>>>> atomic_inc(&adev->vcn.total_submission_cnt);
>>>>>>>>>>> I think this should work for multiple instances.
>>>>>>>>>> Why does this need to be protected by the mutex?
>>>>>>>>> hmm.. OK - no need and it is actually better before the mutex.
>>>>>>>>> David
>>>>>>>>>> Alex
>>>>>>>>>>
>>>>>>>>>>> David
>>>>>>>>>>>>             if (!adev->vcn.workload_profile_active) {
>>>>>>>>>>>>                     r = amdgpu_dpm_switch_power_profile(adev, PP_SMC_POWER_PROFILE_VIDEO,
>>>>>>>>>>>> @@ -487,7 +487,6 @@ void amdgpu_vcn_ring_begin_use(struct amdgpu_ring *ring)
>>>>>>>>>>>>             }
>>>>>>>>>>>>             mutex_unlock(&adev->vcn.workload_profile_mutex);
>>>>>>>>>>>>
>>>>>>>>>>>> -pg_lock:
>>>>>>>>>>>>             mutex_lock(&vcn_inst->vcn_pg_lock);
>>>>>>>>>>>>             vcn_inst->set_pg_state(vcn_inst, AMD_PG_STATE_UNGATE);
>>>>>>>>>>>>
>>>>>>>>>>>> @@ -528,6 +527,7 @@ void amdgpu_vcn_ring_end_use(struct amdgpu_ring *ring)
>>>>>>>>>>>> atomic_dec(&ring->adev->vcn.inst[ring->me].dpg_enc_submission_cnt);
>>>>>>>>>>>>
>>>>>>>>>>>> atomic_dec(&ring->adev->vcn.inst[ring->me].total_submission_cnt);
>>>>>>>>>>>> + atomic_dec(&ring->adev->vcn.total_submission_cnt);
>>>>>>>>>>>>
>>>>>>>>>>>> schedule_delayed_work(&ring->adev->vcn.inst[ring->me].idle_work,
>>>>>>>>>>>>                                   VCN_IDLE_TIMEOUT);
>>>>>>>>>>>>
>>>>>>>>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vcn.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_vcn.h
>>>>>>>>>>>> index b3fb1d0e43fc9..febc3ce8641ff 100644
>>>>>>>>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vcn.h
>>>>>>>>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vcn.h
>>>>>>>>>>>> @@ -352,6 +352,7 @@ struct amdgpu_vcn {
>>>>>>>>>>>>
>>>>>>>>>>>>             uint16_t inst_mask;
>>>>>>>>>>>>             uint8_t num_inst_per_aid;
>>>>>>>>>>>> +     atomic_t                total_submission_cnt;
>>>>>>>>>>>>
>>>>>>>>>>>>             /* IP reg dump */
>>>>>>>>>>>>             uint32_t                *ip_dump;
