We already have a per-instance power state that can be tracked. What about something like the attached?
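The attached patch is not inlined in this thread, so the helper below is only a guess at the shape of that check, and the name is made up for illustration. It assumes the per-instance cur_state that set_pg_state() already maintains: if every instance has been power gated, no decode can be in flight, so the VIDEO profile could be dropped without looking at the transient fence and submission counts.

/*
 * Hypothetical sketch, not the attached patch: report whether every VCN
 * instance is currently power gated (cur_state assumed to be the field
 * updated by set_pg_state() under vcn_pg_lock).
 */
static bool amdgpu_vcn_all_instances_gated(struct amdgpu_device *adev)
{
	unsigned int i;

	for (i = 0; i < adev->vcn.num_vcn_inst; ++i) {
		if (adev->vcn.harvest_config & (1 << i))
			continue;
		if (adev->vcn.inst[i].cur_state != AMD_PG_STATE_GATE)
			return false;
	}

	return true;
}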
Thanks,
Lijo

-----Original Message-----
From: amd-gfx <amd-gfx-boun...@lists.freedesktop.org> On Behalf Of Sundararaju, Sathishkumar
Sent: Thursday, August 14, 2025 4:43 AM
To: Alex Deucher <alexdeuc...@gmail.com>
Cc: Wu, David <david....@amd.com>; Deucher, Alexander <alexander.deuc...@amd.com>; amd-gfx@lists.freedesktop.org
Subject: Re: [PATCH] drm/amdgpu/vcn: fix video profile race condition (v3)

On 8/14/2025 3:38 AM, Alex Deucher wrote:
> On Wed, Aug 13, 2025 at 5:16 PM Sundararaju, Sathishkumar
> <sathishkumar.sundarar...@amd.com> wrote:
>>
>> On 8/14/2025 2:33 AM, Alex Deucher wrote:
>>> On Wed, Aug 13, 2025 at 4:58 PM Sundararaju, Sathishkumar
>>> <sathishkumar.sundarar...@amd.com> wrote:
>>>> On 8/14/2025 1:35 AM, Alex Deucher wrote:
>>>>> On Wed, Aug 13, 2025 at 2:23 PM Sundararaju, Sathishkumar
>>>>> <sathishkumar.sundarar...@amd.com> wrote:
>>>>>> Hi Alex, Hi David,
>>>>>>
>>>>>> I see David's concern, but his suggestion still won't solve the
>>>>>> problem, and neither does the current form. The reason:
>>>>>>
>>>>>> The emitted fence count and total submission count are fast
>>>>>> transients which frequently become 0 in between video decodes
>>>>>> (between jobs). Even with the atomics and locks there can be a
>>>>>> switch of the video power profile; in the current form of the
>>>>>> patch that window is minimized, but it can still happen if
>>>>>> stress tested. But the power state of any instance becoming zero
>>>>> Can you explain how this can happen? I'm not seeing it.
>>>> Consider this situation: inst0 and inst1 are actively decoding,
>>>> inst0's decode completes, and the delayed idle work starts.
>>>> inst0's idle handler can read 0 total fences and 0 total submission
>>>> count even while inst1 is actively decoding, because the read lands
>>>> between jobs:
>>>> - begin_use increments vcn.total_submission_cnt and end_use
>>>>   decrements it, so between jobs it can be 0.
>>>> - if the outstanding fences have signalled and no new fence has
>>>>   been emitted, between jobs that count can be 0 as well.
>>>> - neither of the above conditions means video decode is complete
>>>>   on inst1; it is actively decoding.
>>> How can there be active decoding without an outstanding fence? In
>>> that case, total_fences (fences from both instances) would be non-0.
>> I mean the job scheduled on inst1 is already complete, so there are 0
>> outstanding fences, and the next job is yet to be scheduled and
>> committed to the ring (inst1). This doesn't mean decode has stopped
>> on inst1, right? (I am saying the timing of inst0's idle work
>> coincides with this.)
>>
>> Or am I wrong in assuming this? Can't this ever happen? Please
>> correct my understanding here.
> The flow looks like:
>
> begin_use(inst)
> emit_fence(inst)
> end_use(inst)
>
> ...later
> fence signals
> ...later
> work handler
>
> In begin_use we increment the global and per instance submission.
> This protects the power gating and profile until end_use. In end_use
> we decrement the submissions because we don't need to protect anything
> any more, as we have the fence that was submitted via the ring. That
> fence won't signal until the job is complete.

Is the next begin_use always guaranteed to run before the current job's
fence signals? If not, then both the total submission count and the
total fence count are zero, for example with delayed job/packet
submissions from user space, or when the next job is scheduled only
after the current job's fence signals. If this can never happen, then
(v3) is perfect.
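To make the question concrete, here is a minimal sketch (the helper name
is made up, it is not part of the patch) of the global guard v3 relies
on before dropping the VIDEO profile. The scenario above is the case
where this would read 0/0 between two jobs of a still-active stream,
i.e. after the previous fence has signalled but before the next
begin_use() reaches the driver:

/*
 * Sketch only: the two conditions the v3 idle handler checks, emitted
 * fences across all VCN instances plus the global submission count the
 * patch adds.
 */
static bool amdgpu_vcn_video_profile_still_needed(struct amdgpu_device *adev)
{
	unsigned int total_fences = 0, i, j;

	for (i = 0; i < adev->vcn.num_vcn_inst; ++i) {
		struct amdgpu_vcn_inst *v = &adev->vcn.inst[i];

		if (adev->vcn.harvest_config & (1 << i))
			continue;
		for (j = 0; j < v->num_enc_rings; ++j)
			total_fences += amdgpu_fence_count_emitted(&v->ring_enc[j]);
		total_fences += amdgpu_fence_count_emitted(&v->ring_dec);
	}

	/* 0/0 here is what the between-jobs window above would produce */
	return total_fences || atomic_read(&adev->vcn.total_submission_cnt);
}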
Regards,
Sathish

> For power gating, we only care about the submission count and fences
> for that instance; for the profile, we care about the submission count
> and fences of all instances. Once the fences have signalled, the
> outstanding fences will be 0 and there won't be any active work.
>
> Alex
>
>> Regards,
>>
>> Sathish
>>
>>> Alex
>>>
>>>> Whereas if the instances are powered off we are sure the idle time
>>>> has passed and they are powered off, with no possible way of an
>>>> active video decode. When all instances are off we can safely
>>>> assume no active decode, and the global lock protects against a new
>>>> begin_use on any instance. The only distant concern with a global
>>>> common lock is perf, but we already have a global workload profile
>>>> mutex, so there shouldn't be any drop in perf with just one single
>>>> global lock for all instances.
>>>>
>>>> Just sending out a patch with this fix; I will leave it to you to
>>>> decide the right method. If you think the outstanding total fences
>>>> can never be 0 during decode, then your previous version (v3)
>>>> itself is good and there is no real benefit in splitting the
>>>> handlers as such.
>>>>
>>>> Regards,
>>>> Sathish
>>>>> If it is possible, maybe it would be easier to just split the
>>>>> profile and powergating into separate handlers. The profile one
>>>>> would be global and the powergating one would be per instance.
>>>>> See the attached patches.
>>>>>
>>>>> Alex
>>>>>
>>>>>> can be a sure-shot indication of a break in a video decode. The
>>>>>> mistake in my patch was using a per-instance mutex; I should have
>>>>>> used a common global mutex, and that covers the situation David
>>>>>> is trying to bring out.
>>>>>>
>>>>>> Using one global vcn.pg_lock for idle and begin_use and using
>>>>>> flags to track power state could help us totally avoid this
>>>>>> situation.
>>>>>>
>>>>>> Regards,
>>>>>>
>>>>>> Sathish
>>>>>>
>>>>>> On 8/13/2025 11:46 PM, Wu, David wrote:
>>>>>>> On 8/13/2025 12:51 PM, Alex Deucher wrote:
>>>>>>>> On Wed, Aug 13, 2025 at 12:39 PM Wu, David <david...@amd.com> wrote:
>>>>>>>>> Hi Alex,
>>>>>>>>>
>>>>>>>>> The addition of total_submission_cnt should work, in that it
>>>>>>>>> is unlikely to have a context switch right after the
>>>>>>>>> begin_use(). The suggestion of moving it inside the lock
>>>>>>>>> (which I prefer, in case someone adds more code before the
>>>>>>>>> lock and it is not reviewed thoroughly) - up to you to decide.
>>>>>>>>>
>>>>>>>>> Reviewed-by: David (Ming Qiang) Wu <david....@amd.com>
>>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>> David
>>>>>>>>> On 8/13/2025 9:45 AM, Alex Deucher wrote:
>>>>>>>>>> If there are multiple instances of the VCN running, we may
>>>>>>>>>> end up switching the video profile while another instance is
>>>>>>>>>> active because we only take into account the current
>>>>>>>>>> instance's submissions. Look at all outstanding fences for
>>>>>>>>>> the video profile.
>>>>>>>>>>
>>>>>>>>>> v2: drop early exit in begin_use()
>>>>>>>>>> v3: handle possible race between begin_use() and the work handler
>>>>>>>>>>
>>>>>>>>>> Fixes: 3b669df92c85 ("drm/amdgpu/vcn: adjust workload profile handling")
>>>>>>>>>> Reviewed-by: Sathishkumar S <sathishkumar.sundarar...@amd.com> (v1)
>>>>>>>>>> Signed-off-by: Alex Deucher <alexander.deuc...@amd.com>
>>>>>>>>>> ---
>>>>>>>>>>  drivers/gpu/drm/amd/amdgpu/amdgpu_vcn.c | 40 ++++++++++++-------------
>>>>>>>>>>  drivers/gpu/drm/amd/amdgpu/amdgpu_vcn.h |  1 +
>>>>>>>>>>  2 files changed, 21 insertions(+), 20 deletions(-)
>>>>>>>>>>
>>>>>>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vcn.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_vcn.c
>>>>>>>>>> index 9a76e11d1c184..593c1ddf8819b 100644
>>>>>>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vcn.c
>>>>>>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vcn.c
>>>>>>>>>> @@ -415,19 +415,25 @@ static void amdgpu_vcn_idle_work_handler(struct work_struct *work)
>>>>>>>>>>       struct amdgpu_vcn_inst *vcn_inst =
>>>>>>>>>>           container_of(work, struct amdgpu_vcn_inst, idle_work.work);
>>>>>>>>>>       struct amdgpu_device *adev = vcn_inst->adev;
>>>>>>>>>> -     unsigned int fences = 0, fence[AMDGPU_MAX_VCN_INSTANCES] = {0};
>>>>>>>>>> -     unsigned int i = vcn_inst->inst, j;
>>>>>>>>>> +     unsigned int total_fences = 0, fence[AMDGPU_MAX_VCN_INSTANCES] = {0};
>>>>>>>>>> +     unsigned int i, j;
>>>>>>>>>>       int r = 0;
>>>>>>>>>>
>>>>>>>>>> -     if (adev->vcn.harvest_config & (1 << i))
>>>>>>>>>> +     if (adev->vcn.harvest_config & (1 << vcn_inst->inst))
>>>>>>>>>>           return;
>>>>>>>>>>
>>>>>>>>>> -     for (j = 0; j < adev->vcn.inst[i].num_enc_rings; ++j)
>>>>>>>>>> -         fence[i] += amdgpu_fence_count_emitted(&vcn_inst->ring_enc[j]);
>>>>>>>>>> +     for (i = 0; i < adev->vcn.num_vcn_inst; ++i) {
>>>>>>>>>> +         struct amdgpu_vcn_inst *v = &adev->vcn.inst[i];
>>>>>>>>>> +
>>>>>>>>>> +         for (j = 0; j < v->num_enc_rings; ++j)
>>>>>>>>>> +             fence[i] += amdgpu_fence_count_emitted(&v->ring_enc[j]);
>>>>>>>>>> +         fence[i] += amdgpu_fence_count_emitted(&v->ring_dec);
>>>>>>>>>> +         total_fences += fence[i];
>>>>>>>>>> +     }
>>>>>>>>>>
>>>>>>>>>>       /* Only set DPG pause for VCN3 or below, VCN4 and above will be handled by FW */
>>>>>>>>>>       if (adev->pg_flags & AMD_PG_SUPPORT_VCN_DPG &&
>>>>>>>>>> -         !adev->vcn.inst[i].using_unified_queue) {
>>>>>>>>>> +         !vcn_inst->using_unified_queue) {
>>>>>>>>>>           struct dpg_pause_state new_state;
>>>>>>>>>>
>>>>>>>>>>           if (fence[i] ||
>>>>>>>>>> @@ -436,18 +442,18 @@ static void amdgpu_vcn_idle_work_handler(struct work_struct *work)
>>>>>>>>>>           else
>>>>>>>>>>               new_state.fw_based = VCN_DPG_STATE__UNPAUSE;
>>>>>>>>>>
>>>>>>>>>> -         adev->vcn.inst[i].pause_dpg_mode(vcn_inst, &new_state);
>>>>>>>>>> +         vcn_inst->pause_dpg_mode(vcn_inst, &new_state);
>>>>>>>>>>       }
>>>>>>>>>>
>>>>>>>>>> -     fence[i] += amdgpu_fence_count_emitted(&vcn_inst->ring_dec);
>>>>>>>>>> -     fences += fence[i];
>>>>>>>>>> -
>>>>>>>>>> -     if (!fences && !atomic_read(&vcn_inst->total_submission_cnt)) {
>>>>>>>>>> +     if (!fence[vcn_inst->inst] && !atomic_read(&vcn_inst->total_submission_cnt)) {
>>>>>>>>>> +         /* This is specific to this instance */
>>>>>>>>>>           mutex_lock(&vcn_inst->vcn_pg_lock);
>>>>>>>>>>           vcn_inst->set_pg_state(vcn_inst, AMD_PG_STATE_GATE);
>>>>>>>>>>           mutex_unlock(&vcn_inst->vcn_pg_lock);
>>>>>>>>>>           mutex_lock(&adev->vcn.workload_profile_mutex);
>>>>>>>>>> -         if (adev->vcn.workload_profile_active) {
>>>>>>>>>> +         /* This is global and depends on all VCN instances */
>>>>>>>>>> +         if (adev->vcn.workload_profile_active && !total_fences &&
>>>>>>>>>> +             !atomic_read(&adev->vcn.total_submission_cnt)) {
>>>>>>>>>>               r = amdgpu_dpm_switch_power_profile(adev, PP_SMC_POWER_PROFILE_VIDEO,
>>>>>>>>>>                                                   false);
>>>>>>>>>>               if (r)
>>>>>>>>>> @@ -467,16 +473,10 @@ void amdgpu_vcn_ring_begin_use(struct amdgpu_ring *ring)
>>>>>>>>>>       int r = 0;
>>>>>>>>>>
>>>>>>>>>>       atomic_inc(&vcn_inst->total_submission_cnt);
>>>>>>>>>> +     atomic_inc(&adev->vcn.total_submission_cnt);
>>>>>>>>> move this addition down inside the mutex lock
>>>>>>>>>>       cancel_delayed_work_sync(&vcn_inst->idle_work);
>>>>>>>>>>
>>>>>>>>>> -     /* We can safely return early here because we've cancelled the
>>>>>>>>>> -      * the delayed work so there is no one else to set it to false
>>>>>>>>>> -      * and we don't care if someone else sets it to true.
>>>>>>>>>> -      */
>>>>>>>>>> -     if (adev->vcn.workload_profile_active)
>>>>>>>>>> -         goto pg_lock;
>>>>>>>>>> -
>>>>>>>>>>       mutex_lock(&adev->vcn.workload_profile_mutex);
>>>>>>>>> move to here:
>>>>>>>>>      atomic_inc(&adev->vcn.total_submission_cnt);
>>>>>>>>> I think this should work for multiple instances.
>>>>>>>> Why does this need to be protected by the mutex?
>>>>>>> hmm.. OK - no need, and it is actually better before the mutex.
>>>>>>> David
>>>>>>>> Alex
>>>>>>>>
>>>>>>>>> David
>>>>>>>>>>       if (!adev->vcn.workload_profile_active) {
>>>>>>>>>>           r = amdgpu_dpm_switch_power_profile(adev, PP_SMC_POWER_PROFILE_VIDEO,
>>>>>>>>>> @@ -487,7 +487,6 @@ void amdgpu_vcn_ring_begin_use(struct amdgpu_ring *ring)
>>>>>>>>>>       }
>>>>>>>>>>       mutex_unlock(&adev->vcn.workload_profile_mutex);
>>>>>>>>>>
>>>>>>>>>> -pg_lock:
>>>>>>>>>>       mutex_lock(&vcn_inst->vcn_pg_lock);
>>>>>>>>>>       vcn_inst->set_pg_state(vcn_inst, AMD_PG_STATE_UNGATE);
>>>>>>>>>>
>>>>>>>>>> @@ -528,6 +527,7 @@ void amdgpu_vcn_ring_end_use(struct amdgpu_ring *ring)
>>>>>>>>>>       atomic_dec(&ring->adev->vcn.inst[ring->me].dpg_enc_submission_cnt);
>>>>>>>>>>
>>>>>>>>>>       atomic_dec(&ring->adev->vcn.inst[ring->me].total_submission_cnt);
>>>>>>>>>> +     atomic_dec(&ring->adev->vcn.total_submission_cnt);
>>>>>>>>>>
>>>>>>>>>>       schedule_delayed_work(&ring->adev->vcn.inst[ring->me].idle_work,
>>>>>>>>>>                             VCN_IDLE_TIMEOUT);
>>>>>>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vcn.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_vcn.h
>>>>>>>>>> index b3fb1d0e43fc9..febc3ce8641ff 100644
>>>>>>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vcn.h
>>>>>>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vcn.h
>>>>>>>>>> @@ -352,6 +352,7 @@ struct amdgpu_vcn {
>>>>>>>>>>
>>>>>>>>>>       uint16_t inst_mask;
>>>>>>>>>>       uint8_t num_inst_per_aid;
>>>>>>>>>> +     atomic_t total_submission_cnt;
>>>>>>>>>>
>>>>>>>>>>       /* IP reg dump */
>>>>>>>>>>       uint32_t *ip_dump;
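For reference, a rough sketch of the split Alex suggests earlier in the
thread: power gating and DPG pause stay in the existing per-instance
idle work, while the workload profile moves to one global delayed work.
The profile_work field and the handler name are made up for
illustration, the counting helper is the hypothetical
amdgpu_vcn_video_profile_still_needed() sketched above, and the actually
attached patches may well do this differently.

/* Hypothetical global handler: touches only the VIDEO profile, never the
 * per-instance power gating or DPG pause state.
 */
static void amdgpu_vcn_profile_work_handler(struct work_struct *work)
{
	struct amdgpu_device *adev =
		container_of(work, struct amdgpu_device, vcn.profile_work.work);

	mutex_lock(&adev->vcn.workload_profile_mutex);
	if (adev->vcn.workload_profile_active &&
	    !amdgpu_vcn_video_profile_still_needed(adev)) {
		/* Drop the VIDEO profile only when no instance needs it */
		if (!amdgpu_dpm_switch_power_profile(adev,
						     PP_SMC_POWER_PROFILE_VIDEO,
						     false))
			adev->vcn.workload_profile_active = false;
	}
	mutex_unlock(&adev->vcn.workload_profile_mutex);
}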
Attachment: 0001-drm-amdgpu-Check-vcn-state-before-profile-switch.patch