On 1/8/26 10:05, Tvrtko Ursulin wrote:
>
> On 07/01/2026 14:46, Tvrtko Ursulin wrote:
>>
>> On 07/01/2026 09:01, Christian König wrote:
>>> On 12/19/25 14:41, Tvrtko Ursulin wrote:
>>>> Struct amdgpu_ctx contains two copies of the pointer to the context
>>>> manager. Remove one.
>>>>
>>>> Signed-off-by: Tvrtko Ursulin <[email protected]>
>>>> ---
>>>> drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c | 3 +--
>>>> drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.h | 1 -
>>>> 2 files changed, 1 insertion(+), 3 deletions(-)
>>>>
>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c b/drivers/gpu/
>>>> drm/amd/amdgpu/amdgpu_ctx.c
>>>> index afedea02188d..41c05358d86d 100644
>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c
>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c
>>>> @@ -232,7 +232,7 @@ static int amdgpu_ctx_init_entity(struct amdgpu_ctx
>>>> *ctx, u32 hw_ip,
>>>> } else {
>>>> struct amdgpu_fpriv *fpriv;
>>>> - fpriv = container_of(ctx->ctx_mgr, struct amdgpu_fpriv, ctx_mgr);
>>>> + fpriv = container_of(ctx->mgr, struct amdgpu_fpriv, ctx_mgr);
>>>> r = amdgpu_xcp_select_scheds(adev, hw_ip, hw_prio, fpriv,
>>>> &num_scheds, &scheds);
>>>
>>> Well that code is utterly nonsense to begin with.
>>> amdgpu_xcp_select_scheds() needs the xcp id to select from and not fpriv.
>>>
>>> Can you look into re-structuring this so that we don't need that cast?
>>
>> I had a look and so far only cleanup it up visually a bit so there is fewer
>> long array subscript dereferences and confusion between sel_xcp_id and
>> priv->xcp_id.
>>
>> But on a more fundamental level, since it needs to write to fpriv, the
>> caller will need to have it one way or the other, no?
>>
>> And then I noticed not only the atomic_read/inc usage is dodgy, but the
>> fpriv->xcp_id assignment itself is racy. Two threads submitting to the same
>> new entity appears can end up with a refcount imbalance and probably worse.
>>
>> Shall I replace the ref_cnt atomic with a mutex and protect the whole
>> selection?
>
> Or maybe there is no race?
>
> fpriv->xcp_id is first assigned in amdgpu_driver_open_kms() and there it
> looks it can mostly fail or succeed. I say mostly because the one silent
> failure path (not failing the device open) I see if xcp->ddev would be NULL.
> I am not sure if/when that can happen? If it can happen and that is the
> reason ctx init needs to retry the xcp_id selection? In which case it is racy.
xcp_id selection can only happen in amdgpu_driver_open_kms(), anytime later and
you have completely messed up page tables.
So no xcp code should ever overwrite fpriv->xcp_id. If we have stuff like that
then that is clearly a bug as well.
Thanks for pointing that out,
Christian.
>
> Regards,
>
> Tvrtko
>
>>
>> Regards,
>>
>> Tvrtko
>>
>>>> if (r)
>>>> @@ -349,7 +349,6 @@ static int amdgpu_ctx_init(struct amdgpu_ctx_mgr *mgr,
>>>> int32_t priority,
>>>> else
>>>> ctx->stable_pstate = current_stable_pstate;
>>>> - ctx->ctx_mgr = &(fpriv->ctx_mgr);
>>>> return 0;
>>>> }
>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.h b/drivers/gpu/
>>>> drm/amd/amdgpu/amdgpu_ctx.h
>>>> index aed758d0acaa..cf8d700a22fe 100644
>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.h
>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.h
>>>> @@ -56,7 +56,6 @@ struct amdgpu_ctx {
>>>> unsigned long ras_counter_ce;
>>>> unsigned long ras_counter_ue;
>>>> struct amdgpu_ctx_mgr *mgr;
>>>> - struct amdgpu_ctx_mgr *ctx_mgr;
>>>> struct amdgpu_ctx_entity *entities[AMDGPU_HW_IP_NUM]
>>>> [AMDGPU_MAX_ENTITY_NUM];
>>>> };
>>>
>>
>