On 1/8/26 10:05, Tvrtko Ursulin wrote:
> 
> On 07/01/2026 14:46, Tvrtko Ursulin wrote:
>>
>> On 07/01/2026 09:01, Christian König wrote:
>>> On 12/19/25 14:41, Tvrtko Ursulin wrote:
>>>> Struct amdgpu_ctx contains two copies of the pointer to the context
>>>> manager. Remove one.
>>>>
>>>> Signed-off-by: Tvrtko Ursulin <[email protected]>
>>>> ---
>>>>   drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c | 3 +--
>>>>   drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.h | 1 -
>>>>   2 files changed, 1 insertion(+), 3 deletions(-)
>>>>
>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c b/drivers/gpu/ 
>>>> drm/amd/amdgpu/amdgpu_ctx.c
>>>> index afedea02188d..41c05358d86d 100644
>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c
>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c
>>>> @@ -232,7 +232,7 @@ static int amdgpu_ctx_init_entity(struct amdgpu_ctx 
>>>> *ctx, u32 hw_ip,
>>>>       } else {
>>>>           struct amdgpu_fpriv *fpriv;
>>>> -        fpriv = container_of(ctx->ctx_mgr, struct amdgpu_fpriv, ctx_mgr);
>>>> +        fpriv = container_of(ctx->mgr, struct amdgpu_fpriv, ctx_mgr);
>>>>           r = amdgpu_xcp_select_scheds(adev, hw_ip, hw_prio, fpriv,
>>>>                           &num_scheds, &scheds);
>>>
>>> Well that code is utterly nonsense to begin with. 
>>> amdgpu_xcp_select_scheds() needs the xcp id to select from and not fpriv.
>>>
>>> Can you look into re-structuring this so that we don't need that cast?
>>
>> I had a look and so far only cleanup it up visually a bit so there is fewer 
>> long array subscript dereferences and confusion between sel_xcp_id and 
>> priv->xcp_id.
>>
>> But on a more fundamental level, since it needs to write to fpriv, the 
>> caller will need to have it one way or the other, no?
>>
>> And then I noticed not only the atomic_read/inc usage is dodgy, but the 
>> fpriv->xcp_id assignment itself is racy. Two threads submitting to the same 
>> new entity appears can end up with a refcount imbalance and probably worse.
>>
>> Shall I replace the ref_cnt atomic with a mutex and protect the whole 
>> selection?
> 
> Or maybe there is no race?
> 
> fpriv->xcp_id is first assigned in amdgpu_driver_open_kms() and there it 
> looks it can mostly fail or succeed. I say mostly because the one silent 
> failure path (not failing the device open) I see if xcp->ddev would be NULL. 
> I am not sure if/when that can happen? If it can happen and that is the 
> reason ctx init needs to retry the xcp_id selection? In which case it is racy.

xcp_id selection can only happen in amdgpu_driver_open_kms(), anytime later and 
you have completely messed up page tables.

So no xcp code should ever overwrite fpriv->xcp_id. If we have stuff like that 
then that is clearly a bug as well.

Thanks for pointing that out,
Christian.

> 
> Regards,
> 
> Tvrtko
> 
>>
>> Regards,
>>
>> Tvrtko
>>
>>>>           if (r)
>>>> @@ -349,7 +349,6 @@ static int amdgpu_ctx_init(struct amdgpu_ctx_mgr *mgr, 
>>>> int32_t priority,
>>>>       else
>>>>           ctx->stable_pstate = current_stable_pstate;
>>>> -    ctx->ctx_mgr = &(fpriv->ctx_mgr);
>>>>       return 0;
>>>>   }
>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.h b/drivers/gpu/ 
>>>> drm/amd/amdgpu/amdgpu_ctx.h
>>>> index aed758d0acaa..cf8d700a22fe 100644
>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.h
>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.h
>>>> @@ -56,7 +56,6 @@ struct amdgpu_ctx {
>>>>       unsigned long            ras_counter_ce;
>>>>       unsigned long            ras_counter_ue;
>>>>       struct amdgpu_ctx_mgr        *mgr;
>>>> -    struct amdgpu_ctx_mgr        *ctx_mgr;
>>>>       struct amdgpu_ctx_entity    *entities[AMDGPU_HW_IP_NUM] 
>>>> [AMDGPU_MAX_ENTITY_NUM];
>>>>   };
>>>
>>
> 

Reply via email to