On 08/01/2026 09:15, Christian König wrote:
On 1/8/26 10:05, Tvrtko Ursulin wrote:
On 07/01/2026 14:46, Tvrtko Ursulin wrote:
On 07/01/2026 09:01, Christian König wrote:
On 12/19/25 14:41, Tvrtko Ursulin wrote:
Struct amdgpu_ctx contains two copies of the pointer to the context
manager. Remove one.
Signed-off-by: Tvrtko Ursulin <[email protected]>
---
drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c | 3 +--
drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.h | 1 -
2 files changed, 1 insertion(+), 3 deletions(-)
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c b/drivers/gpu/
drm/amd/amdgpu/amdgpu_ctx.c
index afedea02188d..41c05358d86d 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c
@@ -232,7 +232,7 @@ static int amdgpu_ctx_init_entity(struct amdgpu_ctx *ctx,
u32 hw_ip,
} else {
struct amdgpu_fpriv *fpriv;
- fpriv = container_of(ctx->ctx_mgr, struct amdgpu_fpriv, ctx_mgr);
+ fpriv = container_of(ctx->mgr, struct amdgpu_fpriv, ctx_mgr);
r = amdgpu_xcp_select_scheds(adev, hw_ip, hw_prio, fpriv,
&num_scheds, &scheds);
Well that code is utterly nonsense to begin with. amdgpu_xcp_select_scheds()
needs the xcp id to select from and not fpriv.
Can you look into re-structuring this so that we don't need that cast?
I had a look and so far only cleanup it up visually a bit so there is fewer long
array subscript dereferences and confusion between sel_xcp_id and priv->xcp_id.
But on a more fundamental level, since it needs to write to fpriv, the caller
will need to have it one way or the other, no?
And then I noticed not only the atomic_read/inc usage is dodgy, but the
fpriv->xcp_id assignment itself is racy. Two threads submitting to the same new
entity appears can end up with a refcount imbalance and probably worse.
Shall I replace the ref_cnt atomic with a mutex and protect the whole selection?
Or maybe there is no race?
fpriv->xcp_id is first assigned in amdgpu_driver_open_kms() and there it looks it
can mostly fail or succeed. I say mostly because the one silent failure path (not
failing the device open) I see if xcp->ddev would be NULL. I am not sure if/when
that can happen? If it can happen and that is the reason ctx init needs to retry the
xcp_id selection? In which case it is racy.
xcp_id selection can only happen in amdgpu_driver_open_kms(), anytime later and
you have completely messed up page tables.
So no xcp code should ever overwrite fpriv->xcp_id. If we have stuff like that
then that is clearly a bug as well.
Thanks for pointing that out,
Only if post open fpriv->xcp_id ends up AMDGPU_XCP_NO_PARTITION due
finding an xcp with xcp->ddev == NULL. Again, I am not sure if/when that
can happen. But if it does, next ctx entity init will retry via
amdgpu_xcp_select_scheds.
Should I keep this patch as is for now and re-sent the series with the
other updates? IMO it is still worth dropping the duplicate member even
if it is followed up by more work in this area.
Regards,
Tvrtko
if (r)
@@ -349,7 +349,6 @@ static int amdgpu_ctx_init(struct amdgpu_ctx_mgr *mgr,
int32_t priority,
else
ctx->stable_pstate = current_stable_pstate;
- ctx->ctx_mgr = &(fpriv->ctx_mgr);
return 0;
}
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.h b/drivers/gpu/
drm/amd/amdgpu/amdgpu_ctx.h
index aed758d0acaa..cf8d700a22fe 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.h
@@ -56,7 +56,6 @@ struct amdgpu_ctx {
unsigned long ras_counter_ce;
unsigned long ras_counter_ue;
struct amdgpu_ctx_mgr *mgr;
- struct amdgpu_ctx_mgr *ctx_mgr;
struct amdgpu_ctx_entity *entities[AMDGPU_HW_IP_NUM]
[AMDGPU_MAX_ENTITY_NUM];
};