On 1/7/26 13:43, Tvrtko Ursulin wrote:
> Currently there are two different behaviour modes when userspace tries to
> operate on HW IP blocks which are not present. On a machine without the
> UVD, VCE and VPE blocks, this can for example be observed like this:
>
> $ sudo ./amd_fuzzing --r cs-wait-fuzzing
> ...
> amd_cs_wait_fuzzing DRM_IOCTL_AMDGPU_CTX r 0
> amd_cs_wait_fuzzing AMDGPU_WAIT_CS AMD_IP_GFX r 0
> amd_cs_wait_fuzzing AMDGPU_WAIT_CS AMD_IP_COMPUTE r 0
> amd_cs_wait_fuzzing AMDGPU_WAIT_CS AMD_IP_DMA r 0
> amd_cs_wait_fuzzing AMDGPU_WAIT_CS AMD_IP_UVD r -1
> amd_cs_wait_fuzzing AMDGPU_WAIT_CS AMD_IP_VCE r 0
> amd_cs_wait_fuzzing AMDGPU_WAIT_CS AMD_IP_UVD_ENC r -1
> amd_cs_wait_fuzzing AMDGPU_WAIT_CS AMD_IP_VCN_DEC r 0
> amd_cs_wait_fuzzing AMDGPU_WAIT_CS AMD_IP_VCN_ENC r 0
> amd_cs_wait_fuzzing AMDGPU_WAIT_CS AMD_IP_VCN_JPEG r 0
> amd_cs_wait_fuzzing AMDGPU_WAIT_CS AMD_IP_VPE r 0
>
> We can see that UVD returns an errno (-EINVAL) from the CS_WAIT ioctl,
> while VCE and VPE return unexpected successes.
>
> The difference stems from the fact that UVD is a load balancing engine
> which retains the context, so a workaround for that is implemented in
> amdgpu_ctx_init_entity(), but the workaround does not account for the
> hardware block not being present.
>
> This causes a single NULL scheduler to be passed to
> drm_sched_entity_init(), which immediately rejects it with -EINVAL.
>
> The not present VCE and VPE cases, on the other hand, pass zero
> schedulers to drm_sched_entity_init(), which is explicitly allowed and
> results in unusable entities.
>
> As the UVD case shows, however, the call paths can handle the error, so
> we can consolidate this into a single path which always returns -EINVAL
> when the HW IP block is not present.
>
> We do this by rejecting the request early and not calling
> drm_sched_entity_init() at all when there is no backing hardware.
>
> This also removes the need for drm_sched_entity_init() to handle the
> zero-scheduler and NULL-scheduler cases, which means we can follow up by
> removing that special casing from the DRM scheduler.
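For anyone following along, the load balancing workaround in
amdgpu_ctx_init_entity() that the commit message refers to looks roughly
like the snippet below. This is a simplified paraphrase rather than the
exact upstream code, so the variable names and the exact hw_ip list may
differ slightly:

	scheds = adev->gpu_sched[hw_ip][hw_prio].sched;
	num_scheds = adev->gpu_sched[hw_ip][hw_prio].num_scheds;

	/* disable load balance if the hw engine retains context among
	 * dependent jobs
	 */
	if (hw_ip == AMDGPU_HW_IP_VCN_ENC ||
	    hw_ip == AMDGPU_HW_IP_VCN_DEC ||
	    hw_ip == AMDGPU_HW_IP_UVD_ENC ||
	    hw_ip == AMDGPU_HW_IP_UVD) {
		/* With num_scheds == 0 drm_sched_pick_best() finds nothing
		 * and returns NULL ...
		 */
		sched = drm_sched_pick_best(scheds, num_scheds);
		scheds = &sched;
		num_scheds = 1;
	}

	/* ... so the UVD path hands a single NULL scheduler to the entity
	 * init, which fails with -EINVAL, while the VCE/VPE path keeps
	 * num_scheds == 0 and "succeeds" with an unusable entity.
	 */
	r = drm_sched_entity_init(&entity->entity, drm_prio, scheds,
				  num_scheds, &ctx->guilty);

The new check in the patch rejects num_scheds == 0 before either of those
two paths is reached.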
Yeah, I remember that Luben looked into that issue at some point as well.

IIRC the solution back then looked similar to this one, but we had to
revert it at some point because it caused issues. It could be that those
issues are solved by now, but even if not, applying the patches and
fixing the fallout is probably the more reasonable thing to do.

So Reviewed-by: Christian König <[email protected]> for the entire series.

Regards,
Christian.

>
> Signed-off-by: Tvrtko Ursulin <[email protected]>
> References: f34e8bb7d6c6 ("drm/sched: fix null-ptr-deref in init entity")
> Cc: Christian König <[email protected]>
> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c | 5 +++++
>  1 file changed, 5 insertions(+)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c
> index afedea02188d..a5f85ea9fbb6 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c
> @@ -239,6 +239,11 @@ static int amdgpu_ctx_init_entity(struct amdgpu_ctx *ctx, u32 hw_ip,
>  		goto error_free_entity;
>  	}
>
> +	if (num_scheds == 0) {
> +		r = -EINVAL;
> +		goto error_free_entity;
> +	}
> +
>  	/* disable load balance if the hw engine retains context among dependent jobs */
>  	if (hw_ip == AMDGPU_HW_IP_VCN_ENC ||
>  	    hw_ip == AMDGPU_HW_IP_VCN_DEC ||
