On Fri, 2025-10-03 at 10:26 +0100, Tvrtko Ursulin wrote:
> Drm_sched_job_add_dependency() consumes the fence reference both on
s/D/d
> success and failure, so in the latter case the dma_fence_put() on the
> error path (xarray failed to expand) is a double free.
>
> Interestingly this bug appears to have been present ever since
> ebd5f74255b9 ("drm/sched: Add dependency tracking"), since the code back
> then looked like this:
>
> drm_sched_job_add_implicit_dependencies():
> ...
> for (i = 0; i < fence_count; i++) {
> ret = drm_sched_job_add_dependency(job, fences[i]);
> if (ret)
> break;
> }
>
> for (; i < fence_count; i++)
> dma_fence_put(fences[i]);
>
> Which means for the failing 'i' the dma_fence_put was already a double
> free. Possibly there were no users at that time, or the test cases were
> insufficient to hit it.
>
> The bug was then only noticed and fixed after
> 9c2ba265352a ("drm/scheduler: use new iterator in
> drm_sched_job_add_implicit_dependencies v2")
> landed, with its fixup of
> 4eaf02d6076c ("drm/scheduler: fix drm_sched_job_add_implicit_dependencies").
>
> At that point it was a slightly different flavour of a double free, which
> 963d0b356935 ("drm/scheduler: fix drm_sched_job_add_implicit_dependencies
> harder")
> noticed and attempted to fix.
>
> But it only moved the double free from happening inside the
> drm_sched_job_add_dependency(), when releasing the reference not yet
> obtained, to the caller, when releasing the reference already released by
> the former in the failure case.
That's certainly interesting, but is there a specific reason why you
include all of that?
The code is as is, and AFAICS it's just a bug stemming from original
bugs present and then refactorings happening.
I would at least remove the old 'implicit_dependencies' function from
the commit message. It's just confusing and makes one look for that in
the current code or patch.
>
> As such it is not easy to identify the right target for the fixes tag so
> lets keep it simple and just continue the chain.
>
> We also drop the misleading comment about additional reference, since it
> is not additional but the only one from the point of view of dependency
> tracking.
IMO that comment is nonsense. It's useless, too, because I can *see*
that a reference is being taken there, but not *why*.
Argh, these comments. See also my commit 72ebc18b34993
Anyways. Removing it is fine, but adding a better comment is better.
See below.
>
> Signed-off-by: Tvrtko Ursulin <[email protected]>
> Fixes: 963d0b356935 ("drm/scheduler: fix
> drm_sched_job_add_implicit_dependencies harder")
> Reported-by: Dan Carpenter <[email protected]>
Is there an error report that could be included here with a Closes:
tag?
> Cc: Christian König <[email protected]>
> Cc: Rob Clark <[email protected]>
> Cc: Daniel Vetter <[email protected]>
> Cc: Matthew Brost <[email protected]>
> Cc: Danilo Krummrich <[email protected]>
> Cc: Philipp Stanner <[email protected]>
> Cc: "Christian König" <[email protected]>
> Cc: [email protected]
> Cc: <[email protected]> # v5.16+
> ---
> drivers/gpu/drm/scheduler/sched_main.c | 14 +++++---------
> 1 file changed, 5 insertions(+), 9 deletions(-)
>
> diff --git a/drivers/gpu/drm/scheduler/sched_main.c
> b/drivers/gpu/drm/scheduler/sched_main.c
> index 46119aacb809..aff34240f230 100644
> --- a/drivers/gpu/drm/scheduler/sched_main.c
> +++ b/drivers/gpu/drm/scheduler/sched_main.c
> @@ -960,20 +960,16 @@ int drm_sched_job_add_resv_dependencies(struct
> drm_sched_job *job,
> {
> struct dma_resv_iter cursor;
> struct dma_fence *fence;
> - int ret;
> + int ret = 0;
>
> dma_resv_assert_held(resv);
>
> dma_resv_for_each_fence(&cursor, resv, usage, fence) {
> - /* Make sure to grab an additional ref on the added fence */
> - dma_fence_get(fence);
> - ret = drm_sched_job_add_dependency(job, fence);
> - if (ret) {
> - dma_fence_put(fence);
> - return ret;
> - }
> + ret = drm_sched_job_add_dependency(job, dma_fence_get(fence));
You still take a reference as before, but there is no comment anymore.
Can you add one explaining why a new reference is taken here?
I guess it will be something like "This needs a new reference for the
job", since you cannot rely on the one from resv.
> + if (ret)
> + break;
> }
> - return 0;
> + return ret;
That's an unnecessarily enlargement of the git diff because of style,
isn't it? Better keep the diff minimal here for git blame.
P.
> }
> EXPORT_SYMBOL(drm_sched_job_add_resv_dependencies);
>