On Tue, 2025-12-09 at 17:58 -0800, Matthew Brost wrote:
> On Tue, Dec 09, 2025 at 03:19:40PM +0100, Christian König wrote:
> 
> Hey all, apologies for the stream of replies today. I looked into this a
> bit more after finishing up my day job and had several "WTF" moments.
> I'm trying to unify the discussion starting from this reply [1].
> 
> [1] 
> https://patchwork.freedesktop.org/patch/691101/?series=158224&rev=1#comment_1273453
>  
> 
> > On 12/9/25 14:51, Philipp Stanner wrote:
> > 

[…]

> > > > 
> > > > My educated guess is that drm_sched_stop() inserted the job back into 
> > > > the pending list, but I still have no idea how it is possible that 
> > > > free_job is running after the scheduler is stopped.
> 
> I believe I found your problem, referencing amdgpu/amdgpu_device.c here.
> 
> 6718                 if (job)
> 6719                         ti = amdgpu_vm_get_task_info_pasid(adev, job->pasid);
> 
> Which is after:
> 
> 6695 skip_hw_reset:
> 6696         r = amdgpu_device_sched_resume(&device_list, reset_context, job_signaled);
> 6697         if (r)
> 6698                 goto reset_unlock;
> 
> The job is likely added back onto the pending list here (putting it back
> within free_job's reach):
> 
> 6676         amdgpu_device_halt_activities(adev, job, reset_context, &device_list,
> 6677                                       hive, need_emergency_restart);
> 
> So free_job runs and 'job->pasid' explodes.
> 
> Save off the pasid on the stack at the top of this function and I suspect
> your UAF goes away. This won't untangle this hairball of code, but I
> believe it at least prevents the explosions.
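
Something like this, I suppose (a minimal sketch against the snippets
quoted above; the surrounding recovery-function context, i.e. adev,
reset_context, device_list, hive and ti, is assumed, not checked):

        /* Sketch only: capture the pasid before the job can be freed. */
        u32 pasid = job ? job->pasid : 0;

        amdgpu_device_halt_activities(adev, job, reset_context, &device_list,
                                      hive, need_emergency_restart);

        /* ... hw reset, amdgpu_device_sched_resume(), etc. ... */

        if (job)
                ti = amdgpu_vm_get_task_info_pasid(adev, pasid); /* no deref of the maybe-freed job */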
> 
> But let's dig in further: amdgpu_device_halt_activities calls
> drm_sched_stop (Xe just calls drm_sched_wqueue_stop, for reference). That
> function stops the work items, adds the offending job back to the pending
> list, and then iterates over every pending job, trying to remove its
> hardware-fence CB. If the CB can be removed, the job is left on the
> pending list. If it cannot be removed, the job is taken off the pending
> list and free_job is called, unless it is the guilty job, in which case
> the free is deferred (free_guilty) to the timeout handler so the job
> doesn't disappear under it. Like WTF?
> 
> Oh, it gets better: amdgpu_device_sched_resume calls drm_sched_start,
> which iterates over the pending list, re-adds the same CB that
> drm_sched_stop removed, and then starts the scheduler. So the guilty job
> that had its CB successfully removed can now disappear at any moment,
> also like WTF?
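
For anyone following along, that flow, reconstructed purely from the
description above (not copied from sched_main.c, and with locking and
accounting details left out), looks roughly like this:

        void drm_sched_stop(struct drm_gpu_scheduler *sched,
                            struct drm_sched_job *bad)
        {
                struct drm_sched_job *job, *tmp;

                drm_sched_wqueue_stop(sched);                   /* stop the work items */
                list_add(&bad->list, &sched->pending_list);     /* re-insert the offender */

                list_for_each_entry_safe(job, tmp, &sched->pending_list, list) {
                        if (dma_fence_remove_callback(job->s_fence->parent, &job->cb))
                                continue;       /* CB removed: job stays on pending */

                        /* Already signaled: take the job off the pending list and
                         * either free it right here or, for the guilty job, defer
                         * the free to the timeout handler. */
                        list_del_init(&job->list);
                        if (job != bad)
                                sched->ops->free_job(job);
                        else
                                sched->free_guilty = true;
                }
        }

        void drm_sched_start(struct drm_gpu_scheduler *sched, int errno)
        {
                struct drm_sched_job *job, *tmp;

                list_for_each_entry_safe(job, tmp, &sched->pending_list, list)
                        /* Re-arm the CB that drm_sched_stop() removed; once the hw
                         * fence signals, the free_job work runs, guilty or not. */
                        dma_fence_add_callback(job->s_fence->parent, &job->cb,
                                               drm_sched_job_done_cb);

                drm_sched_wqueue_start(sched);                  /* restart the work items */
        }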
> 
> free_guilty is clearly a hack around the job not being reference
> counted, and it doesn't even work in some cases. Putting that aside, I
> think calling free_job shouldn't really ever happen in TDR. Imagine if
> drm_sched_job_timedout just took a reference to the job like normal
> kernel code: free_guilty could be dropped, and all of this would
> immediately become safe. Likewise, if the run_job work item took a
> reference to the job before adding it to the pending list and dropped it
> once it is done touching the job, then the run_job and free_job work
> items could safely execute in parallel rather than relying on an ordered
> workqueue to keep that part of the code safe.
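
If I read that proposal right, in C it would amount to something like
the following hypothetical sketch (drm_sched_job has no refcount today
and none of these helpers exist; the names are made up):

        /* Assume struct drm_sched_job gains a struct kref "refcount" and the
         * final put funnels into the driver's free_job(). */
        static void drm_sched_job_release(struct kref *kref)
        {
                struct drm_sched_job *job =
                        container_of(kref, struct drm_sched_job, refcount);

                job->sched->ops->free_job(job); /* replaces the direct free_job() calls */
        }

        static inline void drm_sched_job_get(struct drm_sched_job *job)
        {
                kref_get(&job->refcount);
        }

        static inline void drm_sched_job_put(struct drm_sched_job *job)
        {
                kref_put(&job->refcount, drm_sched_job_release);
        }

drm_sched_job_timedout() could then pin the job for the duration of the
handler instead of relying on free_guilty, and the run_job work item
would hold its own reference while it touches the job:

        drm_sched_job_get(job);         /* the job can't vanish under the TDR handler */
        sched->ops->timedout_job(job);
        drm_sched_job_put(job);         /* possibly the final put */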

I can tell you how I design it in our Rust jobqueue:
Drivers create jobs, and in submit_job() they pass ownership of the job
to the jobqueue; IOW, after pushing a job, a driver can't access it
anymore. In the run_job() callback, the jobqueue either passes the job
back by value (ownership) or lets the driver borrow the job so that it
can be copied (this is done so that the JQ can hypothetically do
resubmits).

This way there is no need for refcounting (in Rust / jobqueue).

Maybe the core of the problem is not so much the lack of refcounting,
but the lack of ownership rules. Why would the driver even need the job
after it has been pushed? It should be fire-and-forget.

> 
> > > > 
> > > 
> > > And my uneducated guess is that it's happening in amdgpu. It seems a
> > > sched_job lives inside an amdgpu_job. Can the latter be freed at other
> > > places than free_job()?
> > 
> > > > 

[…]

> > > > It basically says to the driver that the job lifetime problems created
> > > > by the scheduler are the driver's problem and need to be worked around
> > > > there.
> > > > 
> > > 
> > > My POV still mostly is that (with the current design) the driver must
> > > not use jobs after free_job() was invoked. And when that happens is
> > > unpredictable.
> > > 
> 
> This is somewhat of an absurd statement from my point of view. I have a
> valid job pointer, then I call another function (see above for an
> example of how drm_sched_start/stop is unsafe) and it disappears behind
> my back.
> 

The statement is absurd because reality (the code) is absurd. We all
are basically Alice in Wonderland, running as fast as we can just to
remain on the same spot ^_^

What I am stating is not that this is *good*, but this is what it
currently is like. Whether we like it or not.

The misunderstanding between you and me might be that, for me, jobs
having to be refcounted is not a reality until it's reflected in code,
documentation and, ideally, the drivers.

>  The safe way to handle this is to take a local reference before
> doing anything that could make it disappear. That is much more
> reasonable than saying, “we have five things you can do in the
> scheduler, and if you do any of them it isn’t safe to touch the job
> afterward.”

Yeah, but that's drm_sched being drm_scheddy. Before I documented it,
there were also these implicit refcounting rules in run_job(), where
the driver needs to take a reference for the scheduler for it to be
non-racy.

It also wasn't documented for a long time that drm_sched (through
spsc_queue) will explode if you don't use entities with a single
producer thread.

drm_sched got here because of gross design mistakes, lots of hacks for
a few drivers and, particularly, a strange aversion [¹] to writing
documentation. If Xe came, back in the day, to the conclusion that job
lifetimes are fundamentally broken and that the objectively correct way
to solve this is refcounting, then why wasn't that pushed into
drm_sched back then?

> 
> > > To be unfair, we already have strange refcount expectations.
> > > But I sort of agree that there is no objectively good solution in
> > > sight.
> > > 
> > > > > 
> > > > > Let me respin my documentation series and upstream that soonish,
> > > > > then we can build on top of that.
> > > 
> > > > > 
> > > > > 
> > > > > P.
> > > > > 
> > > > > > 
> > > > > > 2. Move reference counting to the base DRM scheduler job object, 
> > > > > > provide a
> > > > > >    vfunc for the final job put, and document usage. Medium invasive.
> > > > 
> > > > I strongly think that reference counting the job object just because 
> > > > the scheduler needs it is a bad idea.
> > > > 
> > > > With that we are just moving the hot potato from one side of our mouth 
> > > > to the other without really solving the issue.
> > > > 
> 
> See above—I can’t say I agree with this assessment. I think the lack of
> reference counting is exactly the problem. I don’t really understand the
> pushback on a very well-understood concept (reference counting) in
> Linux. I would sign up to fix the entire subsystem if we go this route.
> 
> > > > If a driver like XE needs that for some reason, then that is perfectly
> > > > fine.
> > > 
> > > Nouveau doesn't need it either.
> > > 
> > > > 
> > > > > > 3. Move job (driver) side tracking to the scheduler fence and let it
> > > > > >    control the lifetime. Very invasive.
> > > > 
> > > > Thinking about it more, that is actually not so much of a problem.
> > > > 
> > > > Let me try to code something together by the end of next week or so.
> > > 
> > > The hero Gotham needs :)
> > > 
> 
> Are you sure about this one? Unless the problems around
> drm_sched_start/stop and free_guilty are fixed, my feeling is that this
> entire thing is still badly broken for anyone who wants to use them.
> 
> To sum up this whole email: I strongly disagree with option #3, but if
> that is the consensus, I will, of course, support the effort.


I would like to discuss those topics with Danilo, too, who returns from
LPC soonish. Also to get some more insights into Nouveau and our use-
cases.

My suggestion is that we pick the conversation up again soonish.
Christmas is around the corner, and I suppose we can't fix all of this
in 2025 anyway, so we might want to give it a fresh re-spin in '26.


Greetings,
P.



[¹] The strangest aversion in drm_sched, however, is the one to
locking. Locks were only taken when *absolutely* necessary. It's as if
the entire code had been designed by some gamers who want to show their
YouTube subscribers that they can get 1 fps more by changing the RAM
timings in the BIOS.
drm_sched might be the component in the kernel using the most
synchronization mechanisms: spinlocks, RCU, atomic integers, atomic
instructions, refcounting, READ_ONCE(), plain lockless accesses… even
Paul McKenney would get wide eyes here. The only things we're still
missing are seqlocks and rw_locks, but maybe we can add those /s

That's likely something we can't get repaired at all anymore.
