On Mon, Dec 08, 2025 at 11:35:33AM +0100, Philipp Stanner wrote:
> On Fri, 2025-12-05 at 09:30 -0800, Matthew Brost wrote:
> > On Fri, Dec 05, 2025 at 11:18:21AM +0100, Christian König wrote:
> > > On 12/1/25 10:04, Philipp Stanner wrote:
> > > > 
> 
> […]
> 
> > > > It eliminates such a race *in amdgpu*. Other drivers might not have
> > > > that problem. I think Intel/Xe is refcounting jobs? Matthew?
> > > > 
> > 
> > We schedule device resets on the same work queue as the job timeout
> > queue, as well as operations like VF migrations, since conceptually they
> > are quite similar to device resets. In both cases, we have the
> > capability to stop all schedulers on the device and prevent any future
> > schedules from being created while these processes are ongoing.
> > 
> > Putting it all together: when either of these device-wide operations is
> > running, we know that no job timeouts will occur and no scheduler
> > operations (e.g., run_job, free_job, etc.) will race. I suggest that all
> > other drivers requiring device-wide operations follow this approach, as
> > it seems to work quite well. In other words, even if free_job is moved
> > to the timeout work queue, I’m fairly certain you would still schedule
> > device-wide operations on the timeout work queue and still stop all
> > schedulers before any device operation touches scheduler or queue state.
> 
> Thx for the explanation.
> 
> As far as I'm aware, everyone concerned already follows the (by now
> officially documented) approach of stopping drm_sched on reset.
> 
> We have never really documented very well when and why one should share
> the timeout_wq, though.
> 

I'm not sure if it's documented. When I started working on Xe, Boris had
recently added the timedout_wq, which, from my understanding, was
intended to schedule all reset-type events (I believe Sima suggested
this to me).

A lot of the things I did early in Xe weren’t the best from an upstream
community perspective. To be honest, I really didn’t know what I was
doing and typed a lot.

> Someone who thinks he understands that really well should document that
> in drm_sched_init_args.
> 

I can take a pass at this. Did your documentation ever land? Building
upon that would be good.

> > 
> > Therefore, I don’t believe Xe actually has a problem here.
> > 
> > > > > > - Matches the logical model: timeout selects guilty job and 
> > > > > > recovery,
> > > > > >   including freeing, is handled on one queue.
> > > > > > 
> > > > > > Cons / considerations:
> > > > > > - For users that don’t provide timeout_wq, free_job moves from the
> > > > > >   per-sched ordered queue to system_wq, which slightly changes
> > 
> > s/system_wq/system_percpu_wq
> > 
> > Ok, the system_percpu_wq doesn't actually work for timeout_wq as that work
> > queue is reclaim unsafe. We probably should just remove that fallback in
> > the scheduler.
> 
> Which one, exactly?
> 

System work queues don’t work for either timedout_wq or submit_wq. In
both cases, we need a workqueue marked WQ_MEM_RECLAIM and must ensure
that no GFP_KERNEL allocations are performed by any work item scheduled
on it, which is impossible to guarantee with any current system
workqueue.

> > 
> > > > > >   scheduling behaviour but keeps correctness.
> > > > > 
> > > > > We should probably avoid that and use a single ordered wq for submit, 
> > > > > timeout, free when the driver doesn't provide one.
> > 
> > Ah, yes agree. I typed the same thing above before reading this.
> 
> Why do we even provide a submit_wq?
> 
> The timeout_wq I still understand somewhat, since it can help ordering
> racing timeouts. But submit?
> 

We could probably get rid of submit_wq, but I don’t see a significant
upside in doing so.

> 
> > 
> 
> […]
> 
> > > > 
> > > > I'm not convinced that this is the right path forward.
> > > > 
> > > > The fundamental issue here is the gross design problem within drm/sched
> > > > that drivers *create* jobs, but the scheduler *frees* them at an
> > > > unpredictable point in time.
> > > 
> > > Yeah, can't agree more!
> > > 
> > 
> > The scheduler doesn’t free jobs; it simply drops a reference count.
> 
> In our discussions, it is astonishing me at times with what naturalness
> you make that statement.
> 
> Who has ever defined, who has ever documented that this is how the
> scheduler is supposed to be used?
> 
> Let's look at the docu:
> 
>  /**
>   * @free_job: Called once the job's finished fence has been signaled
>   * and it's time to clean it up.
>   */
>  void (*free_job)(struct drm_sched_job *sched_job);
> 
> "it's time to clean it up"
> 

Again, this is definitely not documented. I quickly found that if you
directly free the job in free_job() and the job pusher touches it after
the push, things can explode under the right race condition. Thus, I
reasoned that jobs need to be reference-counted. Pretty much every
driver actually does this, with AMD being a notable exception, and
perhaps a few others.
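
For concreteness, the pattern I mean looks roughly like this. It's a
sketch of the idiom, not any driver's actual code; my_job and friends
are made-up names:

struct my_job {
	struct drm_sched_job base;
	struct kref refcount;		/* kref_init() at job creation */
	struct dma_fence *hw_fence;
	/* other driver state: BO list, ring, etc. */
};

static void my_job_release(struct kref *kref)
{
	struct my_job *job = container_of(kref, struct my_job, refcount);

	dma_fence_put(job->hw_fence);
	drm_sched_job_cleanup(&job->base);
	kfree(job);
}

static void my_job_put(struct my_job *job)
{
	kref_put(&job->refcount, my_job_release);
}

static void my_sched_free_job(struct drm_sched_job *sched_job)
{
	/* The scheduler is done with the job; drop only *its* reference.
	 * The pusher keeps its own reference for as long as it needs the
	 * job after drm_sched_entity_push_job(). */
	my_job_put(container_of(sched_job, struct my_job, base));
}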

> 
> > free_job should be renamed, in my opinion, to reflect this. Once that
> > misnomer is fixed, the design actually makes sense.
> > 
> 
> How do you know that this is the design idea? Since when do you know?
> Why was it never documented? It's kind of important.
> 

See above—I reasoned this one on my own and wasn’t a great upstream
citizen with respect to cleaning up DRM scheduler code or documenting
it.

> >  The scheduler holds
> > a reference to the job until its fence signals and until it is removed
> > from internal tracking.
> > 
> 
> Let's look at tho documentation for drm_sched_job:
> 
>  * @entity: the entity to which this job belongs.
>  * @cb: the callback for the parent fence in s_fence.
>  *
>  * A job is created by the driver using drm_sched_job_init(), and
>  * should call drm_sched_entity_push_job() once it wants the scheduler
>  * to schedule the job.
>  */
> 
> Do you see any clue anywhere at all that jobs are supposed to be
> refcounted? Where is the scheduler's reference (refcount++) taken? That
> could maybe even be taken race-free by drm_sched (likely in
> drm_sched_entity_push_job()), but I strongly suspect that what you're
> hinting at is that the driver (i.e., Xe) takes and frees the refcount
> for drm_sched.
> 

No, I agree. This should be documented something like this:

A job is created by the driver using drm_sched_job_init() and should
call drm_sched_entity_push_job() when it wants the scheduler to schedule
the job. Once drm_sched_entity_push_job() is called, the DRM scheduler
owns a driver-side reference to the job, which is released in the
put_job() driver callback.

The above is akin to the fence returned from run_job() transferring a
reference to the DRM scheduler, which it releases when the scheduler
fence is done with the parent. So, the concept was understood—albeit one
that wasn’t documented for a very long time.
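
A sketch of that run_job() convention, reusing the made-up struct
my_job from above (my_ring_submit() is hypothetical as well):

static struct dma_fence *my_sched_run_job(struct drm_sched_job *sched_job)
{
	struct my_job *job = container_of(sched_job, struct my_job, base);
	struct dma_fence *fence;

	fence = my_ring_submit(job);	/* hypothetical HW submission */
	if (IS_ERR(fence))
		return fence;

	/* The driver keeps its own reference in job->hw_fence ... */
	job->hw_fence = fence;

	/* ... and returns an extra one that belongs to the scheduler
	 * from here on, just like the job reference handed over on push. */
	return dma_fence_get(fence);
}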

> 
> >  Transferring ownership via reference counting is
> > actually quite common and well understood. The scheduler takes ownership
> > of a ref when the job is pushed to the scheduler.
> 
> s/takes/should take
> 
> Don't get me wrong, I *think* that your design suggestion to refcount
> jobs is probably (considering the circumstances we got ourselves into)
> correct. I think I would support porting drm_sched to refcounting jobs.
> 
> But your way to formulate it as an absolute truth is plainly wrong. It
> is not the design reality, many drivers don't just drop a refcount in
> free_job; and it's not documented.
> 

If it’s coming off as the absolute truth—that’s not my intent. I’m
trying to help make sense of the DRM scheduler design so that it works
for everyone.

> Moreover, I think if refcounting jobs were a reality free_job() could
> just be deleted very easily, since the scheduler could drop its
> refcount whenever it pleases, without calling back into the driver.
> 

Right now, all the reference counting of jobs is embedded in the
driver-side job objects. However, if we moved the reference count into
the base DRM scheduler object, this would work. Of course, drivers would
need an option to override the destruction function, as the final put on
the object could have other side effects—for example in Xe, putting
other objects and dropping the RPM reference. So, we’d need a vfunc
somewhere, and I’m unsure if surgery across subsystems and drivers is
worth it.
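
Purely as a sketch of what that could look like; none of these members
or callbacks exist in drm_sched today, the names are invented:

struct drm_sched_job {
	struct kref refcount;		/* new: ownership lives in the base class */
	/* ... existing members ... */
};

struct drm_sched_backend_ops {
	/* ... existing callbacks, minus free_job() ... */

	/* Optional override for the final put, for drivers whose
	 * teardown has side effects (e.g. Xe dropping an RPM ref). */
	void (*destroy_job)(struct drm_sched_job *sched_job);
};

static void drm_sched_job_release(struct kref *kref)
{
	struct drm_sched_job *job =
		container_of(kref, struct drm_sched_job, refcount);

	if (job->sched->ops->destroy_job)
		job->sched->ops->destroy_job(job);
	else
		kfree(job);	/* default teardown */
}

void drm_sched_job_put(struct drm_sched_job *job)
{
	kref_put(&job->refcount, drm_sched_job_release);
}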

> > 
> > > > This entire fix idea seems to circle around the concept of relying yet
> > > > again on the scheduler's internal behavior (i.e., when it schedules the
> > > > call to free_job()).
> > > > 
> > > > I think we discussed that at XDC, but how I see it if drivers have
> > > > strange job life time requirements where a job shall outlive
> > > > drm_sched's free_job() call, they must solve that with a proper
> > > > synchronization mechanism.
> > > 
> > > Well that is not correct as far as I can see.
> > > 
> > > The problem here is rather that the scheduler gives the job as parameter 
> > > to the timedout_job() callback, but doesn't guarantee that ->free_job() 
> > > callback isn't called while timedout_job() runs.
> > > 
> > > This should be prevented by removing the job in question from the pending 
> > > list (see drm_sched_job_timedout), but for some reason that doesn't seem 
> > > to work correctly.
> > > 
> > 
> > Are you sure this is happening? It doesn’t seem possible, nor have I
> > observed it.
> 
> It's impossible, isn't it?
> 
> static void drm_sched_job_timedout(struct work_struct *work)
> {
> 	struct drm_gpu_scheduler *sched;
> 	struct drm_sched_job *job;
> 	enum drm_gpu_sched_stat status = DRM_GPU_SCHED_STAT_RESET;
> 
> 	sched = container_of(work, struct drm_gpu_scheduler, work_tdr.work);
> 
> 	/* Protects against concurrent deletion in drm_sched_get_finished_job */
> 	spin_lock(&sched->job_list_lock);
> 	job = list_first_entry_or_null(&sched->pending_list,
> 				       struct drm_sched_job, list);
> 
> 	if (job) {
> 		/*
> 		 * Remove the bad job so it cannot be freed by a concurrent
> 		 * &struct drm_sched_backend_ops.free_job. It will be
> 		 * reinserted after the scheduler's work items have been
> 		 * cancelled, at which point it's safe.
> 		 */
> 		list_del_init(&job->list);
> 		spin_unlock(&sched->job_list_lock);
> 
> 		status = job->sched->ops->timedout_job(job);
> 
> 
>    1. scheduler takes list lock
>    2. removes job from list
>    3. unlocks
>    4. calls timedout_job callback
> 
> 
> How can free_job_work, through drm_sched_get_finished_job(), get and
> free the same job?
> 

It can't.

> The pending_list is probably the one place where we actually lock
> consistently and sanely.
> 
> I think this needs to be debugged more intensively, Christian.
> 
> 
> > 
> > What actually looks like a problem is that in drm_sched_job_timedout,
> > free_job can be called. Look at [2]—if you’re using free_guilty (Xe
> > isn’t, but [2] was Xe trying to do the same thing), this is actually
> > unsafe. The free_guilty code should likely be removed as that definitely
> > can explode under the right conditions.
> 
> I'm right now not even sure why free_guilty exists, but I don't see how

I'm not sure why free_guilty exists either. If the fence is signaled in
timedout_job, free_job will get scheduled on another work item anyway.

> it's illegal for drm_sched to call free_job in drm_sched_job_timedout?
> 
> free_job can be called at any point in time, drivers must expect that.
> No lock is being held, and timedout_job already ran. So what's the
> problem?
> 
> For drivers with additional refcounting it would be even less of a
> problem.
> 

No, the scheduler can still reference the job.

1265         fence = sched->ops->run_job(sched_job);
1266         complete_all(&entity->entity_idle);
1267         drm_sched_fence_scheduled(s_fence, fence);
1268
1269         if (!IS_ERR_OR_NULL(fence)) {
1270                 r = dma_fence_add_callback(fence, &sched_job->cb,
1271                                            drm_sched_job_done_cb);
1272                 if (r == -ENOENT)
1273                         drm_sched_job_done(sched_job, fence->error);
1274                 else if (r)
1275                         DRM_DEV_ERROR(sched->dev, "fence add callback failed (%d)\n", r);
1276
1277                 dma_fence_put(fence);
1278         } else {
1279                 drm_sched_job_done(sched_job, IS_ERR(fence) ?
1280                                    PTR_ERR(fence) : 0);
1281         }
1282
1283         wake_up(&sched->job_scheduled);
1284         drm_sched_run_job_queue(sched);

At line 1269, the run_job work item is interrupted. The timeout work
runs and calls free_job, which performs the final put. Then the run_job
work item resumes, and boom, UAF. By the same reasoning, I think moving
free_job to the timeout work queue could also cause issues.

If the run_job work item took a reference to the job before adding it to
the pending list and dropped it after it was done touching it in this
function, then yes, that would be safe. This is an argument for moving
reference counting into the base DRM scheduler class; it would make
ownership clear rather than relying on ordered work queues to keep
everything safe.
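
Roughly, against the snippet above, with the hypothetical kref helpers
from the earlier sketch (drm_sched_job_get()/drm_sched_job_put() don't
exist today):

	/* Pin the job for this work item (ideally taken before the job
	 * ever lands on the pending list). */
	drm_sched_job_get(sched_job);

	fence = sched->ops->run_job(sched_job);
	complete_all(&entity->entity_idle);
	drm_sched_fence_scheduled(s_fence, fence);

	/* ... unchanged fence/callback handling from the snippet above ... */

	wake_up(&sched->job_scheduled);
	drm_sched_run_job_queue(sched);

	/* This function is done touching the job; drop the work item's
	 * reference. A concurrent timeout path doing the final put can
	 * no longer free the memory out from under us. */
	drm_sched_job_put(sched_job);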

> > 
> > [2] git format-patch -1 ea2f6a77d0c40
> > 
> > > > The first question would be: what does amdgpu need the job for after
> > > > free_job() ran? What do you even need a job for still after there was a
> > > > timeout?
> > > 
> > > No, we just need the job structure alive as long as the timedout_job() 
> > > callback is running.
> > > 
> > 
> > Yes, I agree.
> 
> As far as I can see that's how it's already implemented? No one can
> free that job while timedout_job() is running in
> drm_sched_job_timedout().
>

See above; free_guilty is still problematic.
 
> > 
> > > > And if you really still need its contents, can't you memcpy() the job
> > > > or something?
> > > > 
> > > > Assuming that it really needs it and that that cannot easily be solved,
> > > > I suppose the obvious answer for differing memory life times is
> > > > refcounting. So amdgpu could just let drm_sched drop its reference in
> > > > free_job(), and from then onward it's amdgpu's problem.
> > > > 
> > > > I hope Matthew can educate us on how Xe does it.
> > > 
> > > We discussed this on XDC and it was Matthew who brought up that this can 
> > > be solved by running timeout and free worker on the same single threaded 
> > > wq.
> > > 
> > 
> > No, see my explainations above. This is not my suggestion.
> > 
> > > > 
> > > > AFAIK Nouveau doesn't have that problem, because on timeout we just
> > > > terminate the channel.
> > > > 
> > > > Would also be interesting to hear whether other driver folks have the
> > > > problem of free_job() being racy.
> > > 
> > > I think that this is still a general problem with the drm scheduler and 
> > > not driver specific at all.
> > > 
> > 
> > Maybe the free_guilty is likely a scheduler problem but I'm not seeing
> > an issue aside from that.
> 
> I also can't see the bug. I fail to see how drm_sched can free a job
> that's currently in use in timedout_job(). If that can happen,
> Christian, Vitaly, please point us to where and how. Only then can we
> decide on how to fix it properly.
> 

Again see above.

Matt

> 
> P.
> 
> 
