On Mon, Dec 08, 2025 at 10:58:42AM -0800, Matthew Brost wrote:
> On Mon, Dec 08, 2025 at 11:35:33AM +0100, Philipp Stanner wrote:
> > On Fri, 2025-12-05 at 09:30 -0800, Matthew Brost wrote:
> > > On Fri, Dec 05, 2025 at 11:18:21AM +0100, Christian König wrote:
> > > > On 12/1/25 10:04, Philipp Stanner wrote:
> > > > > 
> > 
> > […]
> > 
> > > > > It eliminates such a race *in amdgpu*. Other drivers might not have
> > > > > that problem. I think Intel/Xe is refcounting jobs? Matthew?
> > > > > 
> > > 
> > > We schedule device resets on the same work queue as the job timeout
> > > queue, as well as operations like VF migrations, since conceptually they
> > > are quite similar to device resets. In both cases, we have the
> > > capability to stop all schedulers on the device and prevent any future
> > > schedules from being created while these processes are ongoing.
> > > 
> > > Putting it all together: when either of these device-wide operations is
> > > running, we know that no job timeouts will occur and no scheduler
> > > operations (e.g., run_job, free_job, etc.) will race. I suggest that all
> > > other drivers requiring device-wide operations follow this approach, as
> > > it seems to work quite well. In other words, even if free_job is moved
> > > to the timeout work queue, I’m fairly certain you would still schedule
> > > device-wide operations on the timeout work queue and still stop all
> > > schedulers before any device operation touches scheduler or queue state.
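> > > 
> > > As a rough sketch of that pattern (purely illustrative; the xe_device
> > > fields and do_device_reset() are made up, and I'm assuming the
> > > drm_sched_wqueue_stop()/drm_sched_wqueue_start() helpers here):
> > > 
> > > #include <drm/gpu_scheduler.h>
> > > #include <linux/workqueue.h>
> > > 
> > > /* device-wide reset work, queued on the same ordered wq as job timeouts */
> > > static void xe_device_reset_work(struct work_struct *w)
> > > {
> > > 	struct xe_device *xe = container_of(w, struct xe_device, reset_work);
> > > 	int i;
> > > 
> > > 	/* no timeouts or run_job/free_job can race while these are stopped */
> > > 	for (i = 0; i < xe->num_scheds; i++)
> > > 		drm_sched_wqueue_stop(xe->scheds[i]);
> > > 
> > > 	do_device_reset(xe);
> > > 
> > > 	for (i = 0; i < xe->num_scheds; i++)
> > > 		drm_sched_wqueue_start(xe->scheds[i]);
> > > }
> > > 
> > > /* queued with queue_work(xe->timeout_wq, &xe->reset_work) by whoever
> > >  * decides a reset (or VF migration) is needed */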
> > 
> > Thx for the explanation.
> > 
> > As far as I'm aware, everyone concerned already follows the (by now
> > officially documented) approach of stopping drm_sched on reset.
> > 
> > We have never really documented very well when and why one should share
> > the timeout_wq, though.
> > 
> 
> I'm not sure if it's documented. When I started working on Xe, Boris had
> recently added the timedout_wq, which, from my understanding, was
> intended to schedule all reset-type events (I believe Sima suggested
> this to me).
> 
> A lot of the things I did early in Xe weren’t the best from an upstream
> community perspective. To be honest, I really didn’t know what I was
> doing and typed a lot.
> 
> > Someone who thinks he understands that really well should document that
> > in drm_sched_init_args.
> > 
> 
> I can take a pass at this. Did your documentation ever land? Building
> upon that would be good.
> 
> > > 
> > > Therefore, I don’t believe Xe actually has a problem here.
> > > 
> > > > > > > - Matches the logical model: timeout selects guilty job and 
> > > > > > > recovery,
> > > > > > >   including freeing, is handled on one queue.
> > > > > > > 
> > > > > > > Cons / considerations:
> > > > > > > - For users that don’t provide timeout_wq, free_job moves from the
> > > > > > >   per-sched ordered queue to system_wq, which slightly changes
> > > 
> > > s/system_wq/system_percpu_wq
> > > 
> > > Ok, the system_percpu_wq doesn't actually work for timeout_wq, as that
> > > work queue is reclaim unsafe. We should probably just remove that
> > > fallback in the scheduler.
> > 
> > Which one, exactly?
> > 
> 
> System work queues don’t work for either timedout_wq or submit_wq. In
> both cases, we need the workqueue marked with WQ_RECLAIM and must ensure
> that no GFP_KERNEL allocations are performed by any work item scheduled
> on the workqueue, which is impossible to guarantee with any current
> system workqueue.
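> 
> Something like this is what I have in mind (sketch only, assuming the
> drm_sched_init_args interface mentioned above; names and values are made
> up):
> 
> #include <drm/drm_device.h>
> #include <drm/gpu_scheduler.h>
> #include <linux/workqueue.h>
> 
> static int drv_sched_setup(struct drm_device *drm,
> 			   struct drm_gpu_scheduler *sched,
> 			   const struct drm_sched_backend_ops *ops)
> {
> 	/* one ordered, reclaim-safe wq shared by submit and timeout work */
> 	struct workqueue_struct *wq =
> 		alloc_ordered_workqueue("drv-sched", WQ_MEM_RECLAIM);
> 	struct drm_sched_init_args args = {
> 		.ops = ops,
> 		.submit_wq = wq,
> 		.timeout_wq = wq,
> 		.timeout = msecs_to_jiffies(5000),
> 		.credit_limit = 64,
> 		.num_rqs = DRM_SCHED_PRIORITY_COUNT,
> 		.name = "drv-ring0",
> 		.dev = drm->dev,
> 	};
> 
> 	if (!wq)
> 		return -ENOMEM;
> 
> 	return drm_sched_init(sched, &args);
> }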
> 
> > > 
> > > > > > >   scheduling behaviour but keeps correctness.
> > > > > > 
> > > > > > We should probably avoid that and use a single ordered wq for 
> > > > > > submit, timeout, free when the driver doesn't provide one.
> > > 
> > > Ah, yes agree. I typed the same thing above before reading this.
> > 
> > Why do we even provide a submit_wq?
> > 
> > The timeout_wq I still understand somewhat, since it can help ordering
> > racing timeouts. But submit?
> > 
> 
> We could probably get rid of submit_wq, but I don’t see a significant
> upside in doing so.
> 
> > 
> > > 
> > 
> > […]
> > 
> > > > > 
> > > > > I'm not convinced that this is the right path forward.
> > > > > 
> > > > > The fundamental issue here is the gross design problem within 
> > > > > drm/sched
> > > > > that drivers *create* jobs, but the scheduler *frees* them at an
> > > > > unpredictable point in time.
> > > > 
> > > > Yeah, can't agree more!
> > > > 
> > > 
> > > The scheduler doesn’t free jobs; it simply drops a reference count.
> > 
> > In our discussions, it astonishes me at times with what naturalness you
> > make that statement.
> > 
> > Who has ever defined, who has ever documented that this is how the
> > scheduler is supposed to be used?
> > 
> > Let's look at the docu:
> > 
> >  /**
> >   * @free_job: Called once the job's finished fence has been signaled
> >   * and it's time to clean it up.
> >   */
> >  void (*free_job)(struct drm_sched_job *sched_job);
> > 
> > "it's time to clean it up"
> > 
> 
> Again, this is definitely not documented. I quickly found that if you
> directly free jobs in free_job(), and the job pusher touches it after
> the push, things can explode with the right race condition. Thus, I
> reasoned that jobs need to be reference-counted. Pretty much every
> driver actually does this, with AMD being a notable exception, and
> perhaps a few others.
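> 
> i.e., on the driver side, something roughly like this (sketch; the
> drv_job naming and the drv_job_put() helper are made up, not Xe's actual
> code):
> 
> #include <drm/gpu_scheduler.h>
> #include <linux/kref.h>
> 
> struct drv_job {
> 	struct drm_sched_job base;
> 	struct kref refcount;	/* covers pusher + scheduler references */
> 	/* ... driver state ... */
> };
> 
> /* submission path */
> static void drv_submit(struct drv_job *job)
> {
> 	drm_sched_job_arm(&job->base);
> 	kref_get(&job->refcount);	/* reference handed over to the scheduler */
> 	drm_sched_entity_push_job(&job->base);
> 	/* job may still be touched here, the pusher holds its own reference */
> 	drv_job_put(job);
> }
> 
> /* scheduler side */
> static void drv_free_job(struct drm_sched_job *sched_job)
> {
> 	struct drv_job *job = container_of(sched_job, struct drv_job, base);
> 
> 	drm_sched_job_cleanup(sched_job);
> 	drv_job_put(job);	/* drop the scheduler's reference, not a direct kfree */
> }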
> 
> > 
> > > free_job should be renamed, in my opinion, to reflect this. Once that
> > > misnomer is fixed, the design actually makes sense.
> > > 
> > 
> > How do you know that this is the design idea? Since when do you know?
> > Why was it never documented? It's kind of important.
> > 
> 
> See above—I reasoned this one on my own and wasn’t a great upstream
> citizen with respect to cleaning up DRM scheduler code or documenting
> it.
> 
> > >  The scheduler holds
> > > a reference to the job until its fence signals and until it is removed
> > > from internal tracking.
> > > 
> > 
> > Let's look at tho documentation for drm_sched_job:
> > 
> >  * @entity: the entity to which this job belongs.
> >  * @cb: the callback for the parent fence in s_fence.
> >  *
> >  * A job is created by the driver using drm_sched_job_init(), and
> >  * should call drm_sched_entity_push_job() once it wants the scheduler
> >  * to schedule the job.
> >  */
> > 
> > Do you see any clue anywhere at all that jobs are supposed to be
> > refcounted? Where is the scheduler's reference (refcount++) taken? That
> > could maybe even be taken race-free by drm_sched (likely in
> > drm_sched_entity_push_job()), but I strongly suspect that what you're
> > hinting at is that the driver (i.e., Xe) takes and frees the refcount
> > for drm_sched.
> > 
> 
> No, agreed. This should be documented something like this:
> 
> A job is created by the driver using drm_sched_job_init() and should
> call drm_sched_entity_push_job() when it wants the scheduler to schedule
> the job. Once drm_sched_entity_push_job() is called, the DRM scheduler
> owns a driver-side reference to the job, which is released in the
> put_job() driver callback.
> 
> The above is akin to the fence returned from run_job() transferring a
> reference to the DRM scheduler, which it releases when the scheduler
> fence is done with the parent. So, the concept was understood—albeit one
> that wasn’t documented for a very long time.
> 
> > 
> > >  Transferring ownership via reference counting is
> > > actually quite common and well understood. The scheduler takes ownership
> > > of a ref when the job is pushed to the scheduler.
> > 
> > s/takes/should take
> > 
> > Don't get me wrong, I *think* that your design suggestion to refcount
> > jobs is probably (considering the circumstances we got ourselves into)
> > correct. I think I would support porting drm_sched to refcounting jobs.
> > 
> > But your way of formulating it as an absolute truth is plainly wrong. It
> > is not the design reality, many drivers don't just drop a refcount in
> > free_job; and it's not documented.
> > 
> 
> If it’s coming off as the absolute truth—that’s not my intent. I’m
> trying to help make sense of the DRM scheduler design so that it works
> for everyone.
> 
> > Moreover, I think if refcounting jobs were a reality, free_job() could
> > just be deleted very easily, since the scheduler could drop its
> > refcount whenever it pleases, without calling back into the driver.
> > 
> 
> Right now, all the reference counting of jobs is embedded in the
> driver-side job objects. However, if we moved the reference count into
> the base DRM scheduler object, this would work. Of course, drivers would
> need an option to override the destruction function, as the final put on
> the object could have other side effects—for example in Xe, putting
> other objects and dropping the RPM reference. So, we’d need a vfunc
> somewhere, and I’m unsure if surgery across subsystems and drivers is
> worth it.
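> 
> Sketching what that could look like (purely hypothetical; neither the
> refcount field nor release_job()/drm_sched_job_put() exist today):
> 
> struct drm_sched_job {
> 	struct kref refcount;		/* new: scheduler-owned refcounting */
> 	/* ... existing fields ... */
> };
> 
> /* in drm_sched_backend_ops, replacing the misnamed free_job() */
> 	/**
> 	 * @release_job: called on the final put of the job; the driver frees
> 	 * the job here and may drop whatever else is tied to its lifetime
> 	 * (BOs, RPM references, ...).
> 	 */
> 	void (*release_job)(struct drm_sched_job *sched_job);
> 
> static void drm_sched_job_release(struct kref *ref)
> {
> 	struct drm_sched_job *job =
> 		container_of(ref, struct drm_sched_job, refcount);
> 
> 	job->sched->ops->release_job(job);
> }
> 
> void drm_sched_job_put(struct drm_sched_job *job)
> {
> 	kref_put(&job->refcount, drm_sched_job_release);
> }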
> 
> > > 
> > > > > This entire fix idea seems to circle around the concept of relying yet
> > > > > again on the scheduler's internal behavior (i.e., when it schedules 
> > > > > the
> > > > > call to free_job()).
> > > > > 
> > > > > I think we discussed that at XDC, but how I see it if drivers have
> > > > > strange job life time requirements where a job shall outlive
> > > > > drm_sched's free_job() call, they must solve that with a proper
> > > > > synchronization mechanism.
> > > > 
> > > > Well that is not correct as far as I can see.
> > > > 
> > > > The problem here is rather that the scheduler gives the job as 
> > > > parameter to the timedout_job() callback, but doesn't guarantee that 
> > > > ->free_job() callback isn't called while timedout_job() runs.
> > > > 
> > > > This should be prevented by removing the job in question from the 
> > > > pending list (see drm_sched_job_timedout), but for some reason that 
> > > > doesn't seem to work correctly.
> > > > 
> > > 
> > > Are you sure this is happening? It doesn’t seem possible, nor have I
> > > observed it.
> > 
> > It's impossible, isn't it?
> > 
> > static void drm_sched_job_timedout(struct work_struct *work)
> > {
> > 	struct drm_gpu_scheduler *sched;
> > 	struct drm_sched_job *job;
> > 	enum drm_gpu_sched_stat status = DRM_GPU_SCHED_STAT_RESET;
> > 
> > 	sched = container_of(work, struct drm_gpu_scheduler, work_tdr.work);
> > 
> > 	/* Protects against concurrent deletion in drm_sched_get_finished_job */
> > 	spin_lock(&sched->job_list_lock);
> > 	job = list_first_entry_or_null(&sched->pending_list,
> > 				       struct drm_sched_job, list);
> > 
> > 	if (job) {
> > 		/*
> > 		 * Remove the bad job so it cannot be freed by a concurrent
> > 		 * &struct drm_sched_backend_ops.free_job. It will be
> > 		 * reinserted after the scheduler's work items have been
> > 		 * cancelled, at which point it's safe.
> > 		 */
> > 		list_del_init(&job->list);
> > 		spin_unlock(&sched->job_list_lock);
> > 
> > 		status = job->sched->ops->timedout_job(job);
> > 
> > 
> >    1. scheduler takes list lock
> >    2. removes job from list
> >    3. unlocks
> >    4. calls timedout_job callback
> > 
> > 
> > How can free_job_work, through drm_sched_get_finished_job(), get and
> > free the same job?
> > 
> 
> It can't.
> 
> > The pending_list is probably the one place where we actually lock
> > consistently and sanely.
> > 
> > I think this needs to be debugged more intensively, Christian.
> > 
> > 
> > > 
> > > What actually looks like a problem is that in drm_sched_job_timedout,
> > > free_job can be called. Look at [2]—if you’re using free_guilty (Xe
> > > isn’t, but [2] was Xe trying to do the same thing), this is actually
> > > unsafe. The free_guilty code should likely be removed as that definitely
> > > can explode under the right conditions.
> > 
> > I'm right now not even sure why free_guilty exists, but I don't see how
> 
> I'm not sure why free_guilty exists either. If the fence is signaled in
> timedout_job, free_job will get scheduled on another work item.
> 
> > it's illegal for drm_sched to call free_job in drm_sched_job_timedout?
> > 
> > free_job can be called at any point in time, drivers must expect that.
> > No lock is being held, and timedout_job already ran. So what's the
> > problem?
> > 
> > For drivers with additional refcounting it would be even less of a
> > problem.
> > 
> 
> No, the scheduler can still reference the job.
> 
> 1265         fence = sched->ops->run_job(sched_job);
> 1266         complete_all(&entity->entity_idle);
> 1267         drm_sched_fence_scheduled(s_fence, fence);
> 1268
> 1269         if (!IS_ERR_OR_NULL(fence)) {
> 1270                 r = dma_fence_add_callback(fence, &sched_job->cb,
> 1271                                            drm_sched_job_done_cb);
> 1272                 if (r == -ENOENT)
> 1273                         drm_sched_job_done(sched_job, fence->error);
> 1274                 else if (r)
> 1275                         DRM_DEV_ERROR(sched->dev, "fence add callback failed (%d)\n", r);
> 1276
> 1277                 dma_fence_put(fence);
> 1278         } else {
> 1279                 drm_sched_job_done(sched_job, IS_ERR(fence) ?
> 1280                                    PTR_ERR(fence) : 0);
> 1281         }
> 1282
> 1283         wake_up(&sched->job_scheduled);
> 1284         drm_sched_run_job_queue(sched);
> 
> At line 1269, the run_job work item is interrupted. The timeout work
> runs and calls free_job, which performs the final put. Then the run_job
> work item resumes—and boom, UAF. Using the same reasoning, I think
> moving free_job to the timed-out work queue could also cause issues.
> 
> If the run_job work item took a reference to the job before adding it to
> the pending list and dropped it once it was done touching it in this
> function, then yes, that would be safe. This is an argument for moving
> reference counting into the base DRM scheduler class; it would make

typo: s/DRM scheduler class/DRM job class

Matt

> ownership clear rather than relying on ordered work queues to keep
> everything safe.
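> 
> Roughly, in drm_sched_run_job_work() (hypothetical drm_sched_job_get() /
> drm_sched_job_put(), assuming the refcount moved into the base job as
> sketched above):
> 
> 	/* reference taken before drm_sched_job_begin() puts the job on the
> 	 * pending list */
> 	drm_sched_job_get(sched_job);
> 
> 	fence = sched->ops->run_job(sched_job);
> 	complete_all(&entity->entity_idle);
> 	drm_sched_fence_scheduled(s_fence, fence);
> 
> 	if (!IS_ERR_OR_NULL(fence)) {
> 		r = dma_fence_add_callback(fence, &sched_job->cb,
> 					   drm_sched_job_done_cb);
> 		if (r == -ENOENT)
> 			drm_sched_job_done(sched_job, fence->error);
> 		else if (r)
> 			DRM_DEV_ERROR(sched->dev,
> 				      "fence add callback failed (%d)\n", r);
> 
> 		dma_fence_put(fence);
> 	} else {
> 		drm_sched_job_done(sched_job, IS_ERR(fence) ?
> 				   PTR_ERR(fence) : 0);
> 	}
> 
> 	/* safe even if a timeout-path free already dropped the last driver ref */
> 	drm_sched_job_put(sched_job);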
> 
> > > 
> > > [2] git format-patch -1 ea2f6a77d0c40
> > > 
> > > > > The first question would be: what does amdgpu need the job for after
> > > > > free_job() ran? What do you even need a job for still after there was 
> > > > > a
> > > > > timeout?
> > > > 
> > > > No, we just need the job structure alive as long as the timedout_job() 
> > > > callback is running.
> > > > 
> > > 
> > > Yes, I agree.
> > 
> > As far as I can see that's how it's already implemented? No one can
> > free that job while timedout_job() is running in
> > drm_sched_job_timedout().
> >
> 
> See above, free_guilty is still problematic.
>  
> > > 
> > > > > And if you really still need its contents, can't you memcpy() the job
> > > > > or something?
> > > > > 
> > > > > Assuming that it really needs it and that that cannot easily be 
> > > > > solved,
> > > > > I suppose the obvious answer for differing memory life times is
> > > > > refcounting. So amdgpu could just let drm_sched drop its reference in
> > > > > free_job(), and from then onward it's amdgpu's problem.
> > > > > 
> > > > > I hope Matthew can educate us on how Xe does it.
> > > > 
> > > > We discussed this on XDC and it was Matthew who brought up that this 
> > > > can be solved by running timeout and free worker on the same single 
> > > > threaded wq.
> > > > 
> > > 
> > > No, see my explanations above. This is not my suggestion.
> > > 
> > > > > 
> > > > > AFAIK Nouveau doesn't have that problem, because on timeout we just
> > > > > terminate the channel.
> > > > > 
> > > > > Would also be interesting to hear whether other driver folks have the
> > > > > problem of free_job() being racy.
> > > > 
> > > > I think that this is still a general problem with the drm scheduler and 
> > > > not driver specific at all.
> > > > 
> > > 
> > > Maybe the free_guilty path is a scheduler problem, but I'm not seeing
> > > an issue aside from that.
> > 
> > I also can't see the bug. I fail to see how drm_sched can free a job
> > that's currently in use in timedout_job(). If that can happen,
> > Christian, Vitaly, please point us to where and how. Only then can we
> > decide on how to fix it properly.
> > 
> 
> Again see above.
> 
> Matt
> 
> > 
> > P.
> > 
> > 
