On Fri, 2025-12-05 at 09:30 -0800, Matthew Brost wrote:
> On Fri, Dec 05, 2025 at 11:18:21AM +0100, Christian König wrote:
> > On 12/1/25 10:04, Philipp Stanner wrote:
> > >
[…]
> > > It eliminates such a race *in amdgpu*. Other drivers might not have
> > > that problem. I think Intel/Xe is refcounting jobs? Matthew?
> > >
>
> We schedule device resets on the same work queue as the job timeout
> queue, as well as operations like VF migrations, since conceptually they
> are quite similar to device resets. In both cases, we have the
> capability to stop all schedulers on the device and prevent any future
> schedules from being created while these processes are ongoing.
>
> Putting it all together: when either of these device-wide operations is
> running, we know that no job timeouts will occur and no scheduler
> operations (e.g., run_job, free_job, etc.) will race. I suggest that all
> other drivers requiring device-wide operations follow this approach, as
> it seems to work quite well. In other words, even if free_job is moved
> to the timeout work queue, I’m fairly certain you would still schedule
> device-wide operations on the timeout work queue and still stop all
> schedulers before any device operation touches scheduler or queue state.
Thx for the explanation.
As far as I'm aware, everyone concerned already follows the (by now
officially documented) approach of stopping drm_sched on reset.
We have never really documented very well when and why one should share
the timeout_wq, though.
Someone who thinks he understands that really well should document that
in drm_sched_init_args.
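To make that concrete, here is roughly the pattern I understand Matthew to be
describing, and which such documentation should spell out. Untested sketch; all
xyz_* names are made up, only the drm_sched and workqueue APIs are real, and
the drm_sched_init_args fields are as I remember them:

#include <linux/jiffies.h>
#include <linux/workqueue.h>
#include <drm/gpu_scheduler.h>

/* run_job / timedout_job / free_job implementations not shown */
extern const struct drm_sched_backend_ops xyz_sched_ops;

struct xyz_device {
        struct workqueue_struct *reset_wq;      /* ordered, reclaim safe */
        struct work_struct device_reset_work;
        struct drm_gpu_scheduler sched;
};

static int xyz_sched_init(struct xyz_device *xyz)
{
        struct drm_sched_init_args args = {
                .ops     = &xyz_sched_ops,
                .timeout = msecs_to_jiffies(5000),
                .name    = "xyz-sched",
                /* num_rqs, credit_limit, dev, ... omitted */
        };

        xyz->reset_wq = alloc_ordered_workqueue("xyz-reset", WQ_MEM_RECLAIM);
        if (!xyz->reset_wq)
                return -ENOMEM;

        /* Job timeouts and device-wide recovery share one ordered wq. */
        args.timeout_wq = xyz->reset_wq;

        return drm_sched_init(&xyz->sched, &args);
}

/*
 * A device reset queued on the same ordered wq can never run concurrently
 * with a timeout handler; the reset work stops all schedulers before
 * touching scheduler or queue state.
 */
static void xyz_schedule_device_reset(struct xyz_device *xyz)
{
        queue_work(xyz->reset_wq, &xyz->device_reset_work);
}

The point being: because the wq is ordered, the reset work and every timeout
handler are mutually exclusive by construction.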
>
> Therefore, I don’t believe Xe actually has a problem here.
>
> > > > > - Matches the logical model: timeout selects guilty job and recovery,
> > > > > including freeing, is handled on one queue.
> > > > >
> > > > > Cons / considerations:
> > > > > - For users that don’t provide timeout_wq, free_job moves from the
> > > > > per-sched ordered queue to system_wq, which slightly changes
>
> s/system_wq/system_percpu_wq
>
> Ok, the system_percpu_wq doesn't actually work for timeout_wq, as that
> work queue is reclaim unsafe. We probably should just remove that
> fallback in the scheduler.
Which one, exactly?
>
> > > > > scheduling behaviour but keeps correctness.
> > > >
> > > > We should probably avoid that and use a single ordered wq for submit,
> > > > timeout, free when the driver doesn't provide one.
>
> Ah, yes agree. I typed the same thing above before reading this.
Why do we even provide a submit_wq?
The timeout_wq I still understand somewhat, since it can help ordering
racing timeouts. But submit?
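Anyway, if we keep both, removing the reclaim-unsafe fallback Matthew mentions
and falling back to the scheduler's own ordered wq could look roughly like this
in drm_sched_init(). Paraphrased from memory, not a verbatim diff:

        /* Existing behavior, paraphrased: the submit wq is already an
         * ordered, reclaim-safe queue whenever the scheduler allocates
         * it itself. */
        if (args->submit_wq) {
                sched->submit_wq = args->submit_wq;
                sched->own_submit_wq = false;
        } else {
                sched->submit_wq = alloc_ordered_workqueue("%s", WQ_MEM_RECLAIM,
                                                           args->name);
                if (!sched->submit_wq)
                        return -ENOMEM;
                sched->own_submit_wq = true;
        }

        /* Proposed change: instead of falling back to the reclaim-unsafe
         * system_percpu_wq, reuse the submit wq for timeouts, so submit,
         * timeout and free work end up serialized on one ordered queue
         * when the driver doesn't provide its own. */
        sched->timeout_wq = args->timeout_wq ?: sched->submit_wq;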
>
[…]
> > >
> > > I'm not convinced that this is the right path forward.
> > >
> > > The fundamental issue here is the gross design problem within drm/sched
> > > that drivers *create* jobs, but the scheduler *frees* them at an
> > > unpredictable point in time.
> >
> > Yeah, can't agree more!
> >
>
> The scheduler doesn’t free jobs; it simply drops a reference count.
In our discussions, it sometimes astonishes me how naturally you make that
statement.
Who has ever defined, who has ever documented that this is how the
scheduler is supposed to be used?
Let's look at the docu:
/**
 * @free_job: Called once the job's finished fence has been signaled
 * and it's time to clean it up.
 */
void (*free_job)(struct drm_sched_job *sched_job);
"it's time to clean it up"
> free_job should be renamed, in my opinion, to reflect this. Once that
> misnomer is fixed, the design actually makes sense.
>
How do you know that this is the design idea? Since when do you know?
Why was it never documented? It's kind of important.
> The scheduler holds
> a reference to the job until its fence signals and until it is removed
> from internal tracking.
>
Let's look at the documentation for drm_sched_job:
* @entity: the entity to which this job belongs.
* @cb: the callback for the parent fence in s_fence.
*
* A job is created by the driver using drm_sched_job_init(), and
* should call drm_sched_entity_push_job() once it wants the scheduler
* to schedule the job.
*/
Do you see any clue anywhere at all that jobs are supposed to be
refcounted? Where is the scheduler's reference (refcount++) taken? That
could maybe even be taken race-free by drm_sched (likely in
drm_sched_entity_push_job()), but I strongly suspect that what you're
hinting at is that the driver (i.e., Xe) takes and drops that reference
on drm_sched's behalf.
> Transferring ownership via reference counting is
> actually quite common and well understood. The scheduler takes ownership
> of a ref when the job is pushed to the scheduler.
s/takes/should take
Don't get me wrong, I *think* that your design suggestion to refcount
jobs is probably (considering the circumstances we got ourselves into)
correct. I think I would support porting drm_sched to refcounting jobs.
But your way of formulating it as an absolute truth is plainly wrong. It
is not the design reality: many drivers don't just drop a refcount in
free_job, and it's not documented.
Moreover, I think if refcounting jobs were a reality, free_job() could
just be deleted very easily, since the scheduler could drop its
refcount whenever it pleases, without calling back into the driver.
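To spell out what I mean, a rough sketch of that refcounted model below. All
xyz_* names are made up, and this is explicitly not how drm_sched works today:

#include <linux/container_of.h>
#include <linux/kref.h>
#include <linux/slab.h>
#include <drm/gpu_scheduler.h>

struct xyz_job {
        struct drm_sched_job base;
        struct kref refcount;   /* kref_init() at job creation */
        /* driver payload ... */
};

static void xyz_job_release(struct kref *kref)
{
        struct xyz_job *job = container_of(kref, struct xyz_job, refcount);

        drm_sched_job_cleanup(&job->base);
        kfree(job);
}

/* The driver takes one reference on behalf of the scheduler when
 * pushing the job: */
static void xyz_job_push(struct xyz_job *job)
{
        kref_get(&job->refcount);
        drm_sched_entity_push_job(&job->base);
}

/* ->free_job() degenerates to dropping that reference, something the
 * scheduler could do itself if the refcount lived in drm_sched_job: */
static void xyz_free_job(struct drm_sched_job *sched_job)
{
        struct xyz_job *job = container_of(sched_job, struct xyz_job, base);

        kref_put(&job->refcount, xyz_job_release);
}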
>
> > > This entire fix idea seems to circle around the concept of relying yet
> > > again on the scheduler's internal behavior (i.e., when it schedules the
> > > call to free_job()).
> > >
> > > I think we discussed that at XDC, but how I see it if drivers have
> > > strange job life time requirements where a job shall outlive
> > > drm_sched's free_job() call, they must solve that with a proper
> > > synchronization mechanism.
> >
> > Well that is not correct as far as I can see.
> >
> > The problem here is rather that the scheduler gives the job as parameter to
> > the timedout_job() callback, but doesn't guarantee that ->free_job()
> > callback isn't called while timedout_job() runs.
> >
> > This should be prevented by removing the job in question from the pending
> > list (see drm_sched_job_timedout), but for some reason that doesn't seem to
> > work correctly.
> >
>
> Are you sure this is happening? It doesn’t seem possible, nor have I
> observed it.
It's impossible, isn't it?
static void drm_sched_job_timedout(struct work_struct *work)
{
        struct drm_gpu_scheduler *sched;
        struct drm_sched_job *job;
        enum drm_gpu_sched_stat status = DRM_GPU_SCHED_STAT_RESET;

        sched = container_of(work, struct drm_gpu_scheduler, work_tdr.work);

        /* Protects against concurrent deletion in drm_sched_get_finished_job */
        spin_lock(&sched->job_list_lock);
        job = list_first_entry_or_null(&sched->pending_list,
                                       struct drm_sched_job, list);

        if (job) {
                /*
                 * Remove the bad job so it cannot be freed by a concurrent
                 * &struct drm_sched_backend_ops.free_job. It will be
                 * reinserted after the scheduler's work items have been
                 * cancelled, at which point it's safe.
                 */
                list_del_init(&job->list);
                spin_unlock(&sched->job_list_lock);

                status = job->sched->ops->timedout_job(job);
1. scheduler takes list lock
2. removes job from list
3. unlocks
4. calls timedout_job callback
How can free_job_work, through drm_sched_get_finished_job(), get and
free the same job?
The pending_list is probably the one place where we actually lock
consistently and sanely.
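For completeness, the free side takes the same lock and only ever picks up a
job that is still on the pending_list. Condensed paraphrase from memory, not a
verbatim quote:

static struct drm_sched_job *
drm_sched_get_finished_job(struct drm_gpu_scheduler *sched)
{
        struct drm_sched_job *job;

        spin_lock(&sched->job_list_lock);

        job = list_first_entry_or_null(&sched->pending_list,
                                       struct drm_sched_job, list);
        if (job && dma_fence_is_signaled(&job->s_fence->finished)) {
                /* remove job from pending_list */
                list_del_init(&job->list);
                /* ... rearm the timeout for the next job, etc. ... */
        } else {
                job = NULL;
        }

        spin_unlock(&sched->job_list_lock);

        return job;
}

So once the timeout handler has taken the bad job off the pending_list, the
free worker simply cannot see it anymore.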
I think this needs to be debugged more intensively, Christian.
>
> What actually looks like a problem is that in drm_sched_job_timedout,
> free_job can be called. Look at [2]—if you’re using free_guilty (Xe
> isn’t, but [2] was Xe trying to do the same thing), this is actually
> unsafe. The free_guilty code should likely be removed as that definitely
> can explode under the right conditions.
Right now I'm not even sure why free_guilty exists, but I don't see how
it would be illegal for drm_sched to call free_job in drm_sched_job_timedout.
free_job can be called at any point in time, drivers must expect that.
No lock is being held, and timedout_job already ran. So what's the
problem?
For drivers with additional refcounting it would be even less of a
problem.
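For reference, these are, as far as I remember, the two places involved
(paraphrased, not verbatim):

        /* In drm_sched_stop(), for a job whose hardware fence already
         * signaled (the fence callback could not be removed anymore): */
        if (bad != s_job)
                sched->ops->free_job(s_job);
        else
                sched->free_guilty = true;     /* keep the bad job alive */

        /* ... and back in drm_sched_job_timedout(), after the
         * ->timedout_job() callback has returned: */
        if (sched->free_guilty) {
                job->sched->ops->free_job(job);
                sched->free_guilty = false;
        }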
>
> [2] git format-patch -1 ea2f6a77d0c40
>
> > > The first question would be: what does amdgpu need the job for after
> > > free_job() ran? What do you even need a job for still after there was a
> > > timeout?
> >
> > No, we just need the job structure alive as long as the timedout_job()
> > callback is running.
> >
>
> Yes, I agree.
As far as I can see that's how it's already implemented? No one can
free that job while timedout_job() is running in
drm_sched_job_timedout().
>
> > > And if you really still need its contents, can't you memcpy() the job
> > > or something?
> > >
> > > Assuming that it really needs it and that that cannot easily be solved,
> > > I suppose the obvious answer for differing memory life times is
> > > refcounting. So amdgpu could just let drm_sched drop its reference in
> > > free_job(), and from then onward it's amdgpu's problem.
> > >
> > > I hope Matthew can educate us on how Xe does it.
> >
> > We discussed this on XDC and it was Matthew who brought up that this can be
> > solved by running timeout and free worker on the same single threaded wq.
> >
>
> No, see my explanations above. This is not my suggestion.
>
> > >
> > > AFAIK Nouveau doesn't have that problem, because on timeout we just
> > > terminate the channel.
> > >
> > > Would also be interesting to hear whether other driver folks have the
> > > problem of free_job() being racy.
> >
> > I think that this is still a general problem with the drm scheduler and not
> > driver specific at all.
> >
>
> Maybe the free_guilty is likely a scheduler problem but I'm not seeing
> an issue aside from that.
I can't see the bug either; I fail to see how drm_sched could free a job
that's currently in use in timedout_job(). If that can happen,
Christian, Vitaly, please point us to where and how. Only then can we
decide on how to fix it properly.
P.