On Mon, 2025-12-08 at 13:39 -0800, Matthew Brost wrote:
> On Mon, Dec 08, 2025 at 08:43:03PM +0100, Christian König wrote:
> > On 12/8/25 20:02, Matthew Brost wrote:
> > > On Mon, Dec 08, 2025 at 10:58:42AM -0800, Matthew Brost wrote:
> > > > On Mon, Dec 08, 2025 at 11:35:33AM +0100, Philipp Stanner wrote:
> > > > > On Fri, 2025-12-05 at 09:30 -0800, Matthew Brost wrote:
> > > > > > On Fri, Dec 05, 2025 at 11:18:21AM +0100, Christian König wrote:
> > > > > > > On 12/1/25 10:04, Philipp Stanner wrote:
> > ....
> > > > > > > > This entire fix idea seems to circle around the concept of relying
> > > > > > > > yet again on the scheduler's internal behavior (i.e., when it
> > > > > > > > schedules the call to free_job()).
> > > > > > > >
> > > > > > > > I think we discussed that at XDC, but the way I see it, if drivers
> > > > > > > > have strange job lifetime requirements where a job shall outlive
> > > > > > > > drm_sched's free_job() call, they must solve that with a proper
> > > > > > > > synchronization mechanism.
> > > > > > >
> > > > > > > Well that is not correct as far as I can see.
> > > > > > >
> > > > > > > The problem here is rather that the scheduler gives the job as
> > > > > > > parameter to the timedout_job() callback, but doesn't guarantee
> > > > > > > that ->free_job() callback isn't called while timedout_job() runs.
> > > > > > >
> > > > > > > This should be prevented by removing the job in question from the
> > > > > > > pending list (see drm_sched_job_timedout), but for some reason
> > > > > > > that doesn't seem to work correctly.
> > > > > > >
> > > > > >
> > > > > > Are you sure this is happening? It doesn’t seem possible, nor have I
> > > > > > observed it.
> > > > >
> > > > > It's impossible, isn't it?
> > > > >
> > > > > static void drm_sched_job_timedout(struct work_struct *work)
> > > > > {
> > > > > 	struct drm_gpu_scheduler *sched;
> > > > > 	struct drm_sched_job *job;
> > > > > 	enum drm_gpu_sched_stat status = DRM_GPU_SCHED_STAT_RESET;
> > > > >
> > > > > 	sched = container_of(work, struct drm_gpu_scheduler, work_tdr.work);
> > > > >
> > > > > 	/* Protects against concurrent deletion in drm_sched_get_finished_job */
> > > > > 	spin_lock(&sched->job_list_lock);
> > > > > 	job = list_first_entry_or_null(&sched->pending_list,
> > > > > 				       struct drm_sched_job, list);
> > > > >
> > > > > 	if (job) {
> > > > > 		/*
> > > > > 		 * Remove the bad job so it cannot be freed by a concurrent
> > > > > 		 * &struct drm_sched_backend_ops.free_job. It will be
> > > > > 		 * reinserted after the scheduler's work items have been
> > > > > 		 * cancelled, at which point it's safe.
> > > > > 		 */
> > > > > 		list_del_init(&job->list);
> > > > > 		spin_unlock(&sched->job_list_lock);
> > > > >
> > > > > 		status = job->sched->ops->timedout_job(job);
> > > > >
> > > > >
> > > > > 1. scheduler takes list lock
> > > > > 2. removes job from list
> > > > > 3. unlocks
> > > > > 4. calls timedout_job callback
> > > > >
> > > > >
> > > > > How can free_job_work, through drm_sched_get_finished_job(), get and
> > > > > free the same job?
> > > > >
> > > >
> > > > It can't.
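For reference, the counterpart on the free_job side takes the same lock
and only considers jobs that are still on the pending_list. A trimmed
sketch from my reading of drm_sched_get_finished_job() in sched_main.c
(details may differ between kernel versions):

static struct drm_sched_job *
drm_sched_get_finished_job(struct drm_gpu_scheduler *sched)
{
	struct drm_sched_job *job;

	spin_lock(&sched->job_list_lock);

	job = list_first_entry_or_null(&sched->pending_list,
				       struct drm_sched_job, list);

	if (job && dma_fence_is_signaled(&job->s_fence->finished)) {
		/* remove job from pending_list */
		list_del_init(&job->list);
		/* ... */
	} else {
		job = NULL;
	}

	spin_unlock(&sched->job_list_lock);

	return job;
}

A job that drm_sched_job_timedout() has already taken off the
pending_list is therefore simply not visible to this path anymore.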
> >
> > But exactly that happens somehow. Don't ask me how, I have no idea.
*Philipp refuses to elaborate and asks Christian*
How are you so sure that this is what's happening? Anyway, assuming it
is true:
> >
> > My educated guess is that the job somehow ends up on the pending list again.
Then the obvious question would be: does amdgpu ever touch the
pending_list itself, or does it only modify it through the proper
scheduler APIs?
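One path worth double-checking, assuming I'm reading drm_sched_stop()
correctly: the timeout handling itself can put the bad job back, since
drm_sched_stop() does (trimmed):

	if (bad && bad->sched == sched)
		/*
		 * Add at the head of the queue to reflect it was the
		 * earliest job extracted.
		 */
		list_add(&bad->list, &sched->pending_list);

So any driver that calls drm_sched_stop() from its timedout_job callback
(which amdgpu does during reset, if I'm not mistaken) makes the job
visible to the free_job path again. That alone should be harmless while
the scheduler is stopped, but it narrows down where to look.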
> >
> > > >
> > > > > The pending_list is probably the one place where we actually lock
> > > > > consistently and sanely.
> > > > >
> > > > > I think this needs to be debugged more intensively, Christian.
> > > > >
> > > > >
> > > > > >
> > > > > > What actually looks like a problem is that in
> > > > > > drm_sched_job_timedout,
> > > > > > free_job can be called. Look at [2]—if you’re using free_guilty (Xe
> > > > > > isn’t, but [2] was Xe trying to do the same thing), this is actually
> > > > > > unsafe. The free_guilty code should likely be removed as that
> > > > > > definitely
> > > > > > can explode under the right conditions.
> > > > >
> > > > > I'm right now not even sure why free_guilty exists, but I don't see
> > > > > how
> > > >
> > > > I'm not sure why free_guilty exists either. If the fence is signaled in
> > > > timedout_job, free_job will get scheduled on another work item.
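For readers following along, this is the free_guilty path in question, as
far as I understand it (trimmed sketch from drm_sched_job_timedout() and
drm_sched_stop(); exact code may differ by kernel version):

	/* drm_sched_job_timedout(), right after the driver callback: */
	status = job->sched->ops->timedout_job(job);

	/*
	 * Guilty job did complete and hence needs to be manually removed.
	 * See drm_sched_stop() doc.
	 */
	if (sched->free_guilty) {
		job->sched->ops->free_job(job);
		sched->free_guilty = false;
	}

	/* drm_sched_stop(), for a job whose hardware fence has already
	 * signaled (callback could not be removed anymore): */
	if (bad != s_job)
		sched->ops->free_job(s_job);
	else
		sched->free_guilty = true;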
> > > >
> > > > > it's illegal for drm_sched to call free_job in drm_sched_job_timedout?
> > > > >
> > > > > free_job can be called at any point in time, drivers must expect that.
> > > > > No lock is being held, and timedout_job already ran. So what's the
> > > > > problem?
> > > > >
> > > > > For drivers with additional refcounting it would be even less of a
> > > > > problem.
> > > > >
> > > >
> > > > No, the scheduler can still reference the job.
> > > >
> > > > 1265 	fence = sched->ops->run_job(sched_job);
> > > > 1266 	complete_all(&entity->entity_idle);
> > > > 1267 	drm_sched_fence_scheduled(s_fence, fence);
> > > > 1268
> > > > 1269 	if (!IS_ERR_OR_NULL(fence)) {
> > > > 1270 		r = dma_fence_add_callback(fence, &sched_job->cb,
> > > > 1271 					   drm_sched_job_done_cb);
> > > > 1272 		if (r == -ENOENT)
> > > > 1273 			drm_sched_job_done(sched_job, fence->error);
> > > > 1274 		else if (r)
> > > > 1275 			DRM_DEV_ERROR(sched->dev, "fence add callback failed (%d)\n", r);
> > > > 1276
> > > > 1277 		dma_fence_put(fence);
> > > > 1278 	} else {
> > > > 1279 		drm_sched_job_done(sched_job, IS_ERR(fence) ?
> > > > 1280 				   PTR_ERR(fence) : 0);
> > > > 1281 	}
> > > > 1282
> > > > 1283 	wake_up(&sched->job_scheduled);
> > > > 1284 	drm_sched_run_job_queue(sched);
> > > >
> > > > At line 1269, the run_job work item is interrupted. The timeout work runs
> > > > and calls free_job, which performs the final put. Then the run_job work
> > > > item resumes, and boom, UAF. By the same reasoning, I think moving
> > > > free_job to the timeout work queue could also cause issues.
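If I follow, the interleaving would be roughly this (hypothetical, and
assuming free_job really does the final put):

   run_job work item                      timeout path
   -----------------                      ------------
   1265: fence = run_job(sched_job)
   <interrupted before 1269>
                                          timedout_job(job)
                                          free_job(job)   /* final put */
   1270: dma_fence_add_callback(fence,
				&sched_job->cb, ...)   <- UAF on sched_job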
> > > >
> > > > If the run_job work item took a reference to the job before adding it to
> > > > the pending list and dropped it once it was done touching it in this
> > > > function, then yes, that would be safe. This is an argument for moving
> > > > reference counting into the base DRM scheduler class; it would make
> > >
> > > typo: s/DRM scheduler class/DRM job class
> >
> > That strongly sounds like re-inventing the scheduler fence.
> >
>
> Perhaps.
>
> > What if we completely drop the job object? Or merge it into the scheduler
> > fence?
> >
> > The fence has reference counting, proper state transitions and a well
> > defined lifetime.
> >
> > We would just need ->schedule and ->finished functions instead of ->run_job
> > and ->free_job. Those callbacks would then still be called by the scheduler
> > in work item context instead of the irq context of the dma_fence callbacks.
>
> Yes, definitely no IRQ contexts.
>
> >
> > The job can then be a void* in the scheduler fence where drivers can put
> > anything they want, and the drivers also control its lifetime. E.g. they
> > can free it during ->schedule as well as during ->finished.
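Just to make sure I understand the proposal, something along these lines?
(purely hypothetical sketch, none of this is existing API):

struct drm_sched_fence {
	struct dma_fence	scheduled;
	struct dma_fence	finished;
	/* ... */
	void			*job;	/* driver payload, driver-controlled lifetime */
};

struct drm_sched_backend_ops {
	/* replaces run_job, still called in work item context */
	struct dma_fence *(*schedule)(struct drm_sched_fence *fence);

	/* replaces free_job; last time the scheduler looks at fence->job */
	void (*finished)(struct drm_sched_fence *fence);
};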
> >
>
> I think this is a reasonable idea, but it would require major surgery
> across the subsystem plus the 11 upstream drivers I’m counting that use
> DRM scheduler. This would be a huge coordinated effort.
>
> So I see three options:
>
> 1. Rename free_job to put_job and document usage. Rip out free_guilty.
> Likely the easiest and least invasive.
I think I can live with that. Work-effort-wise, it's the best we can do
right now. However, that does need proper documentation.
Let me respin my documentation series and upstream it soonish; then we
can build on top of that.
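Roughly the semantics I'd want spelled out in the kerneldoc, using the
put_job name from option 1 (just a sketch):

	/**
	 * @put_job: drop the scheduler's reference to the job
	 *
	 * Called once the scheduler is done with @sched_job; the scheduler
	 * will not touch the job afterwards. Drivers that need the job to
	 * outlive this call must hold their own reference and treat this
	 * callback as a put rather than an unconditional free.
	 */
	void (*put_job)(struct drm_sched_job *sched_job);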
P.
>
> 2. Move reference counting to the base DRM scheduler job object, provide a
> vfunc for the final job put, and document usage. Medium invasive.
>
> 3. Move job (driver) side tracking to the scheduler fence and let it
> control the lifetime. Very invasive.
>
> I’ll support any option, but my personal bandwidth to dive into
> something like #3 just isn’t there (of course, I can help review
> scheduler changes and fix up Xe, etc.).
>
> Matt
>
> > Christian.
> >
> > >
> > > Matt
> > >
> > > > ownership clear rather than relying on ordered work queues to keep
> > > > everything safe.
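A hypothetical sketch of what such a refcounted base job could look like
(illustrative only, nothing like this exists in drm_sched today):

struct drm_sched_job {
	struct kref		refcount;
	/* ... existing fields ... */

	/* driver-provided final release, called from drm_sched_job_put() */
	void			(*release)(struct drm_sched_job *job);
};

static inline void drm_sched_job_get(struct drm_sched_job *job)
{
	kref_get(&job->refcount);
}

static void drm_sched_job_release(struct kref *ref)
{
	struct drm_sched_job *job = container_of(ref, struct drm_sched_job,
						 refcount);

	job->release(job);
}

static inline void drm_sched_job_put(struct drm_sched_job *job)
{
	kref_put(&job->refcount, drm_sched_job_release);
}

The run_job work item would then take a reference before putting the job
on the pending_list and drop it once it stops touching the job, no matter
what the free_job/put_job callback does in between.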
> > > >
> > > > > >
> > > > > > [2] git format-patch -1 ea2f6a77d0c40
> > > > > >
> > > > > > > > The first question would be: what does amdgpu need the job for
> > > > > > > > after free_job() has run? What do you even still need a job for
> > > > > > > > after there was a timeout?
> > > > > > >
> > > > > > > No, we just need the job structure alive as long as the
> > > > > > > timedout_job() callback is running.
> > > > > > >
> > > > > >
> > > > > > Yes, I agree.
> > > > >
> > > > > As far as I can see that's how it's already implemented? No one can
> > > > > free that job while timedout_job() is running in
> > > > > drm_sched_job_timedout().
> > > > >
> > > >
> > > > See above, free_guilty is still problematic.
> > > >
> > > > > >
> > > > > > > > And if you really still need its contents, can't you memcpy()
> > > > > > > > the job
> > > > > > > > or something?
> > > > > > > >
> > > > > > > > Assuming that it really needs it and that that cannot easily be
> > > > > > > > solved,
> > > > > > > > I suppose the obvious answer for differing memory life times is
> > > > > > > > refcounting. So amdgpu could just let drm_sched drop its
> > > > > > > > reference in
> > > > > > > > free_job(), and from then onward it's amdgpu's problem.
> > > > > > > >
> > > > > > > > I hope Matthew can educate us on how Xe does it.
> > > > > > >
> > > > > > > We discussed this at XDC and it was Matthew who brought up that
> > > > > > > this can be solved by running the timeout and free workers on the
> > > > > > > same single-threaded wq.
> > > > > > >
> > > > > >
> > > > > > No, see my explanations above. This is not my suggestion.
> > > > > >
> > > > > > > >
> > > > > > > > AFAIK Nouveau doesn't have that problem, because on timeout we
> > > > > > > > just
> > > > > > > > terminate the channel.
> > > > > > > >
> > > > > > > > Would also be interesting to hear whether other driver folks
> > > > > > > > have the
> > > > > > > > problem of free_job() being racy.
> > > > > > >
> > > > > > > I think that this is still a general problem with the drm
> > > > > > > scheduler and not driver specific at all.
> > > > > > >
> > > > > >
> > > > > > The free_guilty path is likely a scheduler problem, but I'm not
> > > > > > seeing an issue aside from that.
> > > > >
> > > > > I also can't see the bug. I fail to see how drm_sched can free a job
> > > > > that's currently in use in timedout_job(). If that can happen,
> > > > > Christian, Vitaly, please point us to where and how. Only then can we
> > > > > decide on how to fix it properly.
> > > > >
> > > >
> > > > Again see above.
> > > >
> > > > Matt
> > > >
> > > > >
> > > > > P.
> > > > >
> > > > >
> >