On Tue, Dec 09, 2025 at 03:19:40PM +0100, Christian König wrote:

Hey all, apologies for the stream of replies today. I looked into this a
bit more after finishing up my day job and had several “WTF” moments.
Trying to unify the discussion starting from this reply [1].

[1] https://patchwork.freedesktop.org/patch/691101/?series=158224&rev=1#comment_1273453

> On 12/9/25 14:51, Philipp Stanner wrote:
> ...
> >>>>>>>> How can free_job_work, through drm_sched_get_finished_job(), get and
> >>>>>>>> free the same job?
> >>>>>>>>
> >>>>>>>
> >>>>>>> It can't.
> >>>>>
> >>>>> But exactly that happens somehow. Don't ask me how, I have no idea.
> >>>
> >>> *Philipp refuses to elaborate and asks Christian*
> >>>
> >>> How are you so sure about that's what's happening? Anyways, assuming it
> >>> is true:
> >>
> >> [  489.134585] 
> >> ==================================================================
> >> [  489.141949] BUG: KASAN: slab-use-after-free in 
> >> amdgpu_device_gpu_recover+0x968/0x990 [amdgpu]
> >> [  489.151339] Read of size 4 at addr ffff88a0d5f4214c by task 
> >> kworker/u128:0/12
> >> [  489.158686] 
> >> [  489.160277] CPU: 11 UID: 0 PID: 12 Comm: kworker/u128:0 Tainted: G      
> >>       E       6.16.0-1289896.3.zuul.0ec208edc00d48a9bae1719675cb777f #1 
> >> PREEMPT(voluntary) 
> >> [  489.160285] Tainted: [E]=UNSIGNED_MODULE
> >> [  489.160288] Hardware name: TYAN B8021G88V2HR-2T/S8021GM2NR-2T, BIOS 
> >> V1.03.B10 04/01/2019
> >> [  489.160292] Workqueue: amdgpu-reset-dev drm_sched_job_timedout 
> >> [gpu_sched]
> >> [  489.160306] Call Trace:
> >> [  489.160308]  <TASK>
> >> [  489.160311]  dump_stack_lvl+0x64/0x80
> >> [  489.160321]  print_report+0xce/0x630
> >> [  489.160328]  ? _raw_spin_lock_irqsave+0x86/0xd0
> >> [  489.160333]  ? __pfx__raw_spin_lock_irqsave+0x10/0x10
> >> [  489.160337]  ? amdgpu_device_gpu_recover+0x968/0x990 [amdgpu]
> >> [  489.161044]  kasan_report+0xb8/0xf0
> >> [  489.161049]  ? amdgpu_device_gpu_recover+0x968/0x990 [amdgpu]
> >> [  489.161756]  amdgpu_device_gpu_recover+0x968/0x990 [amdgpu]
> >> [  489.162464]  ? __pfx_amdgpu_device_gpu_recover+0x10/0x10 [amdgpu]
> >> [  489.163170]  ? amdgpu_coredump+0x1fd/0x4c0 [amdgpu]
> >> [  489.163904]  amdgpu_job_timedout+0x642/0x1400 [amdgpu]
> >> [  489.164698]  ? __pfx__raw_spin_lock+0x10/0x10
> >> [  489.164703]  ? __pfx_amdgpu_job_timedout+0x10/0x10 [amdgpu]
> >> [  489.165496]  ? _raw_spin_lock+0x75/0xc0
> >> [  489.165499]  ? __pfx__raw_spin_lock+0x10/0x10
> >> [  489.165503]  drm_sched_job_timedout+0x1b0/0x4b0 [gpu_sched]
> > 
> > That doesn't show that it's free_job() who freed the memory.
> 
> [  489.405936] Freed by task 2501:
> [  489.409175]  kasan_save_stack+0x20/0x40
> [  489.413122]  kasan_save_track+0x14/0x30
> [  489.417064]  kasan_save_free_info+0x3b/0x60
> [  489.421355]  __kasan_slab_free+0x37/0x50
> [  489.425384]  kfree+0x1fe/0x3f0
> [  489.428547]  drm_sched_free_job_work+0x50e/0x930 [gpu_sched]
> [  489.434326]  process_one_work+0x679/0xff0
>  
> > @Vitaly: Can you reproduce the bug? If yes, adding debug prints
> > printing the jobs' addresses when allocated and when freed in
> > free_job() could be a solution.
> 
> We can reproduce this pretty reliable in our CI now.
> 
> > I repeat, we need more info :)
> > 
> >>
> >>>
> >>>>>
> >>>>> My educated guess is that the job somehow ends up on the pending list 
> >>>>> again.
> >>>
> >>> then the obvious question would be: does amdgpu touch the pending_list
> >>> itself, or does it only ever modify it through proper scheduler APIs?
> >>
> >> My educated guess is that drm_sched_stop() inserted the job back into the 
> >> pending list, but I still have no idea how it is possible that free_job is 
> >> running after the scheduler is stopped.

I believe I found your problem, referencing amdgpu/amdgpu_device.c here.

6718                 if (job)
6719                         ti = amdgpu_vm_get_task_info_pasid(adev, job->pasid);

Which is after:

6695 skip_hw_reset:
6696         r = amdgpu_device_sched_resume(&device_list, reset_context, job_signaled);
6697         if (r)
6698                 goto reset_unlock;

The job is likely added back onto the pending list here:

6676         amdgpu_device_halt_activities(adev, job, reset_context, &device_list,
6677                                       hive, need_emergency_restart);

So free_job runs, the job is freed, and the 'job->pasid' dereference
explodes.

Save off the pasid on the stack at the top of this function and I
suspect your UAF goes away. This won't untangle this hairball of code,
but I believe it at least prevents the explosions.
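
Something along these lines, as an untested sketch (the local variable
name is mine and I'm quoting the surrounding code from memory, so treat
it as illustration only):

	/* At the top of amdgpu_device_gpu_recover(), while job is known valid: */
	u32 job_pasid = job ? job->pasid : 0;

	/* ... halt, reset, resume ... */

	/* At the site around line 6718, where free_job may already have
	 * freed the job, use the saved copy instead of dereferencing it: */
	if (job)
		ti = amdgpu_vm_get_task_info_pasid(adev, job_pasid);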

But let's dig in further. amdgpu_device_halt_activities calls
drm_sched_stop (Xe just calls drm_sched_wqueue_stop, for reference).
drm_sched_stop stops the work items, adds the offending job back onto
the pending list, and then iterates over every pending job: if the
hardware-fence CB can be removed, the job is left on the pending list;
if the CB cannot be removed, the job is taken off the pending list and
free_job is (maybe) called, unless it is the guilty job, in which case
the free_job is deferred to the timed-out job handler so the job
doesn't disappear. Like WTF?

Oh, it gets better. amdgpu_device_sched_resume calls drm_sched_start,
which iterates over the pending list, re-adds the same CB that
drm_sched_stop removed, and then starts the scheduler. So the guilty
job had its CB successfully removed, and now it can immediately
disappear. Also like WTF?

free_guilty is clearly a hack around the job not being reference
counted, and it doesn't even work in some cases. Putting that aside, I
don't think calling free_job should ever happen in TDR. Imagine if
drm_sched_job_timedout just took a reference to the job like normal
kernel code: free_guilty could be dropped, and immediately all of this
becomes safe. Likewise, if the run_job work item held a reference to
the job, taken before adding it to the pending list and dropped once it
is done touching the job in that function, then the run_job and
free_job work items could safely execute in parallel rather than
relying on an ordered workqueue to keep that part of the code safe.
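
To illustrate, a rough sketch of the TDR side (none of the get/put
helpers below exist today; assume a refcount were added to
struct drm_sched_job):

	static void drm_sched_job_timedout(struct work_struct *work)
	{
		struct drm_gpu_scheduler *sched =
			container_of(work, struct drm_gpu_scheduler, work_tdr.work);
		struct drm_sched_job *job;

		spin_lock(&sched->job_list_lock);
		job = list_first_entry_or_null(&sched->pending_list,
					       struct drm_sched_job, list);
		if (job)
			drm_sched_job_get(job);	/* hypothetical: hold job across TDR */
		spin_unlock(&sched->job_list_lock);

		if (job) {
			sched->ops->timedout_job(job);
			drm_sched_job_put(job);	/* hypothetical: may be the final free */
		}
		/* No free_guilty dance needed; the reference keeps the job
		 * alive even if free_job runs concurrently. */
	}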

> >>
> > 
> > And my uneducated guess is that it's happening in amdgpu. It seems a
> > sched_job lives inside an amdgpu_job. Can the latter be freed at other
> > places than free_job()?
> 
> Nope, except for error handling during creation and initialization.
> 
> > timedout_job() and free_job() cannot race against each other regarding
> > jobs. It's locked.
> > 
> > But maybe investigate Matthew's suggestion and look into the guilty
> > mechanism, too.
> 
> That looks just like a leftover from earlier attempts to fix the same problem.
> 
> I mean look at the git history of how often that problem came up...
> 
> Regards,
> Christian.
> 
> >>>>>>> At line 1269, the run_job work item is interrupted. Timed-out jobs 
> >>>>>>> run,
> >>>>>>> call free_job, which performs the final put. Then the run_job work 
> >>>>>>> item
> >>>>>>> resumes—and boom, UAF. Using the same reasoning, I think moving 
> >>>>>>> free_job
> >>>>>>> to the timed-out work queue could also cause issues.
> >>>>>>>
> >>>>>>> If run_job work item took a reference to the job before adding it to 
> >>>>>>> the
> >>>>>>> pending list and dropped it after it was done touching it in this
> >>>>>>> function, then yes, that would be safe. This is an argument for moving
> >>>>>>> reference counting into the base DRM scheduler class, it would make
> >>>>>>
> >>>>>> typo: s/DRM scheduler class/DRM job class
> >>>>>
> >>>>> That strongly sounds like re-inventing the scheduler fence.
> >>>>>
> >>>>
> >>>> Perhaps.
> >>>>
> >>>>> What if we completely drop the job object? Or merge it into the 
> >>>>> scheduler fence?
> >>>>>
> >>>>> The fence has reference counting, proper state transitions and a well 
> >>>>> defined lifetime.
> >>>>>
> >>>>> We would just need ->schedule and ->finished functions instead of 
> >>>>> ->run_job and ->free_job. Those callbacks would then still be called by 
> >>>>> the scheduler in work item context instead of the irq context of the 
> >>>>> dma_fence callbacks.
> >>>>
> >>>> Yes, definitely no IRQ contexts.
> >>>>
> >>>>>
> >>>>> The job can then be a void* in the scheduler fence where drivers can 
> >>>>> put anything they want and also drivers control the lifetime of that. 
> >>>>> E.g. they can free it during ->schedule as well as during ->finished.
> >>>>>
> >>>>
> >>>> I think this is a reasonable idea, but it would require major surgery
> >>>> across the subsystem plus the 11 upstream drivers I’m counting that use
> >>>> DRM scheduler. This would be a huge coordinated effort.
> >>>>
> >>>> So I see three options:
> >>>>
> >>>> 1. Rename free_job to put_job and document usage. Rip out free_guilty.
> >>>>    Likely the easiest and least invasive.
> >>>
> >>> I think I can live with that. It's work-effort wise the best we can do
> >>> right now. However, that does need proper documentation.
> >>
> >> I think that is the worst of all options.
> >>
> >> It basically says to the driver that the job lifetime problems created by 
> >> the scheduler is the driver problem and need to be worked around there.
> >>
> > 
> > My POV still mostly is that (with the current design) the driver must
> > not use jobs after free_job() was invoked. And when that happens is
> > unpredictable.
> > 

This is somewhat of an absurd statement from my point of view. I have a
valid job pointer, then I call another function (see above for an
example of how drm_sched_start/stop is unsafe) and it disappears behind
my back. The safe way to handle this is to take a local reference before
doing anything that could make it disappear. That is much more
reasonable than saying, “we have five things you can do in the
scheduler, and if you do any of them it isn’t safe to touch the job
afterward.”
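
In amdgpu's TDR path that pattern would look roughly like this (again
with hypothetical drm_sched_job_get()/put() helpers, purely to show the
shape, not working code):

	drm_sched_job_get(&job->base);	/* local reference, job can't vanish */

	amdgpu_device_halt_activities(adev, job, reset_context, &device_list,
				      hive, need_emergency_restart);
	/* ... hw reset ... */
	r = amdgpu_device_sched_resume(&device_list, reset_context, job_signaled);

	/* free_job may have run by now, but our reference keeps job valid. */
	ti = amdgpu_vm_get_task_info_pasid(adev, job->pasid);

	drm_sched_job_put(&job->base);	/* drop the local reference */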

> > To be unfair, we already have strange refcount expectations already.
> > But I sort of agree that there is no objectively good solution in
> > sight.
> > 
> >>>
> >>> Let me respin to my documentation series and upstream that soonish,
> >>> than we can build on top of that.
> > 
> >>>
> >>>
> >>> P.
> >>>
> >>>>
> >>>> 2. Move reference counting to the base DRM scheduler job object, provide 
> >>>> a
> >>>>    vfunc for the final job put, and document usage. Medium invasive.
> >>
> >> I strongly think that reference counting the job object just because the 
> >> scheduler needs it is a bad idea.
> >>
> >> With that we are just moving the hot potato from one side of our mouth to 
> >> the other without really solving the issue.
> >>

See above—I can’t say I agree with this assessment. I think the lack of
reference counting is exactly the problem. I don’t really understand the
pushback on a very well-understood concept (reference counting) in
Linux. I would sign up to fix the entire subsystem if we go this route.
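
To make option #2 above a bit more concrete, roughly something like
this (sketch only; the field and helper names are made up):

	struct drm_sched_job {
		/* ... existing fields ... */
		struct kref refcount;
	};

	static void drm_sched_job_release(struct kref *ref)
	{
		struct drm_sched_job *job =
			container_of(ref, struct drm_sched_job, refcount);

		/* Today's free_job callback becomes the final-put release hook. */
		job->sched->ops->free_job(job);
	}

	static inline struct drm_sched_job *drm_sched_job_get(struct drm_sched_job *job)
	{
		kref_get(&job->refcount);
		return job;
	}

	static inline void drm_sched_job_put(struct drm_sched_job *job)
	{
		kref_put(&job->refcount, drm_sched_job_release);
	}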

> >> If a driver like XE needs that for some reason than that is perfectly fine.
> > 
> > Nouveau doesn't need it either.
> > 
> >>
> >>>> 3. Move job (driver) side tracking to the scheduler fence and let it
> >>>>    control the lifetime. Very invasive.
> >>
> >> Thinking about it more that is actually not so much of a problem.
> >>
> >> Let me try to code something together by the end of next week or so.
> > 
> > The hero Gotham needs :)
> > 

Are you sure about this one? Unless the problems around
drm_sched_start/stop and free_guilty are fixed, my feeling is that this
entire thing is still badly broken for anyone who wants to use those.

To sum up this whole email: I strongly disagree with option #3, but if
that is the consensus, I will, of course, support the effort.

Matt

> > 
> > P.
> 
