On 12/9/25 12:00, Philipp Stanner wrote:
> On Mon, 2025-12-08 at 13:39 -0800, Matthew Brost wrote:
>> On Mon, Dec 08, 2025 at 08:43:03PM +0100, Christian König wrote:
>>> On 12/8/25 20:02, Matthew Brost wrote:
>>>> On Mon, Dec 08, 2025 at 10:58:42AM -0800, Matthew Brost wrote:
>>>>> On Mon, Dec 08, 2025 at 11:35:33AM +0100, Philipp Stanner wrote:
>>>>>> On Fri, 2025-12-05 at 09:30 -0800, Matthew Brost wrote:
>>>>>>> On Fri, Dec 05, 2025 at 11:18:21AM +0100, Christian König wrote:
>>>>>>>> On 12/1/25 10:04, Philipp Stanner wrote:
>>> ....
>>>>>>>>> This entire fix idea seems to circle around the concept of relying yet
>>>>>>>>> again on the scheduler's internal behavior (i.e., when it schedules the
>>>>>>>>> call to free_job()).
>>>>>>>>>
>>>>>>>>> I think we discussed that at XDC, but the way I see it, if drivers have
>>>>>>>>> strange job lifetime requirements where a job shall outlive drm_sched's
>>>>>>>>> free_job() call, they must solve that with a proper synchronization
>>>>>>>>> mechanism.
>>>>>>>>
>>>>>>>> Well that is not correct as far as I can see.
>>>>>>>>
>>>>>>>> The problem here is rather that the scheduler gives the job as a
>>>>>>>> parameter to the timedout_job() callback, but doesn't guarantee that the
>>>>>>>> ->free_job() callback isn't called while timedout_job() runs.
>>>>>>>>
>>>>>>>> This should be prevented by removing the job in question from the
>>>>>>>> pending list (see drm_sched_job_timedout), but for some reason that
>>>>>>>> doesn't seem to work correctly.
>>>>>>>>
>>>>>>>
>>>>>>> Are you sure this is happening? It doesn’t seem possible, nor have I
>>>>>>> observed it.
>>>>>>
>>>>>> It's impossible, isn't it?
>>>>>>
>>>>>> static void drm_sched_job_timedout(struct work_struct *work)
>>>>>> {
>>>>>> 	struct drm_gpu_scheduler *sched;
>>>>>> 	struct drm_sched_job *job;
>>>>>> 	enum drm_gpu_sched_stat status = DRM_GPU_SCHED_STAT_RESET;
>>>>>>
>>>>>> 	sched = container_of(work, struct drm_gpu_scheduler, work_tdr.work);
>>>>>>
>>>>>> 	/* Protects against concurrent deletion in drm_sched_get_finished_job */
>>>>>> 	spin_lock(&sched->job_list_lock);
>>>>>> 	job = list_first_entry_or_null(&sched->pending_list,
>>>>>> 				       struct drm_sched_job, list);
>>>>>>
>>>>>> 	if (job) {
>>>>>> 		/*
>>>>>> 		 * Remove the bad job so it cannot be freed by a concurrent
>>>>>> 		 * &struct drm_sched_backend_ops.free_job. It will be
>>>>>> 		 * reinserted after the scheduler's work items have been
>>>>>> 		 * cancelled, at which point it's safe.
>>>>>> 		 */
>>>>>> 		list_del_init(&job->list);
>>>>>> 		spin_unlock(&sched->job_list_lock);
>>>>>>
>>>>>> 		status = job->sched->ops->timedout_job(job);
>>>>>>
>>>>>>
>>>>>> 1. scheduler takes list lock
>>>>>> 2. removes job from list
>>>>>> 3. unlocks
>>>>>> 4. calls timedout_job callback
>>>>>>
>>>>>>
>>>>>> How can free_job_work, through drm_sched_get_finished_job(), get and
>>>>>> free the same job?
>>>>>>
>>>>>
>>>>> It can't.
>>>
>>> But exactly that happens somehow. Don't ask me how, I have no idea.
>
> *Philipp refuses to elaborate and asks Christian*
>
> How are you so sure that's what's happening? Anyway, assuming it
> is true:
[ 489.134585]
==================================================================
[ 489.141949] BUG: KASAN: slab-use-after-free in
amdgpu_device_gpu_recover+0x968/0x990 [amdgpu]
[ 489.151339] Read of size 4 at addr ffff88a0d5f4214c by task kworker/u128:0/12
[ 489.158686]
[ 489.160277] CPU: 11 UID: 0 PID: 12 Comm: kworker/u128:0 Tainted: G
E 6.16.0-1289896.3.zuul.0ec208edc00d48a9bae1719675cb777f #1
PREEMPT(voluntary)
[ 489.160285] Tainted: [E]=UNSIGNED_MODULE
[ 489.160288] Hardware name: TYAN B8021G88V2HR-2T/S8021GM2NR-2T, BIOS
V1.03.B10 04/01/2019
[ 489.160292] Workqueue: amdgpu-reset-dev drm_sched_job_timedout [gpu_sched]
[ 489.160306] Call Trace:
[ 489.160308] <TASK>
[ 489.160311] dump_stack_lvl+0x64/0x80
[ 489.160321] print_report+0xce/0x630
[ 489.160328] ? _raw_spin_lock_irqsave+0x86/0xd0
[ 489.160333] ? __pfx__raw_spin_lock_irqsave+0x10/0x10
[ 489.160337] ? amdgpu_device_gpu_recover+0x968/0x990 [amdgpu]
[ 489.161044] kasan_report+0xb8/0xf0
[ 489.161049] ? amdgpu_device_gpu_recover+0x968/0x990 [amdgpu]
[ 489.161756] amdgpu_device_gpu_recover+0x968/0x990 [amdgpu]
[ 489.162464] ? __pfx_amdgpu_device_gpu_recover+0x10/0x10 [amdgpu]
[ 489.163170] ? amdgpu_coredump+0x1fd/0x4c0 [amdgpu]
[ 489.163904] amdgpu_job_timedout+0x642/0x1400 [amdgpu]
[ 489.164698] ? __pfx__raw_spin_lock+0x10/0x10
[ 489.164703] ? __pfx_amdgpu_job_timedout+0x10/0x10 [amdgpu]
[ 489.165496] ? _raw_spin_lock+0x75/0xc0
[ 489.165499] ? __pfx__raw_spin_lock+0x10/0x10
[ 489.165503] drm_sched_job_timedout+0x1b0/0x4b0 [gpu_sched]
>
>>>
>>> My educated guess is that the job somehow ends up on the pending list again.
>
> then the obvious question would be: does amdgpu touch the pending_list
> itself, or does it only ever modify it through proper scheduler APIs?
My educated guess is that drm_sched_stop() inserted the job back into the
pending list, but I still have no idea how it is possible that free_job is
running after the scheduler is stopped.
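For reference, the re-insertion I mean is this bit of drm_sched_stop() (heavily
trimmed and quoted from memory, so please double-check against the tree you are
debugging):

void drm_sched_stop(struct drm_gpu_scheduler *sched, struct drm_sched_job *bad)
{
	...
	/* Put the bad job that drm_sched_job_timedout() removed back onto the
	 * pending list, at the head, since it was the oldest job. */
	if (bad && bad->sched == sched)
		list_add(&bad->list, &sched->pending_list);
	...
}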
>>>
>>>>>
>>>>>> The pending_list is probably the one place where we actually lock
>>>>>> consistently and sanely.
>>>>>>
>>>>>> I think this needs to be debugged more intensively, Christian.
>>>>>>
>>>>>>
>>>>>>>
>>>>>>> What actually looks like a problem is that in drm_sched_job_timedout,
>>>>>>> free_job can be called. Look at [2]—if you’re using free_guilty (Xe
>>>>>>> isn’t, but [2] was Xe trying to do the same thing), this is actually
>>>>>>> unsafe. The free_guilty code should likely be removed as that definitely
>>>>>>> can explode under the right conditions.
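The free_guilty path being referred to here sits right after the timedout_job()
call in drm_sched_job_timedout(); roughly, quoted from memory:

		status = job->sched->ops->timedout_job(job);

		/*
		 * Guilty job did complete and hence needs to be manually
		 * removed. See drm_sched_stop() doc.
		 */
		if (sched->free_guilty) {
			job->sched->ops->free_job(job);
			sched->free_guilty = false;
		}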
>>>>>>
>>>>>> I'm right now not even sure why free_guilty exists, but I don't see how
>>>>>
>>>>> I'm not sure why free_guilty exists either. If the fence is signaled in
>>>>> timedout_job, free_job will get scheduled on another work item.
>>>>>
>>>>>> it's illegal for drm_sched to call free_job in drm_sched_job_timedout?
>>>>>>
>>>>>> free_job can be called at any point in time, drivers must expect that.
>>>>>> No lock is being held, and timedout_job already ran. So what's the
>>>>>> problem?
>>>>>>
>>>>>> For drivers with additional refcounting it would be even less of a
>>>>>> problem.
>>>>>>
>>>>>
>>>>> No, the scheduler can still reference the job.
>>>>>
>>>>> 1265 fence = sched->ops->run_job(sched_job);
>>>>> 1266 complete_all(&entity->entity_idle);
>>>>> 1267 drm_sched_fence_scheduled(s_fence, fence);
>>>>> 1268
>>>>> 1269 if (!IS_ERR_OR_NULL(fence)) {
>>>>> 1270 r = dma_fence_add_callback(fence, &sched_job->cb,
>>>>> 1271 drm_sched_job_done_cb);
>>>>> 1272 if (r == -ENOENT)
>>>>> 1273 drm_sched_job_done(sched_job, fence->error);
>>>>> 1274 else if (r)
>>>>> 1275 DRM_DEV_ERROR(sched->dev, "fence add
>>>>> callback failed (%d)\n", r);
>>>>> 1276
>>>>> 1277 dma_fence_put(fence);
>>>>> 1278 } else {
>>>>> 1279 drm_sched_job_done(sched_job, IS_ERR(fence) ?
>>>>> 1280 PTR_ERR(fence) : 0);
>>>>> 1281 }
>>>>> 1282
>>>>> 1283 wake_up(&sched->job_scheduled);
>>>>> 1284 drm_sched_run_job_queue(sched);
>>>>>
>>>>> At line 1269, the run_job work item is interrupted. The timeout path runs,
>>>>> calls free_job, which performs the final put. Then the run_job work item
>>>>> resumes and, boom, UAF. Using the same reasoning, I think moving free_job
>>>>> to the timed-out work queue could also cause issues.
>>>>>
>>>>> If the run_job work item took a reference to the job before adding it to
>>>>> the pending list and dropped it after it was done touching it in this
>>>>> function, then yes, that would be safe. This is an argument for moving
>>>>> reference counting into the base DRM scheduler class; it would make
>>>> typo: s/DRM scheduler class/DRM job class
>>>
>>> That strongly sounds like re-inventing the scheduler fence.
>>>
>>
>> Perhaps.
>>
>>> What if we completely drop the job object? Or merge it into the scheduler
>>> fence?
>>>
>>> The fence has reference counting, proper state transitions and a well
>>> defined lifetime.
>>>
>>> We would just need ->schedule and ->finished functions instead of ->run_job
>>> and ->free_job. Those callbacks would then still be called by the scheduler
>>> in work item context instead of the irq context of the dma_fence callbacks.
>>
>> Yes, definitely no IRQ contexts.
>>
>>>
>>> The job can then be a void* in the scheduler fence where drivers can put
>>> anything they want, and the drivers also control the lifetime of that. E.g.
>>> they can free it during ->schedule as well as during ->finished.
>>>
>>
>> I think this is a reasonable idea, but it would require major surgery
>> across the subsystem plus the 11 upstream drivers I’m counting that use
>> DRM scheduler. This would be a huge coordinated effort.
>>
>> So I see three options:
>>
>> 1. Rename free_job to put_job and document usage. Rip out free_guilty.
>> Likely the easiest and least invasive.
>
> I think I can live with that. Work-effort wise, it's the best we can do
> right now. However, that does need proper documentation.
I think that is the worst of all options.
It basically says to the driver that the job lifetime problems created by the
scheduler are the driver's problem and need to be worked around there.
>
> Let me respin my documentation series and upstream that soonish, then
> we can build on top of that.
>
>
> P.
>
>>
>> 2. Move reference counting to the base DRM scheduler job object, provide a
>> vfunc for the final job put, and document usage. Medium invasive.
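To make option 2 concrete, here is a rough sketch of what it could look like;
the refcount field, the release_job vfunc and the release helper below are all
hypothetical, nothing like this exists in the scheduler today:

struct drm_sched_job {
	...
	struct kref			refcount;
};

struct drm_sched_backend_ops {
	...
	/* Final put of the job, replaces ->free_job(). Called from whichever
	 * work item drops the last reference. */
	void	(*release_job)(struct drm_sched_job *job);
};

/* drm_sched_job_release() would be the kref release callback invoking
 * ops->release_job(). In the run_job work item, around the window described
 * further up in the thread: */
	kref_get(&sched_job->refcount);
	fence = sched->ops->run_job(sched_job);
	...
	kref_put(&sched_job->refcount, drm_sched_job_release);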
I strongly think that reference counting the job object just because the
scheduler needs it is a bad idea.
With that we are just moving the hot potato from one side of our mouth to the
other without really solving the issue.
If a driver like XE needs that for some reason, then that is perfectly fine.
>> 3. Move job (driver) side tracking to the scheduler fence and let it
>> control the lifetime. Very invasive.
Thinking about it more, that is actually not so much of a problem.
Let me try to code something together by the end of next week or so.
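Roughly what I have in mind, just as a sketch (all names below are placeholders,
none of this exists yet):

struct drm_sched_fence {
	...
	/* Replaces struct drm_sched_job. The driver fully controls the
	 * lifetime of this payload; it can be freed from ->schedule() as
	 * well as from ->finished(). */
	void				*driver_data;
};

struct drm_sched_backend_ops {
	/* Called from work item context when the fence is picked to run,
	 * replaces ->run_job(). Returns the HW fence to wait on. */
	struct dma_fence *(*schedule)(struct drm_sched_fence *fence);

	/* Called from work item context once the HW fence signaled,
	 * replaces ->free_job(). */
	void (*finished)(struct drm_sched_fence *fence);
};
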
Regards,
Christian.
>>
>> I’ll support any option, but my personal bandwidth to dive into
>> something like #3 just isn’t there (of course, I can help review
>> scheduler changes and fix up Xe, etc.).
>>
>> Matt
>>
>>> Christian.
>>>
>>>>
>>>> Matt
>>>>
>>>>> ownership clear rather than relying on ordered work queues to keep
>>>>> everything safe.
>>>>>
>>>>>>>
>>>>>>> [2] git format-patch -1 ea2f6a77d0c40
>>>>>>>
>>>>>>>>> The first question would be: what does amdgpu need the job for after
>>>>>>>>> free_job() ran? What do you even need a job for still after there was
>>>>>>>>> a timeout?
>>>>>>>>
>>>>>>>> No, we just need the job structure alive as long as the timedout_job()
>>>>>>>> callback is running.
>>>>>>>>
>>>>>>>
>>>>>>> Yes, I agree.
>>>>>>
>>>>>> As far as I can see that's how it's already implemented? No one can
>>>>>> free that job while timedout_job() is running in
>>>>>> drm_sched_job_timedout().
>>>>>>
>>>>>
>>>>> See above, free_guilty is still problematic.
>>>>>
>>>>>>>
>>>>>>>>> And if you really still need its contents, can't you memcpy() the job
>>>>>>>>> or something?
>>>>>>>>>
>>>>>>>>> Assuming that it really needs it and that that cannot easily be solved,
>>>>>>>>> I suppose the obvious answer for differing memory lifetimes is
>>>>>>>>> refcounting. So amdgpu could just let drm_sched drop its reference in
>>>>>>>>> free_job(), and from then onward it's amdgpu's problem.
>>>>>>>>>
>>>>>>>>> I hope Matthew can educate us on how Xe does it.
>>>>>>>>
>>>>>>>> We discussed this at XDC and it was Matthew who brought up that this
>>>>>>>> can be solved by running the timeout and free workers on the same
>>>>>>>> single-threaded wq.
>>>>>>>>
>>>>>>>
>>>>>>> No, see my explanations above. This is not my suggestion.
>>>>>>>
>>>>>>>>>
>>>>>>>>> AFAIK Nouveau doesn't have that problem, because on timeout we just
>>>>>>>>> terminate the channel.
>>>>>>>>>
>>>>>>>>> Would also be interesting to hear whether other driver folks have the
>>>>>>>>> problem of free_job() being racy.
>>>>>>>>
>>>>>>>> I think that this is still a general problem with the drm scheduler
>>>>>>>> and not driver specific at all.
>>>>>>>>
>>>>>>>
>>>>>>> The free_guilty path is likely a scheduler problem, but I'm not seeing
>>>>>>> an issue aside from that.
>>>>>>
>>>>>> I also can't see the bug. I fail to see how drm_sched can free a job
>>>>>> that's currently in use in timedout_job(). If that can happen,
>>>>>> Christian, Vitaly, please point us to where and how. Only then can we
>>>>>> decide on how to fix it properly.
>>>>>>
>>>>>
>>>>> Again see above.
>>>>>
>>>>> Matt
>>>>>
>>>>>>
>>>>>> P.
>>>>>>
>>>>>>
>>>
>