On 12/8/25 20:02, Matthew Brost wrote:
> On Mon, Dec 08, 2025 at 10:58:42AM -0800, Matthew Brost wrote:
>> On Mon, Dec 08, 2025 at 11:35:33AM +0100, Philipp Stanner wrote:
>>> On Fri, 2025-12-05 at 09:30 -0800, Matthew Brost wrote:
>>>> On Fri, Dec 05, 2025 at 11:18:21AM +0100, Christian König wrote:
>>>>> On 12/1/25 10:04, Philipp Stanner wrote:
....
>>>>>> This entire fix idea seems to circle around the concept of relying yet
>>>>>> again on the scheduler's internal behavior (i.e., when it schedules the
>>>>>> call to free_job()).
>>>>>>
>>>>>> I think we discussed that at XDC, but as I see it, if drivers have
>>>>>> strange job lifetime requirements where a job shall outlive
>>>>>> drm_sched's free_job() call, they must solve that with a proper
>>>>>> synchronization mechanism.
>>>>>
>>>>> Well that is not correct as far as I can see.
>>>>>
>>>>> The problem here is rather that the scheduler gives the job as parameter
>>>>> to the timedout_job() callback, but doesn't guarantee that ->free_job()
>>>>> callback isn't called while timedout_job() runs.
>>>>>
>>>>> This should be prevented by removing the job in question from the pending
>>>>> list (see drm_sched_job_timedout), but for some reason that doesn't seem
>>>>> to work correctly.
>>>>>
>>>>
>>>> Are you sure this is happening? It doesn’t seem possible, nor have I
>>>> observed it.
>>>
>>> It's impossible, isn't it?
>>>
>>> static void drm_sched_job_timedout(struct work_struct *work)
>>> {
>>>         struct drm_gpu_scheduler *sched;
>>>         struct drm_sched_job *job;
>>>         enum drm_gpu_sched_stat status = DRM_GPU_SCHED_STAT_RESET;
>>>
>>>         sched = container_of(work, struct drm_gpu_scheduler, work_tdr.work);
>>>
>>>         /* Protects against concurrent deletion in drm_sched_get_finished_job */
>>>         spin_lock(&sched->job_list_lock);
>>>         job = list_first_entry_or_null(&sched->pending_list,
>>>                                        struct drm_sched_job, list);
>>>         if (job) {
>>>                 /*
>>>                  * Remove the bad job so it cannot be freed by a concurrent
>>>                  * &struct drm_sched_backend_ops.free_job. It will be
>>>                  * reinserted after the scheduler's work items have been
>>>                  * cancelled, at which point it's safe.
>>>                  */
>>>                 list_del_init(&job->list);
>>>                 spin_unlock(&sched->job_list_lock);
>>>
>>>                 status = job->sched->ops->timedout_job(job);
>>>
>>>
>>> 1. scheduler takes list lock
>>> 2. removes job from list
>>> 3. unlocks
>>> 4. calls timedout_job callback
>>>
>>>
>>> How can free_job_work, through drm_sched_get_finished_job(), get and
>>> free the same job?
>>>
>>
>> It can't.
But exactly that happens somehow. Don't ask me how, I have no idea.
My educated guess is that the job somehow ends up on the pending list again.
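For reference, this is roughly how the free worker picks a job up (simplified
sketch from memory, not the verbatim upstream code):

        static struct drm_sched_job *
        drm_sched_get_finished_job(struct drm_gpu_scheduler *sched)
        {
                struct drm_sched_job *job;

                spin_lock(&sched->job_list_lock);

                /* Only ever looks at the head of the pending list. */
                job = list_first_entry_or_null(&sched->pending_list,
                                               struct drm_sched_job, list);

                if (job && dma_fence_is_signaled(&job->s_fence->finished))
                        /* Hand it to free_job and drop it from the list. */
                        list_del_init(&job->list);
                else
                        job = NULL;

                spin_unlock(&sched->job_list_lock);

                return job;
        }

As long as the timed-out job stays off the pending list it can never be
returned here; it would have to be re-inserted (e.g. by the reset path) while
timedout_job() is still running.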
>>
>>> The pending_list is probably the one place where we actually lock
>>> consistently and sanely.
>>>
>>> I think this needs to be debugged more intensively, Christian.
>>>
>>>
>>>>
>>>> What actually looks like a problem is that in drm_sched_job_timedout,
>>>> free_job can be called. Look at [2]—if you’re using free_guilty (Xe
>>>> isn’t, but [2] was Xe trying to do the same thing), this is actually
>>>> unsafe. The free_guilty code should likely be removed as that definitely
>>>> can explode under the right conditions.
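For reference, the free_guilty bit in drm_sched_job_timedout() looks roughly
like this (simplified, from memory):

        status = job->sched->ops->timedout_job(job);

        /*
         * Guilty job did complete and hence needs to be manually removed.
         * See drm_sched_stop() doc.
         */
        if (sched->free_guilty) {
                job->sched->ops->free_job(job);
                sched->free_guilty = false;
        }

free_guilty is set by drm_sched_stop() when the bad job's hardware fence had
already signaled, so the job gets freed directly from the timeout handler here.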
>>>
>>> I'm right now not even sure why free_guilty exists, but I don't see how
>>
>> I'm not sure why free_guilty exists either. If the fence is signaled in
>> timedout_job, free_job will get scheduled on another work item.
>>
>>> it's illegal for drm_sched to call free_job in drm_sched_job_timedout?
>>>
>>> free_job can be called at any point in time, drivers must expect that.
>>> No lock is being held, and timedout_job already ran. So what's the
>>> problem?
>>>
>>> For drivers with additional refcounting it would be even less of a
>>> problem.
>>>
>>
>> No, the scheduler can still reference the job.
>>
>> 1265         fence = sched->ops->run_job(sched_job);
>> 1266         complete_all(&entity->entity_idle);
>> 1267         drm_sched_fence_scheduled(s_fence, fence);
>> 1268
>> 1269         if (!IS_ERR_OR_NULL(fence)) {
>> 1270                 r = dma_fence_add_callback(fence, &sched_job->cb,
>> 1271                                            drm_sched_job_done_cb);
>> 1272                 if (r == -ENOENT)
>> 1273                         drm_sched_job_done(sched_job, fence->error);
>> 1274                 else if (r)
>> 1275                         DRM_DEV_ERROR(sched->dev, "fence add callback failed (%d)\n", r);
>> 1276
>> 1277                 dma_fence_put(fence);
>> 1278         } else {
>> 1279                 drm_sched_job_done(sched_job, IS_ERR(fence) ?
>> 1280                                    PTR_ERR(fence) : 0);
>> 1281         }
>> 1282
>> 1283         wake_up(&sched->job_scheduled);
>> 1284         drm_sched_run_job_queue(sched);
>>
>> At line 1269, the run_job work item is interrupted. The timed-out job
>> work runs, calls free_job, which performs the final put. Then the
>> run_job work item resumes and boom, UAF. By the same reasoning, I think
>> moving free_job to the timed-out work queue could also cause issues.
>>
>> If the run_job work item took a reference to the job before adding it to
>> the pending list and dropped it after it was done touching it in this
>> function, then yes, that would be safe. This is an argument for moving
>> reference counting into the base DRM scheduler class, as it would make
>
> typo: s/DRM scheduler class/DRM job class
That strongly sounds like re-inventing the scheduler fence.
What if we completely drop the job object? Or merge it into the scheduler fence?
The fence has reference counting, proper state transitions and a well-defined
lifetime.
We would just need ->schedule and ->finished functions instead of ->run_job and
->free_job. Those callbacks would then still be called by the scheduler in work
item context instead of the irq context of the dma_fence callbacks.
The job can then be a void* in the scheduler fence where drivers can put
anything they want, and the drivers also control its lifetime. E.g. they can
free it during ->schedule as well as during ->finished.
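Very roughly, something like this (just a sketch, none of these names exist
today):

        struct drm_sched_fence {
                ...
                /* Driver payload, lifetime fully controlled by the driver. */
                void *job;
        };

        struct drm_sched_backend_ops {
                /* Replaces ->run_job(), called from the scheduler's work item. */
                struct dma_fence *(*schedule)(struct drm_sched_fence *s_fence);

                /*
                 * Replaces ->free_job(), called from the scheduler's work item
                 * once the hardware fence signaled. The driver may free
                 * s_fence->job here (or already in ->schedule()).
                 */
                void (*finished)(struct drm_sched_fence *s_fence);
        };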
Christian.
>
> Matt
>
>> ownership clear rather than relying on ordered work queues to keep
>> everything safe.
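For illustration, that could look roughly like this in drm_sched_run_job_work(),
with a hypothetical drm_sched_job_get()/put() that does not exist today:

        drm_sched_job_get(sched_job);           /* ref for this work item */

        drm_sched_job_begin(sched_job);         /* job becomes visible on pending_list */

        fence = sched->ops->run_job(sched_job);
        complete_all(&entity->entity_idle);
        drm_sched_fence_scheduled(s_fence, fence);

        /*
         * From here on a concurrent free_job() may drop its reference, but
         * ours keeps the job memory alive until we are done with it.
         */
        ...

        drm_sched_job_put(sched_job);           /* last touch in this work item */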
>>
>>>>
>>>> [2] git format-patch -1 ea2f6a77d0c40
>>>>
>>>>>> The first question would be: what does amdgpu need the job for after
>>>>>> free_job() ran? What do you even need a job for still after there was a
>>>>>> timeout?
>>>>>
>>>>> No, we just need the job structure alive as long as the timedout_job()
>>>>> callback is running.
>>>>>
>>>>
>>>> Yes, I agree.
>>>
>>> As far as I can see that's how it's already implemented? No one can
>>> free that job while timedout_job() is running in
>>> drm_sched_job_timedout().
>>>
>>
>> See above, free_guilty is still problematic.
>>
>>>>
>>>>>> And if you really still need its contents, can't you memcpy() the job
>>>>>> or something?
>>>>>>
>>>>>> Assuming that it really needs it and that that cannot easily be solved,
>>>>>> I suppose the obvious answer for differing memory life times is
>>>>>> refcounting. So amdgpu could just let drm_sched drop its reference in
>>>>>> free_job(), and from then onward it's amdgpu's problem.
>>>>>>
>>>>>> I hope Matthew can educate us on how Xe does it.
>>>>>
>>>>> We discussed this at XDC and it was Matthew who brought up that this can
>>>>> be solved by running timeout and free worker on the same single-threaded
>>>>> wq.
>>>>>
>>>>
>>>> No, see my explanations above. This is not my suggestion.
>>>>
>>>>>>
>>>>>> AFAIK Nouveau doesn't have that problem, because on timeout we just
>>>>>> terminate the channel.
>>>>>>
>>>>>> Would also be interesting to hear whether other driver folks have the
>>>>>> problem of free_job() being racy.
>>>>>
>>>>> I think that this is still a general problem with the drm scheduler and
>>>>> not driver specific at all.
>>>>>
>>>>
>>>> free_guilty is likely a scheduler problem, but I'm not seeing
>>>> an issue aside from that.
>>>
>>> I also can't see the bug. I fail to see how drm_sched can free a job
>>> that's currently in use in timedout_job(). If that can happen,
>>> Christian, Vitaly, please point us to where and how. Only then can we
>>> decide on how to fix it properly.
>>>
>>
>> Again see above.
>>
>> Matt
>>
>>>
>>> P.
>>>
>>>