On 12/9/25 14:51, Philipp Stanner wrote:
...
>>>>>>>> How can free_job_work, through drm_sched_get_finished_job(), get and
>>>>>>>> free the same job?
>>>>>>>>
>>>>>>>
>>>>>>> It can't.
>>>>>
>>>>> But exactly that happens somehow. Don't ask me how, I have no idea.
>>>
>>> *Philipp refuses to elaborate and asks Christian*
>>>
>>> How are you so sure that's what's happening? Anyways, assuming it
>>> is true:
>>
>> [ 489.134585]
>> ==================================================================
>> [ 489.141949] BUG: KASAN: slab-use-after-free in
>> amdgpu_device_gpu_recover+0x968/0x990 [amdgpu]
>> [ 489.151339] Read of size 4 at addr ffff88a0d5f4214c by task
>> kworker/u128:0/12
>> [ 489.158686]
>> [ 489.160277] CPU: 11 UID: 0 PID: 12 Comm: kworker/u128:0 Tainted: G
>> E 6.16.0-1289896.3.zuul.0ec208edc00d48a9bae1719675cb777f #1
>> PREEMPT(voluntary)
>> [ 489.160285] Tainted: [E]=UNSIGNED_MODULE
>> [ 489.160288] Hardware name: TYAN B8021G88V2HR-2T/S8021GM2NR-2T, BIOS
>> V1.03.B10 04/01/2019
>> [ 489.160292] Workqueue: amdgpu-reset-dev drm_sched_job_timedout [gpu_sched]
>> [ 489.160306] Call Trace:
>> [ 489.160308] <TASK>
>> [ 489.160311] dump_stack_lvl+0x64/0x80
>> [ 489.160321] print_report+0xce/0x630
>> [ 489.160328] ? _raw_spin_lock_irqsave+0x86/0xd0
>> [ 489.160333] ? __pfx__raw_spin_lock_irqsave+0x10/0x10
>> [ 489.160337] ? amdgpu_device_gpu_recover+0x968/0x990 [amdgpu]
>> [ 489.161044] kasan_report+0xb8/0xf0
>> [ 489.161049] ? amdgpu_device_gpu_recover+0x968/0x990 [amdgpu]
>> [ 489.161756] amdgpu_device_gpu_recover+0x968/0x990 [amdgpu]
>> [ 489.162464] ? __pfx_amdgpu_device_gpu_recover+0x10/0x10 [amdgpu]
>> [ 489.163170] ? amdgpu_coredump+0x1fd/0x4c0 [amdgpu]
>> [ 489.163904] amdgpu_job_timedout+0x642/0x1400 [amdgpu]
>> [ 489.164698] ? __pfx__raw_spin_lock+0x10/0x10
>> [ 489.164703] ? __pfx_amdgpu_job_timedout+0x10/0x10 [amdgpu]
>> [ 489.165496] ? _raw_spin_lock+0x75/0xc0
>> [ 489.165499] ? __pfx__raw_spin_lock+0x10/0x10
>> [ 489.165503] drm_sched_job_timedout+0x1b0/0x4b0 [gpu_sched]
>
> That doesn't show that it's free_job() who freed the memory.

[ 489.405936] Freed by task 2501:
[ 489.409175] kasan_save_stack+0x20/0x40
[ 489.413122] kasan_save_track+0x14/0x30
[ 489.417064] kasan_save_free_info+0x3b/0x60
[ 489.421355] __kasan_slab_free+0x37/0x50
[ 489.425384] kfree+0x1fe/0x3f0
[ 489.428547] drm_sched_free_job_work+0x50e/0x930 [gpu_sched]
[ 489.434326] process_one_work+0x679/0xff0

> @Vitaly: Can you reproduce the bug? If yes, adding debug prints
> printing the jobs' addresses when allocated and when freed in
> free_job() could be a solution.

We can reproduce this pretty reliably in our CI now.

> I repeat, we need more info :)
>
>>
>>>
>>>>>
>>>>> My educated guess is that the job somehow ends up on the pending list
>>>>> again.
>>>
>>> then the obvious question would be: does amdgpu touch the pending_list
>>> itself, or does it only ever modify it through proper scheduler APIs?
>>
>> My educated guess is that drm_sched_stop() inserted the job back into the
>> pending list, but I still have no idea how it is possible that free_job is
>> running after the scheduler is stopped.
>>
>
> And my uneducated guess is that it's happening in amdgpu. It seems a
> sched_job lives inside an amdgpu_job. Can the latter be freed at other
> places than free_job()?

Nope, except for error handling during creation and initialization.

> timedout_job() and free_job() cannot race against each other regarding
> jobs. It's locked.
>
> But maybe investigate Matthew's suggestion and look into the guilty
> mechanism, too.

That looks just like a leftover from earlier attempts to fix the same
problem. I mean, look at the git history of how often that problem came
up...

Regards,
Christian.

>>>>>>> At line 1269, the run_job work item is interrupted. Timed-out jobs run,
>>>>>>> call free_job, which performs the final put. Then the run_job work item
>>>>>>> resumes - and boom, UAF. Using the same reasoning, I think moving free_job
>>>>>>> to the timed-out work queue could also cause issues.
>>>>>>>
>>>>>>> If the run_job work item took a reference to the job before adding it to the
>>>>>>> pending list and dropped it after it was done touching it in this
>>>>>>> function, then yes, that would be safe. This is an argument for moving
>>>>>>> reference counting into the base DRM scheduler class, it would make
>>>>>>
>>>>>> typo: s/DRM scheduler class/DRM job class
>>>>>
>>>>> That strongly sounds like re-inventing the scheduler fence.
>>>>>
>>>>
>>>> Perhaps.
>>>>
>>>>> What if we completely drop the job object? Or merge it into the scheduler
>>>>> fence?
>>>>>
>>>>> The fence has reference counting, proper state transitions and a well
>>>>> defined lifetime.
>>>>>
>>>>> We would just need ->schedule and ->finished functions instead of
>>>>> ->run_job and ->free_job. Those callbacks would then still be called by
>>>>> the scheduler in work item context instead of the irq context of the
>>>>> dma_fence callbacks.
>>>>
>>>> Yes, definitely no IRQ contexts.
>>>>
>>>>>
>>>>> The job can then be a void* in the scheduler fence where drivers can put
>>>>> anything they want and also drivers control the lifetime of that. E.g.
>>>>> they can free it during ->schedule as well as during ->finished.
>>>>>
>>>>
>>>> I think this is a reasonable idea, but it would require major surgery
>>>> across the subsystem plus the 11 upstream drivers I'm counting that use
>>>> the DRM scheduler. This would be a huge coordinated effort.
>>>>
>>>> So I see three options:
>>>>
>>>> 1. Rename free_job to put_job and document usage. Rip out free_guilty.
>>>> Likely the easiest and least invasive.
>>>
>>> I think I can live with that. It's work-effort-wise the best we can do
>>> right now. However, that does need proper documentation.
>>
>> I think that is the worst of all options.
>>
>> It basically says to the driver that the job lifetime problems created by
>> the scheduler are the driver's problem and need to be worked around there.
>>
>
> My POV still mostly is that (with the current design) the driver must
> not use jobs after free_job() was invoked. And when that happens is
> unpredictable.
>
> To be fair, we already have strange refcount expectations.
> But I sort of agree that there is no objectively good solution in
> sight.
>
>>>
>>> Let me respin my documentation series and upstream that soonish,
>>> then we can build on top of that.
>
>>>
>>>
>>> P.
>>>
>>>>
>>>> 2. Move reference counting to the base DRM scheduler job object, provide a
>>>> vfunc for the final job put, and document usage. Medium invasive.
>>
>> I strongly think that reference counting the job object just because the
>> scheduler needs it is a bad idea.
>>
>> With that we are just moving the hot potato from one side of our mouth to
>> the other without really solving the issue.
>>
>> If a driver like XE needs that for some reason, then that is perfectly fine.
>
> Nouveau doesn't need it either.
>
>>
>>>> 3. Move job (driver) side tracking to the scheduler fence and let it
>>>> control the lifetime. Very invasive.
>>
>> Thinking about it more, that is actually not so much of a problem.
>>
>> Let me try to code something together by the end of next week or so.
>
> The hero Gotham needs :)
>
>
> P.
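
For what it's worth, here is a minimal sketch of roughly what option 2 above
could look like. This is illustration only: all names below (refcounted_job,
refcounted_job_get()/refcounted_job_put(), the ->release vfunc) are made up
for the example and are not part of the existing drm_sched API.

#include <linux/container_of.h>
#include <linux/kref.h>
#include <linux/slab.h>

/* Hypothetical job wrapper; not the real struct drm_sched_job. */
struct refcounted_job {
	struct kref refcount;
	/* final-put vfunc supplied by the driver */
	void (*release)(struct refcounted_job *job);
	/* driver payload would follow here */
};

static void refcounted_job_release(struct kref *kref)
{
	struct refcounted_job *job =
		container_of(kref, struct refcounted_job, refcount);

	if (job->release)
		job->release(job);	/* driver frees its embedding object */
	else
		kfree(job);
}

static struct refcounted_job *refcounted_job_alloc(gfp_t gfp)
{
	struct refcounted_job *job = kzalloc(sizeof(*job), gfp);

	if (job)
		kref_init(&job->refcount);	/* refcount starts at 1 */
	return job;
}

static inline void refcounted_job_get(struct refcounted_job *job)
{
	kref_get(&job->refcount);
}

static inline void refcounted_job_put(struct refcounted_job *job)
{
	kref_put(&job->refcount, refcounted_job_release);
}

The idea would be that the run_job work item takes an extra reference before
inserting the job into the pending list and drops it once it has stopped
touching the job, while free_job()/put_job() becomes just another
refcounted_job_put(). Whichever side runs last then performs the actual free,
so neither the timeout path nor the run_job path can pull the memory out from
under the other.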
