On Tue, Dec 09, 2025 at 03:28:49PM +0100, Philipp Stanner wrote:
> On Tue, 2025-12-09 at 15:19 +0100, Christian König wrote:
> > On 12/9/25 14:51, Philipp Stanner wrote:
> > ...
> > > > > > > > > > How can free_job_work, through 
> > > > > > > > > > drm_sched_get_finished_job(), get and
> > > > > > > > > > free the same job?
> > > > > > > > > > 
> > > > > > > > > 
> > > > > > > > > It can't.
> > > > > > > 
> > > > > > > But exactly that happens somehow. Don't ask me how, I have no 
> > > > > > > idea.
> > > > > 
> > > > > *Philipp refuses to elaborate and asks Christian*
> > > > > 
> > > > > How are you so sure that's what's happening? Anyway, assuming it
> > > > > is true:
> > > > 
> > > > [  489.134585] 
> > > > ==================================================================
> > > > [  489.141949] BUG: KASAN: slab-use-after-free in 
> > > > amdgpu_device_gpu_recover+0x968/0x990 [amdgpu]
> > > > [  489.151339] Read of size 4 at addr ffff88a0d5f4214c by task 
> > > > kworker/u128:0/12
> > > > [  489.158686] 
> > > > [  489.160277] CPU: 11 UID: 0 PID: 12 Comm: kworker/u128:0 Tainted: G   
> > > >          E       6.16.0-1289896.3.zuul.0ec208edc00d48a9bae1719675cb777f 
> > > > #1 PREEMPT(voluntary) 
> > > > [  489.160285] Tainted: [E]=UNSIGNED_MODULE
> > > > [  489.160288] Hardware name: TYAN B8021G88V2HR-2T/S8021GM2NR-2T, BIOS 
> > > > V1.03.B10 04/01/2019
> > > > [  489.160292] Workqueue: amdgpu-reset-dev drm_sched_job_timedout 
> > > > [gpu_sched]
> > > > [  489.160306] Call Trace:
> > > > [  489.160308]  <TASK>
> > > > [  489.160311]  dump_stack_lvl+0x64/0x80
> > > > [  489.160321]  print_report+0xce/0x630
> > > > [  489.160328]  ? _raw_spin_lock_irqsave+0x86/0xd0
> > > > [  489.160333]  ? __pfx__raw_spin_lock_irqsave+0x10/0x10
> > > > [  489.160337]  ? amdgpu_device_gpu_recover+0x968/0x990 [amdgpu]
> > > > [  489.161044]  kasan_report+0xb8/0xf0
> > > > [  489.161049]  ? amdgpu_device_gpu_recover+0x968/0x990 [amdgpu]
> > > > [  489.161756]  amdgpu_device_gpu_recover+0x968/0x990 [amdgpu]
> > > > [  489.162464]  ? __pfx_amdgpu_device_gpu_recover+0x10/0x10 [amdgpu]
> > > > [  489.163170]  ? amdgpu_coredump+0x1fd/0x4c0 [amdgpu]
> > > > [  489.163904]  amdgpu_job_timedout+0x642/0x1400 [amdgpu]
> > > > [  489.164698]  ? __pfx__raw_spin_lock+0x10/0x10
> > > > [  489.164703]  ? __pfx_amdgpu_job_timedout+0x10/0x10 [amdgpu]
> > > > [  489.165496]  ? _raw_spin_lock+0x75/0xc0
> > > > [  489.165499]  ? __pfx__raw_spin_lock+0x10/0x10
> > > > [  489.165503]  drm_sched_job_timedout+0x1b0/0x4b0 [gpu_sched]
> > > 
> > > That doesn't show that it's free_job() who freed the memory.
> > 
> > [  489.405936] Freed by task 2501:
> > [  489.409175]  kasan_save_stack+0x20/0x40
> > [  489.413122]  kasan_save_track+0x14/0x30
> > [  489.417064]  kasan_save_free_info+0x3b/0x60
> > [  489.421355]  __kasan_slab_free+0x37/0x50
> > [  489.425384]  kfree+0x1fe/0x3f0
> > [  489.428547]  drm_sched_free_job_work+0x50e/0x930 [gpu_sched]
> > [  489.434326]  process_one_work+0x679/0xff0
> 
> The time stamp shows that this free here took place after the UAF
> occurred :D
> 
> 
> >  
> > > @Vitaly: Can you reproduce the bug? If yes, adding debug prints
> > > printing the jobs' addresses when allocated and when freed in
> > > free_job() could be a solution.
> > 
> > We can reproduce this pretty reliably in our CI now.
> > 
> > > I repeat, we need more info :)
> > > 
> > > > 
> > > > > 
> > > > > > > 
> > > > > > > My educated guess is that the job somehow ends up on the pending 
> > > > > > > list again.
> > > > > 
> > > > > then the obvious question would be: does amdgpu touch the pending_list
> > > > > itself, or does it only ever modify it through proper scheduler APIs?
> > > > 
> > > > My educated guess is that drm_sched_stop() inserted the job back into 
> > > > the pending list, but I still have no idea how it is possible that 
> > > > free_job is running after the scheduler is stopped.

I believe I found your problem. I'm referencing amdgpu/amdgpu_device.c here:

6718                 if (job)
6719                         ti = amdgpu_vm_get_task_info_pasid(adev, job->pasid);

Which is after:

6695 skip_hw_reset:
6696         r = amdgpu_device_sched_resume(&device_list, reset_context, job_signaled);
6697         if (r)
6698                 goto reset_unlock;

The job is likely added back onto the scheduler's pending list (the list
free_job pulls from) here:

6676         amdgpu_device_halt_activities(adev, job, reset_context, &device_list,
6677                                       hive, need_emergency_restart);

So free_job runs first, and the 'job->pasid' dereference explodes.
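
In other words, as I read the flow in amdgpu_device_gpu_recover() (rough
ordering sketch only, pieced together from the excerpts above; the exact
context is in amdgpu_device.c):

	/* 1) halt: drm_sched_stop() can put the bad job back onto the
	 *    scheduler's pending list */
	amdgpu_device_halt_activities(adev, job, reset_context, &device_list,
				      hive, need_emergency_restart);
	...
	/* 2) resume: the schedulers restart, so drm_sched_free_job_work()
	 *    can pick the job up from the pending list and free it */
	r = amdgpu_device_sched_resume(&device_list, reset_context, job_signaled);
	...
	/* 3) the job pointer is dereferenced again -> use-after-free */
	if (job)
		ti = amdgpu_vm_get_task_info_pasid(adev, job->pasid);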

Save off the pasid on the stack at the top of this function and I
suspect your UAF goes away. This won't untangle this hairball of code,
but I believe it at least prevents the explosions.
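
A minimal sketch of that idea (illustrative only; the surrounding
function signature, the elided locals, and the 'pasid'/'had_job' names
are my assumption, the real context is in amdgpu_device.c):

	int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
				      struct amdgpu_job *job,
				      struct amdgpu_reset_context *reset_context)
	{
		u32 pasid = 0;
		bool had_job = false;

		/* snapshot while the job is guaranteed to still be alive */
		if (job) {
			pasid = job->pasid;
			had_job = true;
		}

		...

		/* after amdgpu_device_sched_resume() the job may already have
		 * been freed by drm_sched_free_job_work(), so don't touch the
		 * job pointer here anymore; use the snapshot instead */
		if (had_job)
			ti = amdgpu_vm_get_task_info_pasid(adev, pasid);
		...
	}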

Matt 

> > > > 
> > > 
> > > And my uneducated guess is that it's happening in amdgpu. It seems a
> > > sched_job lives inside an amdgpu_job. Can the latter be freed anywhere
> > > other than in free_job()?
> > 
> > Nope, except for error handling during creation and initialization.
> > 
> > > timedout_job() and free_job() cannot race against each other regarding
> > > jobs. It's locked.
> > > 
> > > But maybe investigate Matthew's suggestion and look into the guilty
> > > mechanism, too.
> > 
> > That looks just like a leftover from earlier attempts to fix the same 
> > problem.
> > 
> > I mean look at the git history of how often that problem came up...
> 
> If that's the case, then we don't want to add yet another solution to a
> problem we don't fully understand and which, apparently, only occurs in
> amdgpu today.
> 
> What we need is an analysis of what's happening. Only then can we
> decide what to do.
> 
> Just switching the workqueues without such good justification receives
> a NACK from me, also because of the unforeseeable consequences:
> free_job() is invoked extremely frequently, timedout_job() very rarely.
> Drivers will not expect their timeout_wq to be flooded with so many
> work items. That could very well change behavior, cause performance
> regressions and so on.
> 
> 
> P.
> 
> 
> 
