On Tue, 2025-12-09 at 15:19 +0100, Christian König wrote:
> On 12/9/25 14:51, Philipp Stanner wrote:
> ...
> > > > > > > > > How can free_job_work, through drm_sched_get_finished_job(), 
> > > > > > > > > get and
> > > > > > > > > free the same job?
> > > > > > > > > 
> > > > > > > > 
> > > > > > > > It can't.
> > > > > > 
> > > > > > But exactly that happens somehow. Don't ask me how, I have no idea.
> > > > 
> > > > *Philipp refuses to elaborate and asks Christian*
> > > > 
> > > > How are you so sure that's what's happening? Anyways, assuming it
> > > > is true:
> > > 
> > > [  489.134585] 
> > > ==================================================================
> > > [  489.141949] BUG: KASAN: slab-use-after-free in 
> > > amdgpu_device_gpu_recover+0x968/0x990 [amdgpu]
> > > [  489.151339] Read of size 4 at addr ffff88a0d5f4214c by task 
> > > kworker/u128:0/12
> > > [  489.158686] 
> > > [  489.160277] CPU: 11 UID: 0 PID: 12 Comm: kworker/u128:0 Tainted: G     
> > >        E       6.16.0-1289896.3.zuul.0ec208edc00d48a9bae1719675cb777f #1 
> > > PREEMPT(voluntary) 
> > > [  489.160285] Tainted: [E]=UNSIGNED_MODULE
> > > [  489.160288] Hardware name: TYAN B8021G88V2HR-2T/S8021GM2NR-2T, BIOS 
> > > V1.03.B10 04/01/2019
> > > [  489.160292] Workqueue: amdgpu-reset-dev drm_sched_job_timedout 
> > > [gpu_sched]
> > > [  489.160306] Call Trace:
> > > [  489.160308]  <TASK>
> > > [  489.160311]  dump_stack_lvl+0x64/0x80
> > > [  489.160321]  print_report+0xce/0x630
> > > [  489.160328]  ? _raw_spin_lock_irqsave+0x86/0xd0
> > > [  489.160333]  ? __pfx__raw_spin_lock_irqsave+0x10/0x10
> > > [  489.160337]  ? amdgpu_device_gpu_recover+0x968/0x990 [amdgpu]
> > > [  489.161044]  kasan_report+0xb8/0xf0
> > > [  489.161049]  ? amdgpu_device_gpu_recover+0x968/0x990 [amdgpu]
> > > [  489.161756]  amdgpu_device_gpu_recover+0x968/0x990 [amdgpu]
> > > [  489.162464]  ? __pfx_amdgpu_device_gpu_recover+0x10/0x10 [amdgpu]
> > > [  489.163170]  ? amdgpu_coredump+0x1fd/0x4c0 [amdgpu]
> > > [  489.163904]  amdgpu_job_timedout+0x642/0x1400 [amdgpu]
> > > [  489.164698]  ? __pfx__raw_spin_lock+0x10/0x10
> > > [  489.164703]  ? __pfx_amdgpu_job_timedout+0x10/0x10 [amdgpu]
> > > [  489.165496]  ? _raw_spin_lock+0x75/0xc0
> > > [  489.165499]  ? __pfx__raw_spin_lock+0x10/0x10
> > > [  489.165503]  drm_sched_job_timedout+0x1b0/0x4b0 [gpu_sched]
> > 
> > That doesn't show that it's free_job() who freed the memory.
> 
> [  489.405936] Freed by task 2501:
> [  489.409175]  kasan_save_stack+0x20/0x40
> [  489.413122]  kasan_save_track+0x14/0x30
> [  489.417064]  kasan_save_free_info+0x3b/0x60
> [  489.421355]  __kasan_slab_free+0x37/0x50
> [  489.425384]  kfree+0x1fe/0x3f0
> [  489.428547]  drm_sched_free_job_work+0x50e/0x930 [gpu_sched]
> [  489.434326]  process_one_work+0x679/0xff0

The timestamps show that this free took place after the UAF
occurred :D


>  
> > @Vitaly: Can you reproduce the bug? If yes, adding debug prints
> > printing the jobs' addresses when allocated and when freed in
> > free_job() could be a solution.
> 
> We can reproduce this pretty reliably in our CI now.
> 
> > I repeat, we need more info :)
> > 
> > > 
> > > > 
> > > > > > 
> > > > > > My educated guess is that the job somehow ends up on the pending 
> > > > > > list again.
> > > > 
> > > > then the obvious question would be: does amdgpu touch the pending_list
> > > > itself, or does it only ever modify it through proper scheduler APIs?
> > > 
> > > My educated guess is that drm_sched_stop() inserted the job back into the 
> > > pending list, but I still have no idea how it is possible that free_job 
> > > is running after the scheduler is stopped.
> > > 
> > 
> > And my uneducated guess is that it's happening in amdgpu. It seems a
> > sched_job lives inside an amdgpu_job. Can the latter be freed at other
> > places than free_job()?
> 
> Nope, except for error handling during creation and initialization.
> 
> > timedout_job() and free_job() cannot race against each other regarding
> > jobs. It's locked.
> > 
> > But maybe investigate Matthew's suggestion and look into the guilty
> > mechanism, too.
> 
> That looks just like a leftover from earlier attempts to fix the same problem.
> 
> I mean look at the git history of how often that problem came up...

If that's the case, then we don't want to add yet another solution to
a problem we don't fully understand and which, apparently, only occurs
in amdgpu today.

What we need is an analysis of what's happening. Only then can we
decide what to do.
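
To make the debug-print suggestion quoted above concrete, here is an
untested sketch; the hook points, the variable names (job / s_job) and
the format strings are illustrative only. amdgpu's job allocation path
and its free_job() callback would be the natural spots:

/*
 * Sketch: in the driver's job allocation path, log the address of the
 * embedded drm_sched_job ...
 */
pr_info("job-trace: alloc %p\n", &job->base);

/*
 * ... and again in the free_job() callback, right before the backing
 * memory is released, so both can be correlated with the address in
 * the KASAN report.
 */
pr_info("job-trace: free  %p (%s)\n", s_job, s_job->sched->name);

With the CI reproducer that should tell us right away whether the
address from the report is handed to free_job() once or twice; a
dump_stack() next to the prints would then show from where.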

Just switching the workqueues without such a justification receives a
NACK from me, also because of the unforeseeable consequences:
free_job() is invoked extremely frequently, timedout_job() very rarely.
Drivers do not expect their timeout_wq to be flooded with that many
work items. That could very well change behavior, cause performance
regressions and so on.
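
To spell out what such a switch means in scheduler terms (a sketch
based on how I read the current struct drm_gpu_scheduler fields; exact
names may differ between versions):

/* Today the free_job work fires once per completed job ... */
queue_work(sched->submit_wq, &sched->work_free_job);

/* ... while timeout_wq only ever gets the (rare) TDR work from us. */
queue_delayed_work(sched->timeout_wq, &sched->work_tdr, sched->timeout);

/*
 * After the switch, every completed job would also land on timeout_wq,
 * which drivers size and order for the exceptional reset path, not for
 * per-job traffic.
 */
queue_work(sched->timeout_wq, &sched->work_free_job);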


P.


