On 12/9/25 14:51, Philipp Stanner wrote:
...
>>>>>>>> How can free_job_work, through drm_sched_get_finished_job(), get and
>>>>>>>> free the same job?
>>>>>>>>
>>>>>>>
>>>>>>> It can't.
>>>>>
>>>>> But exactly that happens somehow. Don't ask me how, I have no idea.
>>>
>>> *Philipp refuses to elaborate and asks Christian*
>>>
>>> How are you so sure that's what's happening? Anyways, assuming it
>>> is true:
>>
>> [ 489.134585]
>> ==================================================================
>> [ 489.141949] BUG: KASAN: slab-use-after-free in
>> amdgpu_device_gpu_recover+0x968/0x990 [amdgpu]
>> [ 489.151339] Read of size 4 at addr ffff88a0d5f4214c by task
>> kworker/u128:0/12
>> [ 489.158686]
>> [ 489.160277] CPU: 11 UID: 0 PID: 12 Comm: kworker/u128:0 Tainted: G
>> E 6.16.0-1289896.3.zuul.0ec208edc00d48a9bae1719675cb777f #1
>> PREEMPT(voluntary)
>> [ 489.160285] Tainted: [E]=UNSIGNED_MODULE
>> [ 489.160288] Hardware name: TYAN B8021G88V2HR-2T/S8021GM2NR-2T, BIOS
>> V1.03.B10 04/01/2019
>> [ 489.160292] Workqueue: amdgpu-reset-dev drm_sched_job_timedout [gpu_sched]
>> [ 489.160306] Call Trace:
>> [ 489.160308] <TASK>
>> [ 489.160311] dump_stack_lvl+0x64/0x80
>> [ 489.160321] print_report+0xce/0x630
>> [ 489.160328] ? _raw_spin_lock_irqsave+0x86/0xd0
>> [ 489.160333] ? __pfx__raw_spin_lock_irqsave+0x10/0x10
>> [ 489.160337] ? amdgpu_device_gpu_recover+0x968/0x990 [amdgpu]
>> [ 489.161044] kasan_report+0xb8/0xf0
>> [ 489.161049] ? amdgpu_device_gpu_recover+0x968/0x990 [amdgpu]
>> [ 489.161756] amdgpu_device_gpu_recover+0x968/0x990 [amdgpu]
>> [ 489.162464] ? __pfx_amdgpu_device_gpu_recover+0x10/0x10 [amdgpu]
>> [ 489.163170] ? amdgpu_coredump+0x1fd/0x4c0 [amdgpu]
>> [ 489.163904] amdgpu_job_timedout+0x642/0x1400 [amdgpu]
>> [ 489.164698] ? __pfx__raw_spin_lock+0x10/0x10
>> [ 489.164703] ? __pfx_amdgpu_job_timedout+0x10/0x10 [amdgpu]
>> [ 489.165496] ? _raw_spin_lock+0x75/0xc0
>> [ 489.165499] ? __pfx__raw_spin_lock+0x10/0x10
>> [ 489.165503] drm_sched_job_timedout+0x1b0/0x4b0 [gpu_sched]
>
> That doesn't show that it's free_job() who freed the memory.

[ 489.405936] Freed by task 2501:
[ 489.409175] kasan_save_stack+0x20/0x40
[ 489.413122] kasan_save_track+0x14/0x30
[ 489.417064] kasan_save_free_info+0x3b/0x60
[ 489.421355] __kasan_slab_free+0x37/0x50
[ 489.425384] kfree+0x1fe/0x3f0
[ 489.428547] drm_sched_free_job_work+0x50e/0x930 [gpu_sched]
[ 489.434326] process_one_work+0x679/0xff0

> @Vitaly: Can you reproduce the bug? If yes, adding debug prints
> printing the jobs' addresses when allocated and when freed in
> free_job() could be a solution.

We can reproduce this pretty reliably in our CI now.

> I repeat, we need more info :)
>
>>
>>>
>>>>>
>>>>> My educated guess is that the job somehow ends up on the pending list
>>>>> again.
>>>
>>> then the obvious question would be: does amdgpu touch the pending_list
>>> itself, or does it only ever modify it through proper scheduler APIs?
>>
>> My educated guess is that drm_sched_stop() inserted the job back into the
>> pending list, but I still have no idea how it is possible that free_job is
>> running after the scheduler is stopped.
>>
>
> And my uneducated guess is that it's happening in amdgpu. It seems a
> sched_job lives inside an amdgpu_job. Can the latter be freed at other
> places than free_job()?

Nope, except for error handling during creation and initialization.

> timedout_job() and free_job() cannot race against each other regarding
> jobs. It's locked.
>
> But maybe investigate Matthew's suggestion and look into the guilty
> mechanism, too.

That looks just like a leftover from earlier attempts to fix the same
problem. I mean, look at the git history of how often that problem came
up...

Regards,
Christian.

>>>>>>> At line 1269, the run_job work item is interrupted. Timed-out jobs run,
>>>>>>> call free_job, which performs the final put. Then the run_job work item
>>>>>>> resumes - and boom, UAF. Using the same reasoning, I think moving free_job
>>>>>>> to the timed-out work queue could also cause issues.
>>>>>>>
>>>>>>> If the run_job work item took a reference to the job before adding it to the
>>>>>>> pending list and dropped it after it was done touching it in this
>>>>>>> function, then yes, that would be safe. This is an argument for moving
>>>>>>> reference counting into the base DRM scheduler class, it would make
>>>>>>
>>>>>> typo: s/DRM scheduler class/DRM job class
>>>>>
>>>>> That strongly sounds like re-inventing the scheduler fence.
>>>>>
>>>>
>>>> Perhaps.
>>>>
>>>>> What if we completely drop the job object? Or merge it into the scheduler
>>>>> fence?
>>>>>
>>>>> The fence has reference counting, proper state transitions and a well
>>>>> defined lifetime.
>>>>>
>>>>> We would just need ->schedule and ->finished functions instead of
>>>>> ->run_job and ->free_job. Those callbacks would then still be called by
>>>>> the scheduler in work item context instead of the irq context of the
>>>>> dma_fence callbacks.
>>>>
>>>> Yes, definitely no IRQ contexts.
>>>>
>>>>>
>>>>> The job can then be a void* in the scheduler fence where drivers can put
>>>>> anything they want and also drivers control the lifetime of that. E.g.
>>>>> they can free it during ->schedule as well as during ->finished.
>>>>>
>>>>
>>>> I think this is a reasonable idea, but it would require major surgery
>>>> across the subsystem plus the 11 upstream drivers I'm counting that use
>>>> the DRM scheduler. This would be a huge coordinated effort.
>>>>
>>>> So I see three options:
>>>>
>>>> 1. Rename free_job to put_job and document usage. Rip out free_guilty.
>>>> Likely the easiest and least invasive.
>>>
>>> I think I can live with that. It's work-effort-wise the best we can do
>>> right now. However, that does need proper documentation.
>>
>> I think that is the worst of all options.
>>
>> It basically says to the driver that the job lifetime problems created by
>> the scheduler are the driver's problem and need to be worked around there.
>>
>
> My POV still mostly is that (with the current design) the driver must
> not use jobs after free_job() was invoked. And when that happens is
> unpredictable.
>
> To be fair, we already have strange refcount expectations.
> But I sort of agree that there is no objectively good solution in
> sight.
>
>>>
>>> Let me respin my documentation series and upstream that soonish,
>>> then we can build on top of that.
>
>>>
>>>
>>> P.
>>>
>>>>
>>>> 2. Move reference counting to the base DRM scheduler job object, provide a
>>>> vfunc for the final job put, and document usage. Medium invasive.
>>
>> I strongly think that reference counting the job object just because the
>> scheduler needs it is a bad idea.
>>
>> With that we are just moving the hot potato from one side of our mouth to
>> the other without really solving the issue.
>>
>> If a driver like XE needs that for some reason, then that is perfectly fine.
>
> Nouveau doesn't need it either.
>
>>
>>>> 3. Move job (driver) side tracking to the scheduler fence and let it
>>>> control the lifetime. Very invasive.
>>
>> Thinking about it more, that is actually not so much of a problem.
>>
>> Let me try to code something together by the end of next week or so.
>
> The hero Gotham needs :)
>
>
> P.
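
For what it's worth, here is a minimal sketch of roughly what option 2 above
could look like. This is illustration only: all names below (refcounted_job,
refcounted_job_get()/refcounted_job_put(), the ->release vfunc) are made up
for the example and are not part of the existing drm_sched API.

#include <linux/container_of.h>
#include <linux/kref.h>
#include <linux/slab.h>

/* Hypothetical job wrapper; not the real struct drm_sched_job. */
struct refcounted_job {
	struct kref refcount;
	/* final-put vfunc supplied by the driver */
	void (*release)(struct refcounted_job *job);
	/* driver payload would follow here */
};

static void refcounted_job_release(struct kref *kref)
{
	struct refcounted_job *job =
		container_of(kref, struct refcounted_job, refcount);

	if (job->release)
		job->release(job);	/* driver frees its embedding object */
	else
		kfree(job);
}

static struct refcounted_job *refcounted_job_alloc(gfp_t gfp)
{
	struct refcounted_job *job = kzalloc(sizeof(*job), gfp);

	if (job)
		kref_init(&job->refcount);	/* refcount starts at 1 */
	return job;
}

static inline void refcounted_job_get(struct refcounted_job *job)
{
	kref_get(&job->refcount);
}

static inline void refcounted_job_put(struct refcounted_job *job)
{
	kref_put(&job->refcount, refcounted_job_release);
}

The idea would be that the run_job work item takes an extra reference before
inserting the job into the pending list and drops it once it has stopped
touching the job, while free_job()/put_job() becomes just another
refcounted_job_put(). Whichever side runs last then performs the actual free,
so neither the timeout path nor the run_job path can pull the memory out from
under the other.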
