On 12/10/25 14:06, Philipp Stanner wrote:
> On Wed, 2025-12-10 at 13:47 +0100, Christian König wrote:
>> On 12/10/25 10:58, Philipp Stanner wrote:
>>> On Tue, 2025-12-09 at 17:58 -0800, Matthew Brost wrote:
>>>> On Tue, Dec 09, 2025 at 03:19:40PM +0100, Christian König wrote:
>> ..
>>>>>>> My educated guess is that drm_sched_stop() inserted the job back into
>>>>>>> the pending list, but I still have no idea how it is possible that
>>>>>>> free_job is running after the scheduler is stopped.
>>>>
>>>> I believe I found your problem, referencing amdgpu/amdgpu_device.c here.
>>>>
>>>> 6718         if (job)
>>>> 6719                 ti = amdgpu_vm_get_task_info_pasid(adev, job->pasid);
>>
>> WTF! There it is! Thanks a lot for pointing that out!
>
> scripts/decode_stacktrace.sh should be able to find the exact location
> of a UAF. Requires manual debugging with the kernel build tree at
> hands, though. So that's difficult in CIs.
The debugging info was actually pointing to the return of the function.
My guess is that the compiler just optimized something away.

>>> It also wasn't documented for a long time that drm_sched (through
>>> spsc_queue) will explode if you don't use entities with a single
>>> producer thread.
>>
>> That is actually documented, but not on the scheduler but rather the
>> dma_fence.
>>
>> And that you can only have a single producer is a requirement inherited
>> from the dma_fence and not scheduler specific at all.
>
> What does dma_fence have to do with it? It's about the spsc_queue being
> racy like mad. You can access and modify dma_fence's in parallel
> however you want – they are refcounted and locked.

The problem is that the driver needs to guarantee that drm_sched_job_arm()
and drm_sched_entity_push_job() can only be called by a single producer.
Otherwise you violate the ordering rules of the underlying dma_fence.

That requirement is completely independent and applies even before the
spsc queue comes into the picture.

Regards,
Christian.

>
>
> P.
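To make the single-producer rule above concrete, here is a minimal sketch of
a driver submit path (foo_ctx, foo_submit and submit_lock are hypothetical
names, not taken from any real driver) that serializes drm_sched_job_arm()
and drm_sched_entity_push_job() per entity, so finished fences are armed and
pushed in the same order:

#include <linux/mutex.h>
#include <drm/gpu_scheduler.h>

/*
 * Hypothetical driver-side context, names made up purely for illustration.
 * The job is assumed to have been initialized against ctx->entity with
 * drm_sched_job_init() earlier in the submit path.
 */
struct foo_ctx {
        struct drm_sched_entity entity;
        struct mutex submit_lock;       /* one producer per entity */
};

static void foo_submit(struct foo_ctx *ctx, struct drm_sched_job *job)
{
        /*
         * drm_sched_job_arm() creates the job's finished fence (and with
         * it the fence seqno); drm_sched_entity_push_job() queues the job
         * on the entity. Keeping both under one lock ensures jobs are
         * pushed in the same order their seqnos were allocated, which is
         * the dma_fence ordering rule described above.
         */
        mutex_lock(&ctx->submit_lock);
        drm_sched_job_arm(job);
        drm_sched_entity_push_job(job);
        mutex_unlock(&ctx->submit_lock);
}

Any other scheme works as well, as long as no two threads can race between
the arm and the push for the same entity.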
