On 12/10/25 14:06, Philipp Stanner wrote:
> On Wed, 2025-12-10 at 13:47 +0100, Christian König wrote:
>> On 12/10/25 10:58, Philipp Stanner wrote:
>>> On Tue, 2025-12-09 at 17:58 -0800, Matthew Brost wrote:
>>>> On Tue, Dec 09, 2025 at 03:19:40PM +0100, Christian König wrote:
>> ..
>>>>>>> My educated guess is that drm_sched_stop() inserted the job back into 
>>>>>>> the pending list, but I still have no idea how it is possible that 
>>>>>>> free_job is running after the scheduler is stopped.
>>>>
>>>> I believe I found your problem, referencing amdgpu/amdgpu_device.c here.
>>>>
>>>> 6718                 if (job)
>>>> 6719                         ti = amdgpu_vm_get_task_info_pasid(adev, 
>>>> job->pasid);
>>
>> WTF! There it is! Thanks a lot for pointing that out!
> 
> scripts/decode_stacktrace.sh should be able to find the exact location
> of a UAF. Requires manual debugging with the kernel build tree at
> hands, though. So that's difficult in CIs.

The debugging info was actually pointing to the return of the function. My 
guess is that the compiler simply optimized something away.
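For reference, the decode_stacktrace.sh flow mentioned above looks roughly like this (file names are illustrative; the script needs the build tree's vmlinux matching the running kernel):

```shell
# From the top of the kernel build tree. The script reads the raw
# Oops/UAF trace on stdin and resolves symbol+offset entries to
# source file and line.
./scripts/decode_stacktrace.sh vmlinux < oops.txt
```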

>>> It also wasn't documented for a long time that drm_sched (through
>>> spsc_queue) will explode if you don't use entities with a single
>>> producer thread.
>>
>> That is actually documented, but not on the scheduler but rather the 
>> dma_fence.
>>
>> And that you can only have a single producer is a requirement inherited from 
>> the dma_fence and not scheduler specific at all.
> 
> What does dma_fence have to do with it? It's about the spsc_queue being
> racy like mad. You can access and modify dma_fences in parallel
> however you want – they are refcounted and locked.

The problem is that the driver needs to guarantee that drm_sched_job_arm() and 
drm_sched_entity_push_job() can only be called by a single producer.

Otherwise you violate the ordering rules of the underlying dma_fence.

That is completely independent of the spsc queue and comes even before it 
enters the picture.

Regards,
Christian.

> 
> 
> P.
