userq: unlock cancel_delayed_work_sync for hang_detect_work

Christian König Fri, 13 Mar 2026 05:21:15 -0700

On 3/12/26 12:32, Khatri, Sunil wrote:
> 
> On 12-03-2026 02:50 pm, Christian König wrote:
>> On 3/12/26 10:16, Sunil Khatri wrote:
>>> cancel_delayed_work_sync for work hand_detect_work should not be
>>> locked since the amdgpu_userq_hang_detect_work also need the same
>>> mutex and when they run together it could be a deadlock.
>>>
>>> we do not need to hold the mutex for
>>> cancel_delayed_work_sync(&queue->hang_detect_work). With this in place
>>> if cancel and worker thread run at same time they will not deadlock.
>>>
>>> Due to any failures if there is a hand detect and reset that there a
>>> deadlock scenarios between cancel and running the main thread.
>>>
>>> [ 243.118276] task:kworker/9:0 state:D stack:0 pid:73 tgid:73 ppid:2 
>>> task_flags:0x4208060 flags:0x00080000
>>> [ 243.118283] Workqueue: events amdgpu_userq_hang_detect_work [amdgpu]
>>> [ 243.118636] Call Trace:
>>> [ 243.118639] <TASK>
>>> [ 243.118644] __schedule+0x581/0x1810
>>> [ 243.118649] ? srso_return_thunk+0x5/0x5f
>>> [ 243.118656] ? srso_return_thunk+0x5/0x5f
>>> [ 243.118659] ? wake_up_process+0x15/0x20
>>> [ 243.118665] schedule+0x64/0xe0
>>> [ 243.118668] schedule_preempt_disabled+0x15/0x30
>>> [ 243.118671] __mutex_lock+0x346/0x950
>>> [ 243.118677] __mutex_lock_slowpath+0x13/0x20
>>> [ 243.118681] mutex_lock+0x2c/0x40
>>> [ 243.118684] amdgpu_userq_hang_detect_work+0x63/0x90 [amdgpu]
>>> [ 243.118888] process_scheduled_works+0x1f0/0x450
>>> [ 243.118894] worker_thread+0x27f/0x370
>>> [ 243.118899] kthread+0x1ed/0x210
>>> [ 243.118903] ? __pfx_worker_thread+0x10/0x10
>>> [ 243.118906] ? srso_return_thunk+0x5/0x5f
>>> [ 243.118909] ? __pfx_kthread+0x10/0x10
>>> [ 243.118913] ret_from_fork+0x10f/0x1b0
>>> [ 243.118916] ? __pfx_kthread+0x10/0x10
>>> [ 243.118920] ret_from_fork_asm+0x1a/0x30
>> Good catch, but userq destruction is completely broken in quite a number of 
>> ways.
>>
>> Have you taken a look at my patch "drm/amdgpu: fix eviction fence and userq 
>> manager shutdown"? How does this here interacts with that?
> Yeah even after that patch the below is still open, and a deadlock with 
> amdgpu_userq_hang_detect_work is still possible. So i guess we need this fix 
> still, irrespective of your fixes in drm/amdgpu: fix eviction fence and userq 
> manager shutdown


Make sense, just one minor comment below.

>         if(queue->hang_detect_fence) {
>                 cancel_delayed_work_sync(&queue->hang_detect_work);
>                 queue->hang_detect_fence=NULL;
>         }
> 
> 
> Regards
> Sunil Khatri
> 
>> Thanks,
>> Christian.
>>
>>> Signed-off-by: Sunil Khatri <[email protected]>
>>> ---
>>>  drivers/gpu/drm/amd/amdgpu/amdgpu_userq.c | 9 ++++++++-
>>>  1 file changed, 8 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_userq.c 
>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_userq.c
>>> index 32541f1bde6d..c5875e175918 100644
>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_userq.c
>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_userq.c
>>> @@ -621,15 +621,22 @@ amdgpu_userq_destroy(struct amdgpu_userq_mgr *uq_mgr, 
>>> struct amdgpu_usermode_que
>>>  {
>>>     struct amdgpu_device *adev = uq_mgr->adev;
>>>     int r = 0;
>>> +   bool hang_detect_fence = false;
>>>  
>>>     cancel_delayed_work_sync(&uq_mgr->resume_work);
>>>     mutex_lock(&uq_mgr->userq_mutex);
>>>     amdgpu_userq_wait_for_last_fence(queue);
>>>     /* Cancel any pending hang detection work and cleanup */
>>>     if (queue->hang_detect_fence) {
>>> -           cancel_delayed_work_sync(&queue->hang_detect_work);
>>> +           hang_detect_fence = true;
>>>             queue->hang_detect_fence = NULL;
>>>     }
>>> +   mutex_unlock(&uq_mgr->userq_mutex);
>>> +
>>> +   if (hang_detect_fence)
>>> +           cancel_delayed_work_sync(&queue->hang_detect_work);

I think you can drop the hang_detect_fence check and just always call 
cancel_delayed_work_sync(&queue->hang_detect_work) before taking the lock 
unconditionally.

Tacking, droping and re-taking the lock is ok here but usually points to some 
questionable handling.

Regards,
Christian.

>>> +
>>> +   mutex_lock(&uq_mgr->userq_mutex);
>>>     r = amdgpu_bo_reserve(queue->db_obj.obj, true);
>>>     if (!r) {
>>>             amdgpu_bo_unpin(queue->db_obj.obj);

Re: [PATCH v1] drm/amdgpu/userq: unlock cancel_delayed_work_sync for hang_detect_work

Reply via email to