> On Jun 16, 2021, at 12:36, Kuehling, Felix <felix.kuehl...@amd.com> wrote:
> 
> On 2021-06-16 at 12:01 a.m., Pan, Xinhui wrote:
>>> On Jun 16, 2021, at 02:22, Kuehling, Felix <felix.kuehl...@amd.com> wrote:
>>> 
>>> [+Xinhui]
>>> 
>>> 
>>> On 2021-06-15 at 1:50 p.m., Amber Lin wrote:
>>>> Calling free_mqd inside destroy_queue_nocpsch_locked can cause a
>>>> circular lock dependency. destroy_queue_nocpsch_locked is called under
>>>> the DQM lock, which is taken in MMU notifiers, potentially in FS
>>>> reclaim context. Taking another lock, the BO reservation lock taken by
>>>> free_mqd, which can itself trigger FS reclaim, while holding the DQM
>>>> lock creates a problematic circular lock dependency. Therefore move
>>>> free_mqd out of destroy_queue_nocpsch_locked and call it after
>>>> unlocking DQM.
>>>> 
>>>> Signed-off-by: Amber Lin <amber....@amd.com>
>>>> Reviewed-by: Felix Kuehling <felix.kuehl...@amd.com>
>>> Let's submit this patch as is. I'm making some comments inline for
>>> things that Xinhui can address in his race condition patch.
>>> 
>>> 
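(For anyone skimming the thread: as I read the commit message, the lock
dependency being broken is roughly

    FS reclaim -> MMU notifier -> dqm_lock
    dqm_lock -> BO reservation lock in free_mqd -> FS reclaim

so free_mqd has to run only after dqm_unlock.)
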
>>>> ---
>>>> .../drm/amd/amdkfd/kfd_device_queue_manager.c  | 18 +++++++++++++-----
>>>> 1 file changed, 13 insertions(+), 5 deletions(-)
>>>> 
>>>> diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.c b/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.c
>>>> index 72bea5278add..c069fa259b30 100644
>>>> --- a/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.c
>>>> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.c
>>>> @@ -486,9 +486,6 @@ static int destroy_queue_nocpsch_locked(struct device_queue_manager *dqm,
>>>>    if (retval == -ETIME)
>>>>            qpd->reset_wavefronts = true;
>>>> 
>>>> -
>>>> -  mqd_mgr->free_mqd(mqd_mgr, q->mqd, q->mqd_mem_obj);
>>>> -
>>>>    list_del(&q->list);
>>>>    if (list_empty(&qpd->queues_list)) {
>>>>            if (qpd->reset_wavefronts) {
>>>> @@ -523,6 +520,8 @@ static int destroy_queue_nocpsch(struct device_queue_manager *dqm,
>>>>    int retval;
>>>>    uint64_t sdma_val = 0;
>>>>    struct kfd_process_device *pdd = qpd_to_pdd(qpd);
>>>> +  struct mqd_manager *mqd_mgr =
>>>> +          dqm->mqd_mgrs[get_mqd_type_from_queue_type(q->properties.type)];
>>>> 
>>>>    /* Get the SDMA queue stats */
>>>>    if ((q->properties.type == KFD_QUEUE_TYPE_SDMA) ||
>>>> @@ -540,6 +539,8 @@ static int destroy_queue_nocpsch(struct device_queue_manager *dqm,
>>>>            pdd->sdma_past_activity_counter += sdma_val;
>>>>    dqm_unlock(dqm);
>>>> 
>>>> +  mqd_mgr->free_mqd(mqd_mgr, q->mqd, q->mqd_mem_obj);
>>>> +
>>>>    return retval;
>>>> }
>>>> 
>>>> @@ -1629,7 +1630,7 @@ static bool set_cache_memory_policy(struct device_queue_manager *dqm,
>>>> static int process_termination_nocpsch(struct device_queue_manager *dqm,
>>>>            struct qcm_process_device *qpd)
>>>> {
>>>> -  struct queue *q, *next;
>>>> +  struct queue *q;
>>>>    struct device_process_node *cur, *next_dpn;
>>>>    int retval = 0;
>>>>    bool found = false;
>>>> @@ -1637,12 +1638,19 @@ static int process_termination_nocpsch(struct device_queue_manager *dqm,
>>>>    dqm_lock(dqm);
>>>> 
>>>>    /* Clear all user mode queues */
>>>> -  list_for_each_entry_safe(q, next, &qpd->queues_list, list) {
>>>> +  while (!list_empty(&qpd->queues_list)) {
>>>> +          struct mqd_manager *mqd_mgr;
>>>>            int ret;
>>>> 
>>>> +          q = list_first_entry(&qpd->queues_list, struct queue, list);
>>>> +          mqd_mgr = dqm->mqd_mgrs[get_mqd_type_from_queue_type(
>>>> +                          q->properties.type)];
>>>>            ret = destroy_queue_nocpsch_locked(dqm, qpd, q);
>>>>            if (ret)
>>>>                    retval = ret;
>>>> +          dqm_unlock(dqm);
>>>> +          mqd_mgr->free_mqd(mqd_mgr, q->mqd, q->mqd_mem_obj);
>>>> +          dqm_lock(dqm);
>>> This is the correct way to clean up the list when dropping the dqm-lock
>>> in the middle. Xinhui, you can use the same method in
>>> process_termination_cpsch.
>>> 
>> Yes, that is the right way to walk through the list. Thanks.
>> 
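For reference, the same shape applied to process_termination_cpsch would
look roughly like the sketch below. It only shows where the
unlock/free_mqd/relock sits; the real per-queue teardown in the cpsch path
does more than the single list_del shown here:

    struct queue *q;
    struct mqd_manager *mqd_mgr;

    while (!list_empty(&qpd->queues_list)) {
        q = list_first_entry(&qpd->queues_list, struct queue, list);
        mqd_mgr = dqm->mqd_mgrs[get_mqd_type_from_queue_type(
                        q->properties.type)];

        /* per-queue teardown under dqm_lock, ending with removal
         * from qpd->queues_list */
        list_del(&q->list);

        /* free_mqd reserves a BO, so call it with dqm unlocked */
        dqm_unlock(dqm);
        mqd_mgr->free_mqd(mqd_mgr, q->mqd, q->mqd_mem_obj);
        dqm_lock(dqm);
    }
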
>> 
>>> I believe swapping q->mqd with a temporary variable is not
>>> needed. When free_mqd is called, the queue is no longer on the
>>> qpd->queues_list, so destroy_queue cannot race with it. If we ensure
>>> that queues are always removed from the list before calling free_mqd,
>>> and that list-removal happens under the dqm_lock, then there should be
>>> no risk of a race condition that causes a double-free.
>>> 
>> No, the double free exists because pqm_destroy_queue fetches the queue
>> from its qid with get_queue_by_qid(). The race looks like this:
>>
>> pqm_destroy_queue                      process_termination_cpsch
>>     get_queue_by_qid
>>     destroy_queue_cpsch
>>                                            lock
>>                                            list_for_each_entry_safe
>>                                                list_del(q)
>>                                            unlock
>>                                            free_mqd
>>     lock
>>     list_del(q)
>>     unlock
>>     free_mqd
> 
> I think if both those threads try to free the same queue, they both need
> to hold the same process->mutex. For pqm_destroy_queue that happens in
> kfd_ioctl_destroy_queue. For process_termination_cpsch that happens in
> kfd_process_notifier_release before it calls
> kfd_process_dequeue_from_all_devices.
Oh, yes, you are right.
So the double free I am seeing has a different root cause. :(
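
For my own notes, the serialization you describe is (simplified call
chains, not verbatim code):

    /* ioctl path */
    kfd_ioctl_destroy_queue()
        mutex_lock(&process->mutex);
        pqm_destroy_queue(...);
        mutex_unlock(&process->mutex);

    /* process teardown path */
    kfd_process_notifier_release()
        mutex_lock(&process->mutex);
        kfd_process_dequeue_from_all_devices();  /* -> process_termination_cpsch */
        mutex_unlock(&process->mutex);

so the two free_mqd calls in the diagram above cannot actually overlap on
the same queue.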

> 
> Regards,
>   Felix
> 
> 
>> 
>>> Regards,
>>>  Felix
>>> 
>>> 
>>>>    }
>>>> 
>>>>    /* Unregister process */
