On 2021-09-29 7:32 p.m., Yu, Lang wrote:
[AMD Official Use Only]



-----Original Message-----
From: Kuehling, Felix <felix.kuehl...@amd.com>
Sent: Wednesday, September 29, 2021 11:25 PM
To: Yu, Lang <lang...@amd.com>; amd-gfx@lists.freedesktop.org
Cc: Deucher, Alexander <alexander.deuc...@amd.com>; Huang, Ray
<ray.hu...@amd.com>
Subject: Re: [PATCH] drm/amdkfd: fix a potential cu_mask memory leak

Am 2021-09-29 um 4:22 a.m. schrieb Lang Yu:
If user doesn't explicitly call kfd_ioctl_destroy_queue to destroy all
created queues, when the kfd process is destroyed, some queues'
cu_mask memory are not freed.

To avoid forgetting to free them in some places, free them immediately
after use.

Signed-off-by: Lang Yu <lang...@amd.com>
---
  drivers/gpu/drm/amd/amdkfd/kfd_chardev.c               |  8 ++++----
  drivers/gpu/drm/amd/amdkfd/kfd_process_queue_manager.c | 10
++++------
  2 files changed, 8 insertions(+), 10 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_chardev.c
b/drivers/gpu/drm/amd/amdkfd/kfd_chardev.c
index 4de907f3e66a..5c0e6dcf692a 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_chardev.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_chardev.c
@@ -451,8 +451,8 @@ static int kfd_ioctl_set_cu_mask(struct file *filp, struct
kfd_process *p,
        retval = copy_from_user(properties.cu_mask, cu_mask_ptr,
cu_mask_size);
        if (retval) {
                pr_debug("Could not copy CU mask from userspace");
-               kfree(properties.cu_mask);
-               return -EFAULT;
+               retval = -EFAULT;
+               goto out;
        }

        mutex_lock(&p->mutex);
@@ -461,8 +461,8 @@ static int kfd_ioctl_set_cu_mask(struct file
*filp, struct kfd_process *p,

        mutex_unlock(&p->mutex);

-       if (retval)
-               kfree(properties.cu_mask);
+out:
+       kfree(properties.cu_mask);

        return retval;
  }
diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_process_queue_manager.c
b/drivers/gpu/drm/amd/amdkfd/kfd_process_queue_manager.c
index 243dd1efcdbf..4c81d690f31a 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_process_queue_manager.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_process_queue_manager.c
@@ -394,8 +394,6 @@ int pqm_destroy_queue(struct
process_queue_manager *pqm, unsigned int qid)
                        pdd->qpd.num_gws = 0;
                }

-               kfree(pqn->q->properties.cu_mask);
-               pqn->q->properties.cu_mask = NULL;
                uninit_queue(pqn->q);
        }

@@ -448,16 +446,16 @@ int pqm_set_cu_mask(struct
process_queue_manager *pqm, unsigned int qid,
                return -EFAULT;
        }

-       /* Free the old CU mask memory if it is already allocated, then
-        * allocate memory for the new CU mask.
-        */
-       kfree(pqn->q->properties.cu_mask);
+       WARN_ON_ONCE(pqn->q->properties.cu_mask);

        pqn->q->properties.cu_mask_count = p->cu_mask_count;
        pqn->q->properties.cu_mask = p->cu_mask;

        retval = pqn->q->device->dqm->ops.update_queue(pqn->q->device->dqm,
                                                        pqn->q);
+
+       pqn->q->properties.cu_mask = NULL;
+
This won't work correctly. We need to save the cu_mask for later.
Otherwise the next time dqm->ops.update_queue is called, for example in
pqm_update_queue or pqm_set_gws, it will wipe out the CU mask in the MQD.
Let's just return when meeting a null cu_mask in update_cu_mask() to avoid that.
Like following,

static void update_cu_mask(struct mqd_manager *mm, void *mqd,
                           struct queue_properties *q)
{
        struct v10_compute_mqd *m;
        uint32_t se_mask[4] = {0}; /* 4 is the max # of SEs */

        if (!q-> cu_mask || q->cu_mask_count == 0)
                return;
        ......
}

Is this fine with you? Thanks!

I think that could work. I still don't like it. It leaves the CU mask in the q->properties structure, but it's only ever used temporarily and doesn't need to be persistent. I'd argue, in this case, the cu_mask shouldn't be in the q->properties structure at all, but should be passed as an optional parameter into the dqm->ops.update_queue call.

But I think a simpler fix would be to move the freeing of the CU mask into uninit_queue. That would catch all cases where a queue gets destroyed, including the process termination case.

Regards,
  Felix



Regards,
Lang
Regards,
   Felix


        if (retval != 0)
                return retval;

Reply via email to