Re: [PATCH] drm/amdkfd: Change error handling at prange update in svm_range_set_attr

Chen, Xiaogang Tue, 04 Mar 2025 10:23:23 -0800


On 3/3/2025 11:21 PM, Felix Kuehling wrote:

On 2025-01-31 11:58, Xiaogang.Chen wrote:

From: Xiaogang Chen<xiaogang.c...@amd.com>


When registering a vm range with svm, the added vm range may be split into
multiple subranges and/or existing pranges may get split. The new pranges need
to be validated and mapped. This patch changes error handling for pranges that
fail updating:

It may help if you clearly state the problem you're trying to solve to justify 
the changes in error handling. See more comments inline.

The current approach returns the error code of the last sub range that hit an
issue during migration, validation or mapping. If the last error is -EAGAIN
but other sub ranges in the middle returned other error codes, we still
return -EAGAIN. That causes the same procedure to be repeated until a sub
range with another error code becomes the last one.

I noticed it when looking at the registration of a large range (more than
100GB), which was split into multiple sub ranges. There were multiple
unnecessary repeats until it hit a return code that is not -EAGAIN.

As you said, we may return immediately when we hit a non -EAGAIN error and
hope the app terminates. But if the app does not terminate, the kfd driver
will hold unused pranges until the app stops.

1: free prange resources and remove it from svms if its updating fails as it
will not be used.
2: return -EAGAIN when all pranges at update_list need redo valid/map,
otherwise return no -EAGAIN error to user space to indicate failure. That
removes unnecessary retries.

Signed-off-by: Xiaogang Chen<xiaogang.c...@amd.com>
---
  drivers/gpu/drm/amd/amdkfd/kfd_svm.c | 27 +++++++++++++++++++++++----
  1 file changed, 23 insertions(+), 4 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_svm.c b/drivers/gpu/drm/amd/amdkfd/kfd_svm.c
index e32e19196f6b..455cb98bf16a 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_svm.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_svm.c
@@ -3716,8 +3716,19 @@ svm_range_set_attr(struct kfd_process *p, struct mm_struct *mm,

out_unlock_range:

                mutex_unlock(&prange->migrate_mutex);
-               if (r)
-                       ret = r;
+               /* this prange cannot be migrated, validated or mapped */
+               if (r) {
+                       /* free this prange resources, remove it from svms */
+                       svm_range_unlink(prange);
+                       svm_range_remove_notifier(prange);
+                       svm_range_free(prange, false);

Freeing the prange removes SVM mappings from the process. This will break the 
subsequent execution of the application. In case you were going to return 
-EAGAIN that's definitely wrong because the application would expect the SVM 
range to work after a successful retry.

When -EAGAIN is returned, the app will do the whole range registration again,
including rebuilding the sub ranges. At this stage we do not know whether
subsequent sub ranges will succeed or fail. So I release the current sub
range's resources if it got an error (including -EAGAIN). After processing
all sub ranges, if we decide to have the app do it again, the redo procedure
will rebuild the released sub ranges.

I'm not sure the resource waste is a valid argument in case of a fatal error. I 
would expect the application to terminate anyways in this case, which would 
result in freeing the resources. Do you see a scenario where an application 
wants to continue running after this function returned a fatal error?

I made a test app to check the behavior of registering a large range while
debugging a real issue. I do not know whether a real app will continue to run
when it hits a non -EAGAIN error code. The purpose here is to make the driver
handle this case more generally.

+
+                       /* ret got update when any r != -EAGAIN;
+                        * return -EAGAIN when all pranges at update_list
+                        * need redo valid/map */
+                       if (r != -EAGAIN || !ret)
+                               ret = r;

This is a good point. But the explanation is a bit misleading: "all pranges ... need 
redo" is not really true. There may also be ranges that were validated successfully. 
I think the point you're trying to make is this: Don't return -EAGAIN if there was any 
previous fatal error found.

ok
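
For illustration, the difference between the old rule (if (r) ret = r;) and
the patched rule (if (r != -EAGAIN || !ret) ret = r;) can be sketched as a
standalone user-space program. This is only a minimal sketch, not the driver
code: the helper names aggregate_last_error() and aggregate_prefer_fatal()
and the array of per-range results are assumptions made up for this example.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/*
 * Old rule: every non-zero per-range result overwrites ret, so the
 * last failing range decides what user space sees.
 * (Hypothetical helper; results[] stands in for the per-prange r values.)
 */
static int aggregate_last_error(const int *results, size_t n)
{
	int ret = 0;

	assert(results != NULL || n == 0);
	for (size_t i = 0; i < n; i++) {
		int r = results[i];

		if (r)
			ret = r;
	}
	return ret;
}

/*
 * Patched rule: -EAGAIN is only recorded while no error has been
 * recorded yet, so it never hides an earlier non -EAGAIN error.
 * A later non -EAGAIN error still overwrites an earlier one.
 */
static int aggregate_prefer_fatal(const int *results, size_t n)
{
	int ret = 0;

	assert(results != NULL || n == 0);
	for (size_t i = 0; i < n; i++) {
		int r = results[i];

		if (r && (r != -EAGAIN || !ret))
			ret = r;
	}
	return ret;
}
```

With per-range results { -EAGAIN, -ENOMEM, -EAGAIN } the old rule yields
-EAGAIN and the patched rule yields -ENOMEM, so user space is not told to
retry a registration that has already hit a fatal error.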

I could potentially see a different optimization. If you encounter a fatal 
error, you can skip the rest of the ranges and return the error immediately.

As said above, returning immediately when we hit a non -EAGAIN error is
another way. But should the kfd driver release unused prange resources anyway?
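
The early-return alternative can be sketched in the same user-space style.
Again this is a sketch under assumptions, not driver code:
first_fatal_or_eagain() and its "examined" out-parameter are hypothetical
names, and the array stands in for the per-range results that the driver
would compute one range at a time.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/*
 * Stop at the first non -EAGAIN error and report it; otherwise report
 * -EAGAIN if any range asked for a retry, or 0 if all ranges succeeded.
 * *examined is set to the number of ranges looked at before returning.
 */
static int first_fatal_or_eagain(const int *results, size_t n,
				 size_t *examined)
{
	int ret = 0;
	size_t i;

	assert(examined != NULL);
	assert(results != NULL || n == 0);

	for (i = 0; i < n; i++) {
		int r = results[i];

		if (!r)
			continue;

		ret = r;
		if (r != -EAGAIN) {
			/* fatal error: count this range, skip the rest */
			i++;
			break;
		}
	}

	*examined = i;
	return ret;
}
```

In this sketch the ranges after the first fatal error are never examined, so
whether their resources get released is a separate decision from which error
code is returned.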


Regards

Xiaogang

+               }
        }

list_for_each_entry(prange, &remap_list, update_list) {

@@ -3729,8 +3740,16 @@ svm_range_set_attr(struct kfd_process *p, struct mm_struct *mm,
                if (r)
                        pr_debug("failed %d on remap svm range\n", r);
                mutex_unlock(&prange->migrate_mutex);
-               if (r)
-                       ret = r;
+
+               if (r) {
+                       /* remove this prange */
+                       svm_range_unlink(prange);
+                       svm_range_remove_notifier(prange);
+                       svm_range_free(prange, false);

Same as above.

Regards,
   Felix

+
+                       if (r != -EAGAIN || !ret)
+                               ret = r;
+               }
        }

dynamic_svm_range_dump(svms);
