On 2025-01-31 11:58, Xiaogang.Chen wrote:
> From: Xiaogang Chen <xiaogang.c...@amd.com>
>
> When register a vm range at svm the added vm range may be split into multiple
> subranges and/or existing pranges got spitted. The new pranges need validated
> and mapped. This patch changes error handling for pranges that fail updating:

It may help if you clearly state the problem you're trying to solve to justify 
the changes in error handling. See more comments inline.


>
> 1: free prange resources and remove it from svms if its updating fails as it
> will not be used.
> 2: return -EAGAIN when all pranges at update_list need redo valid/map,
> otherwise return no -EAGAIN error to user space to indicate failure. That
> removes unnecessary retries.
>
> Signed-off-by: Xiaogang Chen <xiaogang.c...@amd.com>
> ---
>  drivers/gpu/drm/amd/amdkfd/kfd_svm.c | 27 +++++++++++++++++++++++----
>  1 file changed, 23 insertions(+), 4 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_svm.c 
> b/drivers/gpu/drm/amd/amdkfd/kfd_svm.c
> index e32e19196f6b..455cb98bf16a 100644
> --- a/drivers/gpu/drm/amd/amdkfd/kfd_svm.c
> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_svm.c
> @@ -3716,8 +3716,19 @@ svm_range_set_attr(struct kfd_process *p, struct 
> mm_struct *mm,
>  
>  out_unlock_range:
>               mutex_unlock(&prange->migrate_mutex);
> -             if (r)
> -                     ret = r;
> +             /* this prange cannot be migraed, valid or map */
> +             if (r) {
> +                     /* free this prange resources, remove it from svms */
> +                     svm_range_unlink(prange);
> +                     svm_range_remove_notifier(prange);
> +                     svm_range_free(prange, false);

Freeing the prange removes SVM mappings from the process. This will break the 
subsequent execution of the application. In case you were going to return 
-EAGAIN that's definitely wrong because the application would expect the SVM 
range to work after a successful retry.

I'm not sure the resource waste is a valid argument in case of a fatal error. I 
would expect the application to terminate anyways in this case, which would 
result in freeing the resources. Do you see a scenario where an application 
wants to continue running after this function returned a fatal error?


> +
> +                     /* ret got update when any r != -EAGAIN;
> +                      * return -EAGAIN when all pranges at update_list
> +                      * need redo valid/map */
> +                     if (r != -EAGAIN || !ret)
> +                             ret = r;

This is a good point. But the explanation is a bit misleading: "all pranges ... 
need redo" is not really true. There may also be ranges that were validated 
successfully. I think the point you're trying to make is this: Don't return 
-EAGAIN if there was any previous fatal error found.

I could potentially see a different optimization. If you encounter a fatal 
error, you can skip the rest of the ranges and return the error immediately.


> +             }
>       }
>  
>       list_for_each_entry(prange, &remap_list, update_list) {
> @@ -3729,8 +3740,16 @@ svm_range_set_attr(struct kfd_process *p, struct 
> mm_struct *mm,
>               if (r)
>                       pr_debug("failed %d on remap svm range\n", r);
>               mutex_unlock(&prange->migrate_mutex);
> -             if (r)
> -                     ret = r;
> +
> +             if (r) {
> +                     /* remove this prange */
> +                     svm_range_unlink(prange);
> +                     svm_range_remove_notifier(prange);
> +                     svm_range_free(prange, false);

Same as above.

Regards,
  Felix


> +
> +                     if (r != -EAGAIN || !ret)
> +                             ret = r;
> +             }
>       }
>  
>       dynamic_svm_range_dump(svms);

Reply via email to