amdgpu: Fix bugs in amdgpu_device_gpu_recover in XGMI case.

Kuehling, Felix Fri, 30 Aug 2019 13:09:09 -0700

On 2019-08-30 12:39 p.m., Andrey Grodzovsky wrote:
> Issue 1:
> In  XGMI case amdgpu_device_lock_adev for other devices in hive
> was called to late, after access to their repsective schedulers.
> So relocate the lock to the begining of accessing the other devs.
>
> Issue 2:
> Using amdgpu_device_ip_need_full_reset to switch the device list from
> all devices in hive to the single 'master' device who owns this reset
> call is wrong because when stopping schedulers we iterate all the devices
> in hive but when restarting we will only reactivate the 'master' device.
> Also, in case amdgpu_device_pre_asic_reset conlcudes that full reset IS
> needed we then have to stop schedulers for all devices in hive and not
> only the 'master' but with amdgpu_device_ip_need_full_reset  we
> already missed the opprotunity do to so. So just remove this logic and
> always stop and start all schedulers for all devices in hive.
>
> Also minor cleanup and print fix.
>
> Signed-off-by: Andrey Grodzovsky <andrey.grodzov...@amd.com>


Minor nit-pick inline. With that fixed this patch is Acked-by: Felix 
Kuehling <felix.kuehl...@amd.com>


> ---
>   drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 25 +++++++++++--------------
>   1 file changed, 11 insertions(+), 14 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c 
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> index a5daccc..19f6624 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> @@ -3814,15 +3814,16 @@ int amdgpu_device_gpu_recover(struct amdgpu_device 
> *adev,
>               device_list_handle = &device_list;
>       }
>   
> -     /*
> -      * Mark these ASICs to be reseted as untracked first
> -      * And add them back after reset completed
> -      */
> -     list_for_each_entry(tmp_adev, device_list_handle, gmc.xgmi.head)
> -             amdgpu_unregister_gpu_instance(tmp_adev);
> -
>       /* block all schedulers and reset given job's ring */
>       list_for_each_entry(tmp_adev, device_list_handle, gmc.xgmi.head) {
> +             if (tmp_adev != adev)
> +                     amdgpu_device_lock_adev(tmp_adev, false);
> +             /*
> +              * Mark these ASICs to be reseted as untracked first
> +              * And add them back after reset completed
> +              */
> +             amdgpu_unregister_gpu_instance(tmp_adev);
> +
>               /* disable ras on ALL IPs */
>               if (amdgpu_device_ip_need_full_reset(tmp_adev))
>                       amdgpu_ras_suspend(tmp_adev);
> @@ -3848,9 +3849,6 @@ int amdgpu_device_gpu_recover(struct amdgpu_device 
> *adev,
>           dma_fence_is_signaled(job->base.s_fence->parent))
>               job_signaled = true;
>   
> -     if (!amdgpu_device_ip_need_full_reset(adev))
> -             device_list_handle = &device_list;
> -
>       if (job_signaled) {
>               dev_info(adev->dev, "Guilty job already signaled, skipping HW 
> reset");
>               goto skip_hw_reset;
> @@ -3869,10 +3867,9 @@ int amdgpu_device_gpu_recover(struct amdgpu_device 
> *adev,
>   retry:      /* Rest of adevs pre asic reset from XGMI hive. */
>       list_for_each_entry(tmp_adev, device_list_handle, gmc.xgmi.head) {
>   
> -             if (tmp_adev == adev)
> +             if(tmp_adev == adev)

The space before ( was correct coding style. This will trigger a 
checkpatch error or warning.


>                       continue;
>   
> -             amdgpu_device_lock_adev(tmp_adev, false);
>               r = amdgpu_device_pre_asic_reset(tmp_adev,
>                                                NULL,
>                                                &need_full_reset);
> @@ -3921,10 +3918,10 @@ int amdgpu_device_gpu_recover(struct amdgpu_device 
> *adev,
>   
>               if (r) {
>                       /* bad news, how to tell it to userspace ? */
> -                     dev_info(tmp_adev->dev, "GPU reset(%d) failed\n", 
> atomic_read(&adev->gpu_reset_counter));
> +                     dev_info(tmp_adev->dev, "GPU reset(%d) failed\n", 
> atomic_read(&tmp_adev->gpu_reset_counter));
>                       amdgpu_vf_error_put(tmp_adev, 
> AMDGIM_ERROR_VF_GPU_RESET_FAIL, 0, r);
>               } else {
> -                     dev_info(tmp_adev->dev, "GPU reset(%d) succeeded!\n", 
> atomic_read(&adev->gpu_reset_counter));
> +                     dev_info(tmp_adev->dev, "GPU reset(%d) succeeded!\n", 
> atomic_read(&tmp_adev->gpu_reset_counter));
>               }
>   
>               amdgpu_device_unlock_adev(tmp_adev);
_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

Re: [PATCH v3 1/3] drm/amdgpu: Fix bugs in amdgpu_device_gpu_recover in XGMI case.

Reply via email to