>>> We explicitly added the asynchronously IB test for SRIOV to make driver 
>>> load faster. Why is that now a problem?
Well I didn't notice this change explicitly for SRIOV, at least from the code 
it is quite a normal change 


>>> And why would it help when the VM shuts down? We cancel/flush the test 
>>> during driver unload/suspend as well.
If you do the sequence like: load amdgpu and shutdown VM immediately , you will 
hit problems .

Virtual Machine's shutdown won't be blocked if the IB test is on the fly , 
e.g.: you can do "init 0" right after "modprobe amdgpu" and the 
"amdgpu_device_fini" won't be called 
Thus the flushing on IB test won't happen so the IB test is still running with 
the VM already shutdown and lead to GPU crash (usually VCN/VCE crash on invalid 
memory address)

 
_____________________________________
Monk Liu|GPU Virtualization Team |AMD


-----Original Message-----
From: Christian König <ckoenig.leichtzumer...@gmail.com> 
Sent: Monday, June 29, 2020 4:18 PM
To: Liu, Monk <monk....@amd.com>; amd-gfx@lists.freedesktop.org
Subject: Re: [PATCH] drm/amdgpu: make IB test synchronize with init for SRIOV

Am 29.06.20 um 09:11 schrieb Monk Liu:
> From: pengzhou <pengju.z...@amd.com>
>
> issue:
> originally we kickoff IB test asynchronously with driver's init, thus 
> the IB test may still running when the driver loading done (modprobe amdgpu 
> done).
> if we shutdown VM immediately after amdgpu driver loaded then GPU may 
> hang because the IB test is still running
>
> fix:
> make IB test synchronize with driver init thus it won't still running 
> when we shutdown the VM.

We explicitly added the asynchronously IB test for SRIOV to make driver load 
faster. Why is that now a problem?

And why would it help when the VM shuts down? We cancel/flush the test during 
driver unload/suspend as well.

>
> Signed-off-by: Monk Liu <monk....@amd.com>
> ---
>   drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 29 
> ++++++++++++++++++++++++-----
>   1 file changed, 24 insertions(+), 5 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c 
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> index 457f5d2..4f54660 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> @@ -3292,8 +3292,16 @@ int amdgpu_device_init(struct amdgpu_device *adev,
>       /* must succeed. */
>       amdgpu_ras_resume(adev);
>   
> -     queue_delayed_work(system_wq, &adev->delayed_init_work,
> +     if (amdgpu_sriov_vf(adev)) {
> +             r = amdgpu_ib_ring_tests(adev);
> +             if (r) {
> +                     DRM_ERROR("ib ring test failed (%d).\n", r);
> +                     return r;
> +             }
> +     } else {
> +             queue_delayed_work(system_wq, &adev->delayed_init_work,
>                          msecs_to_jiffies(AMDGPU_RESUME_MS));
> +     }
>   
>       r = sysfs_create_files(&adev->dev->kobj, amdgpu_dev_attributes);
>       if (r) {
> @@ -3329,7 +3337,8 @@ void amdgpu_device_fini(struct amdgpu_device *adev)
>       int r;
>   
>       DRM_INFO("amdgpu: finishing device.\n");
> -     flush_delayed_work(&adev->delayed_init_work);
> +     if (!amdgpu_sriov_vf(adev))
> +             flush_delayed_work(&adev->delayed_init_work);

You can drop this change, flushing a work which was never scheduled is harmless.

>       adev->shutdown = true;
>   
>       /* make sure IB test finished before entering exclusive mode @@ 
> -3425,7 +3434,8 @@ int amdgpu_device_suspend(struct drm_device *dev, bool 
> fbcon)
>       if (fbcon)
>               amdgpu_fbdev_set_suspend(adev, 1);
>   
> -     cancel_delayed_work_sync(&adev->delayed_init_work);
> +     if (!amdgpu_sriov_vf(adev))
> +             cancel_delayed_work_sync(&adev->delayed_init_work);
>   
>       if (!amdgpu_device_has_dc_support(adev)) {
>               /* turn off display hw */
> @@ -3528,8 +3538,16 @@ int amdgpu_device_resume(struct drm_device *dev, bool 
> fbcon)
>       if (r)
>               return r;
>   
> -     queue_delayed_work(system_wq, &adev->delayed_init_work,
> +     if (amdgpu_sriov_vf(adev)) {
> +             r = amdgpu_ib_ring_tests(adev);
> +             if (r) {
> +                     DRM_ERROR("ib ring test failed (%d).\n", r);
> +                     return r;
> +             }
> +     } else {
> +             queue_delayed_work(system_wq, &adev->delayed_init_work,
>                          msecs_to_jiffies(AMDGPU_RESUME_MS));
> +     }
>   
>       if (!amdgpu_device_has_dc_support(adev)) {
>               /* pin cursors */
> @@ -3554,7 +3572,8 @@ int amdgpu_device_resume(struct drm_device *dev, bool 
> fbcon)
>               return r;
>   
>       /* Make sure IB tests flushed */
> -     flush_delayed_work(&adev->delayed_init_work);
> +     if (!amdgpu_sriov_vf(adev))
> +             flush_delayed_work(&adev->delayed_init_work);
>   
>       /* blat the mode back in */
>       if (fbcon) {

_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

Reply via email to