amdgpu: Move amdgpu_ras_recovery_init to after SMU ready.

Chen, Guchun Tue, 22 Oct 2019 18:24:22 -0700

Hi Andrey,

No, I don't want to see amdgpu_ras_recovery_init being called at different 
places.
If calling amdgpu_ras_recovery_init as much early as we can is not doable, due 
to arcturus i2c code ready timeline, I am fine with this patch.


Regards,
Guchun

-----Original Message-----
From: Grodzovsky, Andrey <andrey.grodzov...@amd.com> 
Sent: Tuesday, October 22, 2019 10:33 PM
To: Chen, Guchun <guchun.c...@amd.com>; amd-gfx@lists.freedesktop.org
Cc: Zhou1, Tao <tao.zh...@amd.com>; Deucher, Alexander 
<alexander.deuc...@amd.com>; noreply-conflue...@amd.com; Quan, Evan 
<evan.q...@amd.com>
Subject: Re: [PATCH 4/4] drm/amdgpu: Move amdgpu_ras_recovery_init to after SMU 
ready.

I am well aware that we want to do it ASAP, but what is your suggestion for the 
Arcturus use case where we have to do it AFTER SMU is up and running ? Do you 
want to call amdgpu_ras_recovery_init in two different places depending if this 
is Vega 20 or Arcturus ? This will over complicate an already complicated init 
sequence of RAS.

Andrey

On 10/20/19 9:51 PM, Chen, Guchun wrote:
> I don't think postpone RAS recovery init is not one reasonable proposal. What 
> we do in recovery init is to read the retired page if there is, and retire 
> corresponding memory, this will make sure these pages are reserved well 
> before setting up memory manager and reserving other memory blocks, it will 
> be safe.
>
> Regards,
> Guchun
>
> -----Original Message-----
> From: Andrey Grodzovsky <andrey.grodzov...@amd.com>
> Sent: Saturday, October 19, 2019 4:49 AM
> To: amd-gfx@lists.freedesktop.org
> Cc: Chen, Guchun <guchun.c...@amd.com>; Zhou1, Tao 
> <tao.zh...@amd.com>; Deucher, Alexander <alexander.deuc...@amd.com>; 
> noreply-conflue...@amd.com; Quan, Evan <evan.q...@amd.com>; 
> Grodzovsky, Andrey <andrey.grodzov...@amd.com>
> Subject: [PATCH 4/4] drm/amdgpu: Move amdgpu_ras_recovery_init to after SMU 
> ready.
>
> For Arcturus the I2C traffic is done through SMU tables and so we must 
> postpone RAS recovery init to after they are ready which is in 
> amdgpu_device_ip_hw_init_phase2.
>
> Signed-off-by: Andrey Grodzovsky <andrey.grodzov...@amd.com>
> ---
>   drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 13 +++++++++++++
>   drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c    | 11 -----------
>   2 files changed, 13 insertions(+), 11 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c 
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> index 17cfdaf..c40e9a5 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> @@ -1850,6 +1850,19 @@ static int amdgpu_device_ip_init(struct amdgpu_device 
> *adev)
>       if (r)
>               goto init_failed;
>   
> +     /*
> +      * retired pages will be loaded from eeprom and reserved here,
> +      * it should be called after amdgpu_device_ip_hw_init_phase2  since
> +      * for some ASICs the RAS EEPROM code relies on SMU fully functioning
> +      * for I2C communication which only true at this point.
> +      * recovery_init may fail, but it can free all resources allocated by
> +      * itself and its failure should not stop amdgpu init process.
> +      *
> +      * Note: theoretically, this should be called before all vram 
> allocations
> +      * to protect retired page from abusing
> +      */
> +     amdgpu_ras_recovery_init(adev);
> +
>       if (adev->gmc.xgmi.num_physical_nodes > 1)
>               amdgpu_xgmi_add_device(adev);
>       amdgpu_amdkfd_device_init(adev);
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c 
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
> index 2e85a51..1045c3f 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
> @@ -1721,17 +1721,6 @@ int amdgpu_ttm_init(struct amdgpu_device *adev)  
> #endif
>   
>       /*
> -      * retired pages will be loaded from eeprom and reserved here,
> -      * it should be called after ttm init since new bo may be created,
> -      * recovery_init may fail, but it can free all resources allocated by
> -      * itself and its failure should not stop amdgpu init process.
> -      *
> -      * Note: theoretically, this should be called before all vram 
> allocations
> -      * to protect retired page from abusing
> -      */
> -     amdgpu_ras_recovery_init(adev);
> -
> -     /*
>        *The reserved vram for firmware must be pinned to the specified
>        *place on the VRAM, so reserve it early.
>        */
> --
> 2.7.4
>
_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

RE: [PATCH 4/4] drm/amdgpu: Move amdgpu_ras_recovery_init to after SMU ready.

Reply via email to