Am 2021-08-05 um 4:51 a.m. schrieb Zhu, Changfeng:
> [AMD Official Use Only]
>
> Hi Felix,
>
> Can we set noretry=1 for dgpu path(ignore_crat=1) which doesn’t to through 
> iommuv2 path on raven as below:

There are other possible reasons than ignore_crat for Raven to work in
dGPU mode (broken CRAT, disabled IOMMU). However, those are not known
until later in the initialization.

Regards,
  Felix


>> +    case CHIP_RAVEN:
>> +            /*
>> +             * TODO: Raven currently can fix most SVM issues with
>> +             * noretry =1. However it has two issues with noretry = 1
>> +             * on kfd migrate tests. It still needs to root causes
>> +             * with these two migrate fails on raven with noretry = 1.
>> +             */
>>              if (amdgpu_noretry == -1) {
>>                      If(ignore_crat)
>>                              gmc->noretry = 1;
>>                      else
>>                              gmc->noretry = 0;
>>              }
>>              else
>>                      gmc->noretry = amdgpu_noretry;
>>              break;
> BR,
> Changfeng.
>
> -----Original Message-----
> From: Kuehling, Felix <felix.kuehl...@amd.com> 
> Sent: Wednesday, July 28, 2021 10:22 PM
> To: Zhu, Changfeng <changfeng....@amd.com>; amd-gfx@lists.freedesktop.org; 
> Huang, Ray <ray.hu...@amd.com>; Zhang, Yifan <yifan1.zh...@amd.com>
> Subject: Re: [PATCH] drm/amdgpu: set default noretry=1 to fix kfd SVM issues 
> for raven
>
> Doesn't this break IOMMUv2? Applications that run using IOMMUv2 for system 
> memory access depend on correct retry handling in the SQ.
> Therefore noretry must be 0 on Raven.
>
> I believe the reason that SVM has trouble with retry enabled is, that
> IOMMUv2 is catching the page faults, so the driver never gets to handle the 
> page fault interrupts. That breaks page-fault based migration in the SVM 
> code. I think the better solution is to disable SVM on APUs where
> IOMMUv2 is enabled.
>
> Alternatively, we could give up on IOMMUv2 entirely and always rely on SVM to 
> provide that functionality. But that requires more changes in the amdgpu_vm 
> code.
>
> Regards,
>   Felix
>
>
> Am 2021-07-28 um 2:36 a.m. schrieb Changfeng:
>> From: changzhu <changfeng....@amd.com>
>>
>> From: Changfeng <changfeng....@amd.com>
>>
>> It can't find any issues with noretry=1 except two SVM migrate issues.
>> Oppositely, it will cause most SVM cases fail with noretry=0.
>> The two SVM migrate issues also happen with noretry=0. So it can set 
>> default noretry=1 for raven firstly to fix most SVM fails.
>>
>> Change-Id: Idb5cb3c1a04104013e4ab8aed2ad4751aaec4bbc
>> Signed-off-by: Changfeng <changfeng....@amd.com>
>> ---
>>  drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c | 15 ++++++++-------
>>  1 file changed, 8 insertions(+), 7 deletions(-)
>>
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c 
>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c
>> index 09edfb64cce0..d7f69dbd48e6 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c
>> @@ -606,19 +606,20 @@ void amdgpu_gmc_noretry_set(struct amdgpu_device *adev)
>>               * noretry = 0 will cause kfd page fault tests fail
>>               * for some ASICs, so set default to 1 for these ASICs.
>>               */
>> +    case CHIP_RAVEN:
>> +            /*
>> +             * TODO: Raven currently can fix most SVM issues with
>> +             * noretry =1. However it has two issues with noretry = 1
>> +             * on kfd migrate tests. It still needs to root causes
>> +             * with these two migrate fails on raven with noretry = 1.
>> +             */
>>              if (amdgpu_noretry == -1)
>>                      gmc->noretry = 1;
>>              else
>>                      gmc->noretry = amdgpu_noretry;
>>              break;
>> -    case CHIP_RAVEN:
>>      default:
>> -            /* Raven currently has issues with noretry
>> -             * regardless of what we decide for other
>> -             * asics, we should leave raven with
>> -             * noretry = 0 until we root cause the
>> -             * issues.
>> -             *
>> +            /*
>>               * default this to 0 for now, but we may want
>>               * to change this in the future for certain
>>               * GPUs as it can increase performance in

Reply via email to