amdgpu: reduce reset time

Zhao, Victor Wed, 27 Jul 2022 03:35:28 -0700

[AMD Official Use Only - General]

Hi Andrey,


Problem with status.hang is that it is set at 
amdgpu_device_ip_check_soft_reset, which is not implemented in nv or gfx10. 
They have to be nicely implemented first.
Another option I thought is to mark status.hang or add a flag to amdgpu_gfx 
when job timeout reported on gfx/comp ring. And this will require some logic to 
map the relationship for ring and ip blocks. This way does not look good as 
well.


Thanks,
Victor



-----Original Message-----
From: Grodzovsky, Andrey <andrey.grodzov...@amd.com> 
Sent: Wednesday, July 27, 2022 12:57 AM
To: Zhao, Victor <victor.z...@amd.com>; amd-gfx@lists.freedesktop.org
Cc: Deucher, Alexander <alexander.deuc...@amd.com>; Deng, Emily 
<emily.d...@amd.com>; Koenig, Christian <christian.koe...@amd.com>
Subject: Re: [PATCH 5/5] drm/amdgpu: reduce reset time


On 2022-07-26 05:40, Zhao, Victor wrote:
> [AMD Official Use Only - General]
>
> Hi Andrey,
>
> Reply inline.
>
>
> Thanks,
> Victor
>
>
>
> -----Original Message-----
> From: Grodzovsky, Andrey <andrey.grodzov...@amd.com>
> Sent: Tuesday, July 26, 2022 5:18 AM
> To: Zhao, Victor <victor.z...@amd.com>; amd-gfx@lists.freedesktop.org
> Cc: Deucher, Alexander <alexander.deuc...@amd.com>; Deng, Emily 
> <emily.d...@amd.com>; Koenig, Christian <christian.koe...@amd.com>
> Subject: Re: [PATCH 5/5] drm/amdgpu: reduce reset time
>
>
> On 2022-07-22 03:34, Victor Zhao wrote:
>> In multi container use case, reset time is important, so skip ring 
>> tests and cp halt wait during ip suspending for reset as they are 
>> going to fail and cost more time on reset
>
> Why are they failing in this case ? Skipping ring tests is not the best idea 
> as you loose important indicator of system's sanity. Is there any way to make 
> them work ?
>
> [Victor]: I've seen gfx ring test fail every time after a gfx engine hang. I 
> thought it should be expected as gfx is in a bad state. Do you know the 
> reason we have ring tests before reset? As we are going to reset the asic 
> anyway.
> Another approach could be to make the skip mode2 only or reduce the wait time 
> here.


I dug down in history and according to commit 'drm/amdgpu:unmap KCQ in gfx 
hw_fini(v2)' you need to write to scratch register for completion of queue 
unmap operation so you defently don't want to just skip it. I agree in case 
that the ring is hung this has no point but remember that GPU reset can happen 
not only to a hunged ring but for other reasons (RAS, manual reset e.t.c.) in 
which case you probably want to shut down gracefully here ?
I see we have adev->ip_blocks[i].status.hang flag which you maybe can use here 
instead ?


>
>
>> Signed-off-by: Victor Zhao <victor.z...@amd.com>
>> ---
>>    drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c |  2 +-
>>    drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c  | 26 +++++++++++++++----------
>>    2 files changed, 17 insertions(+), 11 deletions(-)
>>
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c
>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c
>> index 222d3d7ea076..f872495ccc3a 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c
>> @@ -477,7 +477,7 @@ int amdgpu_gfx_disable_kcq(struct amdgpu_device *adev)
>>              kiq->pmf->kiq_unmap_queues(kiq_ring, &adev->gfx.compute_ring[i],
>>                                         RESET_QUEUES, 0, 0);
>>    
>> -    if (adev->gfx.kiq.ring.sched.ready)
>> +    if (adev->gfx.kiq.ring.sched.ready && !amdgpu_in_reset(adev))
>>              r = amdgpu_ring_test_helper(kiq_ring);
>>      spin_unlock(&adev->gfx.kiq.ring_lock);
>>    
>> diff --git a/drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c
>> b/drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c
>> index fafbad3cf08d..9ae29023e38f 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c
>> @@ -5971,16 +5971,19 @@ static int gfx_v10_0_cp_gfx_enable(struct 
>> amdgpu_device *adev, bool enable)
>>              WREG32_SOC15(GC, 0, mmCP_ME_CNTL, tmp);
>>      }
>>    
>> -    for (i = 0; i < adev->usec_timeout; i++) {
>> -            if (RREG32_SOC15(GC, 0, mmCP_STAT) == 0)
>> -                    break;
>> -            udelay(1);
>> -    }
>> -
>> -    if (i >= adev->usec_timeout)
>> -            DRM_ERROR("failed to %s cp gfx\n", enable ? "unhalt" : "halt");
>> +    if (!amdgpu_in_reset(adev)) {
>> +            for (i = 0; i < adev->usec_timeout; i++) {
>> +                    if (RREG32_SOC15(GC, 0, mmCP_STAT) == 0)
>> +                            break;
>> +                    udelay(1);
>> +            }
>>    
>> +            if (i >= adev->usec_timeout)
>> +                    DRM_ERROR("failed to %s cp gfx\n",
>> +                              enable ? "unhalt" : "halt");
>> +    }
>>      return 0;
>> +
>>    }
>
> This change has impact beyond container case no ? We had no issue with this 
> code during regular reset cases so why we would give up on this code which 
> confirms CP is idle ? What is the side effect of skipping this during all GPU 
> resets ?
>
> Andrey
>
> [Victor]: I see "failed to halt cp gfx" with regular reset cases as well when 
> doing a gfx hang test using quark. I haven't seen a side effect with Mode1 
> reset yet but maybe shorten the wait time could be better?


Same as above i guess, it would indeed time out for a hung ring but GPU reset 
happens not only because of hung rings but for other reasons.

Andrey


>
>>    
>>    static int gfx_v10_0_cp_gfx_load_pfp_microcode(struct 
>> amdgpu_device
>> *adev) @@ -7569,8 +7572,10 @@ static int gfx_v10_0_kiq_disable_kgq(struct 
>> amdgpu_device *adev)
>>      for (i = 0; i < adev->gfx.num_gfx_rings; i++)
>>              kiq->pmf->kiq_unmap_queues(kiq_ring, &adev->gfx.gfx_ring[i],
>>                                         PREEMPT_QUEUES, 0, 0);
>> -
>> -    return amdgpu_ring_test_helper(kiq_ring);
>> +    if (!amdgpu_in_reset(adev))
>> +            return amdgpu_ring_test_helper(kiq_ring);
>> +    else
>> +            return 0;
>>    }
>>    #endif
>>    
>> @@ -7610,6 +7615,7 @@ static int gfx_v10_0_hw_fini(void *handle)
>>    
>>              return 0;
>>      }
>> +
>>      gfx_v10_0_cp_enable(adev, false);
>>      gfx_v10_0_enable_gui_idle_interrupt(adev, false);
>>

RE: [PATCH 5/5] drm/amdgpu: reduce reset time

Reply via email to