amdgpu: abort fence poll if reset is started

Liu, Shaoyun Mon, 03 Jun 2024 11:28:30 -0700

[AMD Official Use Only - AMD Internal Distribution Only]

Thanks Christian for the detail explanation.
I checked your patch , you try to use query_scheduler_status package  to check 
the command completion . It  may not work as expected since this API query the 
status is for MES itself , so mes can update the fence address with  the 
expected seq value, but the  command  itself (ex .remove_queue for mes and  
then  mes send the  unmap_queue to kiq internally)  still fails.
For mes , driver always poll for the command completion ,  do you think it's an 
acceptable solution that MES set a specific failure value(ex , -1)  in the 
fence address to indicate the failure of the  operation ?  But that should be 
similar to let driver poll the completion till timeout .  MES internally also 
need to wait for a timeout on some  command that it sent  to CP (ex.  2 seconds 
for unmap_queue command).  I'm actually a little bit confused here , has driver 
use the lock to ensure there is only one submission into MES at any time ?
 MES can also trigger the interrupt on the failure if driver side require us to 
do so , the  payload will have the seq number to indicate which submission 
cause the failure , that might requires more code change from   driver side 
.Please let me what's preferred from driver side.

Regards
Shaoyun.liu

-----Original Message-----
From: Koenig, Christian <christian.koe...@amd.com>
Sent: Monday, June 3, 2024 6:59 AM
To: Liu, Shaoyun <shaoyun....@amd.com>; Christian König 
<ckoenig.leichtzumer...@gmail.com>; Li, Yunxiang (Teddy) <yunxiang...@amd.com>; 
amd-gfx@lists.freedesktop.org; Deucher, Alexander <alexander.deuc...@amd.com>; 
Xiao, Hua <hua.x...@amd.com>
Subject: Re: [PATCH v2 03/10] drm/amdgpu: abort fence poll if reset is started

Hi Shaoyun,

yes my thinking goes into the same direction. The basic problem here is that we 
are trying to stuff two different information into the same variable.

The first information is if the commands haven been read by the MES from the 
ring buffer. This information is necessary for the normal ring buffer and reset 
handling, e.g. prevents ring buffer overflow, ordering of command, lockups 
during reset etc...

The second information is if a certain operation was successfully or not. For 
example this is necessary to get signaled back if y queue map/unmap operation 
has been successfully or if the CP not responding or any other error has 
happened etc...

Another issue is that while it is in general a good idea to have the firmware 
work in a way where errors are reported instead of completely stopping all 
processing, here we run into trouble because the driver usually assumes that 
work can be scheduled on the ring buffer and a subsequent work is processed 
only when everything previously submitted has completed successfully.

So as initial fix for the issue we see I've send Alex a patch on Friday to 
partially revert his change to use an individual writeback for each submission. 
Instead we will submit an addition QUERY_STATUS command after the real command 
and let that one write fence value. This way the fence value is always written, 
independent of the result of the operation.

Additional to that we need to insert something like a dependency between 
submissions, e.g. when you have commands A, B and C on the ring and C can only 
execute when A was successfully then we need to somehow tell that the MES. Only 
other alternative is to not scheduler commands behind each other on the ring 
and that in turn is a bad idea from the performance point of view.

Regards,
Christian.

Am 31.05.24 um 16:44 schrieb Liu, Shaoyun:
> [AMD Official Use Only - AMD Internal Distribution Only]
>
> Hi, Christian
>
> I think we have a discussion about this before . Alex also have a change that 
> allow driver to use different write back address for the fence for each 
> submission for the  original issue .
>  From MES  point of view ,  MES will update the fence when the API can be 
> complete successfully, so if the  API (ex . remove_queue) fails  due to  
> other component issue (ex , CP hang), the  MES will not update the fence In 
> this situation , but  MES itself still works and can respond to other 
> commands (ex ,,read_reg)  .  Alex's change allow driver to check the fence 
> for each API without mess around them  .  If you expect MES to stop 
> responding  to further commands  after one API fails , that will introduce 
> combability issue since this design already exist on products for customer 
> and MES also need to works for windows .  Also MES  always need to respond to 
>  some commands like  RESET  etc  that might make things worse if we need to 
> change the logic .
>
> One possible solution is MES can  trigger an Interrupt  to indicate which 
> submission has failed with the seq number . In this case driver can get the  
> failure of the  submission to MES in time and  make its own decision for what 
> to do next , What do you think about this ?
>
> Regards
> Shaoyun.liu
>
> -----Original Message-----
> From: amd-gfx <amd-gfx-boun...@lists.freedesktop.org> On Behalf Of
> Christian König
> Sent: Wednesday, May 29, 2024 11:19 AM
> To: Li, Yunxiang (Teddy) <yunxiang...@amd.com>; Koenig, Christian
> <christian.koe...@amd.com>; amd-gfx@lists.freedesktop.org
> Subject: Re: [PATCH v2 03/10] drm/amdgpu: abort fence poll if reset is
> started
>
> Am 29.05.24 um 16:48 schrieb Li, Yunxiang (Teddy):
>> [AMD Official Use Only - AMD Internal Distribution Only]
>>
>>> Yeah, I know. That's one of the reason I've pointed out on the patch
>>> adding that that this behavior is actually completely broken.
>>>
>>> If you run into issues with the MES because of this then please
>>> suggest a revert of that patch.
>> I think it just need to be improved to allow this force-signal behavior. The 
>> current behavior is slow/inconvenient, but the old behavior is wrong. Since 
>> MES will continue process submissions even when one submission failed. So 
>> with just one fence location there's no way to tell if a command failed or 
>> not.
> No the MES behavior is broken. When a submission failed it should stop 
> processing or signal that the operation didn't completed through some other 
> mechanism.
>
> Just not writing the fence and continuing results in tons of problems, from 
> the TLB fence all the way to the ring buffer and reset handling.
>
> This is a hard requirement and really can't be changed.
>
> Regards,
> Christian.

RE: [PATCH v2 03/10] drm/amdgpu: abort fence poll if reset is started

Reply via email to