[AMD Official Use Only - AMD Internal Distribution Only]

Hi Christian,

I think we have discussed this before. Alex also has a change that allows the
driver to use a different write-back address for the fence of each
submission, which addresses the original issue.
From the MES point of view, MES updates the fence only when the API completes
successfully. If an API (e.g. remove_queue) fails because of a problem in
another component (e.g. a CP hang), MES does not update the fence, but MES
itself still works and can respond to other commands (e.g. read_reg). Alex's
change allows the driver to check the fence for each API call without mixing
them up. If you expect MES to stop responding to further commands after one
API fails, that would introduce a compatibility issue, since this design
already exists in products shipped to customers and MES also needs to work
for Windows. MES also always needs to respond to some commands, such as
RESET, so changing the logic might make things worse.
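
To make the per-submission fence idea concrete, here is a rough sketch in
plain C. It is not the actual amdgpu or MES code: the names fence_pool,
submission_begin, fake_mes_execute and submission_done are hypothetical, and
the firmware is replaced by a stand-in function that writes the fence only
when the API succeeds, as described above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical pool of write-back slots, one per in-flight submission. */
#define NUM_FENCE_SLOTS 4

struct fence_pool {
	volatile uint64_t slot[NUM_FENCE_SLOTS];
	uint64_t next_seq;	/* sequence numbers start at 1; 0 means "not signaled" */
};

/*
 * Driver side: pick a write-back slot for a new submission and clear it.
 * Assumes at most NUM_FENCE_SLOTS submissions are in flight, otherwise
 * two submissions would share a slot.
 */
static uint64_t submission_begin(struct fence_pool *p, volatile uint64_t **wb)
{
	assert(p != NULL && wb != NULL);

	uint64_t seq = ++p->next_seq;
	size_t idx = (size_t)(seq % NUM_FENCE_SLOTS);

	p->slot[idx] = 0;
	*wb = &p->slot[idx];
	return seq;
}

/* Stand-in for the firmware: the fence is written only on success. */
static void fake_mes_execute(volatile uint64_t *wb, uint64_t seq, bool api_ok)
{
	if (api_ok)
		*wb = seq;
}

/* Driver side: a submission is complete once its own slot holds its seq. */
static bool submission_done(const volatile uint64_t *wb, uint64_t seq)
{
	return *wb == seq;
}
```

With one slot per submission, a later successful command no longer hides an
earlier failed one: the failed submission's slot simply never receives its
sequence number, and the driver can time out on that slot alone.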

One possible solution is for MES to trigger an interrupt that indicates which
submission has failed, identified by its sequence number. The driver would
then learn about the failed submission in time and could decide for itself
what to do next. What do you think about this?
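
As a sketch of how the driver side of such a failure interrupt could look,
under the assumption that the interrupt payload carries only the sequence
number: the tracker structure and the functions track_submit,
on_fence_written, on_failure_irq and track_query below are hypothetical and
only illustrate the bookkeeping, not an existing interface.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_INFLIGHT 8	/* hypothetical limit on in-flight submissions */

enum sub_state { SUB_FREE = 0, SUB_PENDING, SUB_DONE, SUB_FAILED };

struct submission {
	uint64_t seq;
	enum sub_state state;
};

struct tracker {
	struct submission inflight[MAX_INFLIGHT];
	uint64_t next_seq;	/* sequence numbers start at 1 */
};

/* Record a new submission and return its sequence number. */
static uint64_t track_submit(struct tracker *t)
{
	assert(t != NULL);

	uint64_t seq = ++t->next_seq;
	struct submission *s = &t->inflight[seq % MAX_INFLIGHT];

	s->seq = seq;
	s->state = SUB_PENDING;
	return seq;
}

/* Move a pending submission to a final state; reject unknown or stale seqs. */
static bool track_finish(struct tracker *t, uint64_t seq, enum sub_state st)
{
	struct submission *s = &t->inflight[seq % MAX_INFLIGHT];

	if (s->seq != seq || s->state != SUB_PENDING)
		return false;
	s->state = st;
	return true;
}

/* Success path: the fence for this seq was written. */
static bool on_fence_written(struct tracker *t, uint64_t seq)
{
	return track_finish(t, seq, SUB_DONE);
}

/* Failure path: handler for the proposed interrupt carrying the failed seq. */
static bool on_failure_irq(struct tracker *t, uint64_t seq)
{
	return track_finish(t, seq, SUB_FAILED);
}

/* What a waiter would check instead of polling the fence until timeout. */
static enum sub_state track_query(const struct tracker *t, uint64_t seq)
{
	const struct submission *s = &t->inflight[seq % MAX_INFLIGHT];

	return s->seq == seq ? s->state : SUB_FREE;
}
```

A waiter that sees SUB_FAILED can stop polling right away and start whatever
recovery it chooses, while other submissions keep being tracked
independently.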

Regards
Shaoyun.liu

-----Original Message-----
From: amd-gfx <amd-gfx-boun...@lists.freedesktop.org> On Behalf Of Christian 
König
Sent: Wednesday, May 29, 2024 11:19 AM
To: Li, Yunxiang (Teddy) <yunxiang...@amd.com>; Koenig, Christian 
<christian.koe...@amd.com>; amd-gfx@lists.freedesktop.org
Subject: Re: [PATCH v2 03/10] drm/amdgpu: abort fence poll if reset is started

On 29.05.24 at 16:48, Li, Yunxiang (Teddy) wrote:
>> Yeah, I know. That's one of the reasons I pointed out, on the patch that
>> added it, that this behavior is actually completely broken.
>>
>> If you run into issues with the MES because of this then please
>> suggest a revert of that patch.
> I think it just needs to be improved to allow this force-signal behavior. The
> current behavior is slow/inconvenient, but the old behavior is wrong, since
> MES will continue to process submissions even when one submission has failed.
> So with just one fence location there's no way to tell whether a command
> failed or not.

No, the MES behavior is broken. When a submission fails it should stop
processing or signal that the operation didn't complete through some other
mechanism.

Just not writing the fence and continuing results in tons of problems, from the 
TLB fence all the way to the ring buffer and reset handling.

This is a hard requirement and really can't be changed.

Regards,
Christian.
