Hi Shaoyun,
see inline.
Am 03.06.24 um 20:28 schrieb Liu, Shaoyun:
[AMD Official Use Only - AMD Internal Distribution Only]
Thanks Christian for the detail explanation.
I checked your patch , you try to use query_scheduler_status package to check
the command completion . It may not work as expected since this API query the
status is for MES itself , so mes can update the fence address with the
expected seq value, but the command itself (ex .remove_queue for mes and
then mes send the unmap_queue to kiq internally) still fails.
And that is exactly the desired behavior.
See the fence value is for ring buffer processing and getting feedback
in the case of a reset for example if the MES has processed the commands.
If that processing is successfully or not *must* be completely
irrelevant for writing the fence value.
For mes , driver always poll for the command completion
No, exactly that's what we want to avoid as much as possible.
Polling means that we throw away tons of CPU cycles and especially on
fault handling and TLB flushing that is an absolutely in-acceptable
performance loss.
, do you think it's an acceptable solution that MES set a specific failure
value(ex , -1) in the fence address to indicate the failure of the operation
? But that should be similar to let driver poll the completion till timeout .
MES internally also need to wait for a timeout on some command that it sent
to CP (ex. 2 seconds for unmap_queue command).
No, what we should really do is to separate the fence and the result
values. And then give an input dependency on each operation.
I'm actually a little bit confused here , has driver use the lock to ensure
there is only one submission into MES at any time ?
No, exactly that's what we try to avoid. Othertwise we don't need a ring
buffer in the first place.
MES can also trigger the interrupt on the failure if driver side require us
to do so , the payload will have the seq number to indicate which submission
cause the failure , that might requires more code change from driver side
.Please let me what's preferred from driver side.
I can come up with a more detailed explanation of the driver
requirements when I'm back from sick leave.
Regards,
Christian.
Regards
Shaoyun.liu
-----Original Message-----
From: Koenig, Christian <christian.koe...@amd.com>
Sent: Monday, June 3, 2024 6:59 AM
To: Liu, Shaoyun <shaoyun....@amd.com>; Christian König <ckoenig.leichtzumer...@gmail.com>; Li,
Yunxiang (Teddy) <yunxiang...@amd.com>; amd-gfx@lists.freedesktop.org; Deucher, Alexander
<alexander.deuc...@amd.com>; Xiao, Hua <hua.x...@amd.com>
Subject: Re: [PATCH v2 03/10] drm/amdgpu: abort fence poll if reset is started
Hi Shaoyun,
yes my thinking goes into the same direction. The basic problem here is that we
are trying to stuff two different information into the same variable.
The first information is if the commands haven been read by the MES from the
ring buffer. This information is necessary for the normal ring buffer and reset
handling, e.g. prevents ring buffer overflow, ordering of command, lockups
during reset etc...
The second information is if a certain operation was successfully or not. For
example this is necessary to get signaled back if y queue map/unmap operation
has been successfully or if the CP not responding or any other error has
happened etc...
Another issue is that while it is in general a good idea to have the firmware
work in a way where errors are reported instead of completely stopping all
processing, here we run into trouble because the driver usually assumes that
work can be scheduled on the ring buffer and a subsequent work is processed
only when everything previously submitted has completed successfully.
So as initial fix for the issue we see I've send Alex a patch on Friday to
partially revert his change to use an individual writeback for each submission.
Instead we will submit an addition QUERY_STATUS command after the real command
and let that one write fence value. This way the fence value is always written,
independent of the result of the operation.
Additional to that we need to insert something like a dependency between
submissions, e.g. when you have commands A, B and C on the ring and C can only
execute when A was successfully then we need to somehow tell that the MES. Only
other alternative is to not scheduler commands behind each other on the ring
and that in turn is a bad idea from the performance point of view.
Regards,
Christian.
Am 31.05.24 um 16:44 schrieb Liu, Shaoyun:
[AMD Official Use Only - AMD Internal Distribution Only]
Hi, Christian
I think we have a discussion about this before . Alex also have a change that
allow driver to use different write back address for the fence for each
submission for the original issue .
From MES point of view , MES will update the fence when the API can be
complete successfully, so if the API (ex . remove_queue) fails due to other
component issue (ex , CP hang), the MES will not update the fence In this
situation , but MES itself still works and can respond to other commands (ex
,,read_reg) . Alex's change allow driver to check the fence for each API
without mess around them . If you expect MES to stop responding to further
commands after one API fails , that will introduce combability issue since
this design already exist on products for customer and MES also need to works
for windows . Also MES always need to respond to some commands like RESET
etc that might make things worse if we need to change the logic .
One possible solution is MES can trigger an Interrupt to indicate which
submission has failed with the seq number . In this case driver can get the
failure of the submission to MES in time and make its own decision for what
to do next , What do you think about this ?
Regards
Shaoyun.liu
-----Original Message-----
From: amd-gfx <amd-gfx-boun...@lists.freedesktop.org> On Behalf Of
Christian König
Sent: Wednesday, May 29, 2024 11:19 AM
To: Li, Yunxiang (Teddy) <yunxiang...@amd.com>; Koenig, Christian
<christian.koe...@amd.com>; amd-gfx@lists.freedesktop.org
Subject: Re: [PATCH v2 03/10] drm/amdgpu: abort fence poll if reset is
started
Am 29.05.24 um 16:48 schrieb Li, Yunxiang (Teddy):
[AMD Official Use Only - AMD Internal Distribution Only]
Yeah, I know. That's one of the reason I've pointed out on the patch
adding that that this behavior is actually completely broken.
If you run into issues with the MES because of this then please
suggest a revert of that patch.
I think it just need to be improved to allow this force-signal behavior. The
current behavior is slow/inconvenient, but the old behavior is wrong. Since MES
will continue process submissions even when one submission failed. So with just
one fence location there's no way to tell if a command failed or not.
No the MES behavior is broken. When a submission failed it should stop
processing or signal that the operation didn't completed through some other
mechanism.
Just not writing the fence and continuing results in tons of problems, from the
TLB fence all the way to the ring buffer and reset handling.
This is a hard requirement and really can't be changed.
Regards,
Christian.