The problem is that this re-submitting is currently an integral part of
how SRIOV works.
The host can send a function level reset request to the clients when it
sees that some schedule switching didn't worked as expected and in this
case (and only this case) the hardware has actually never started to
even work on the IBs. So the re-submission is actually save from this side.
But in general you are right, the sw side is just completely broken
because we came up with a bunch of rather strict rules for the dma_fence
implementation (and those rules are perfectly valid and necessary).
Regards,
Christian.
Am 26.10.22 um 18:10 schrieb Liu, Shaoyun:
[AMD Official Use Only - General]
The user space shouldn't care about SRIOV or not , I don't think we need to
keep the re-submission for SRIOV as well. The reset from SRIOV could trigger
the host do a whole GPU reset which will have the same issue as bare metal.
Regards
Shaoyun.liu
-----Original Message-----
From: amd-gfx <amd-gfx-boun...@lists.freedesktop.org> On Behalf Of Christian
König
Sent: Wednesday, October 26, 2022 11:36 AM
To: Tuikov, Luben <luben.tui...@amd.com>; Prosyak, Vitaly <vitaly.pros...@amd.com>;
Deucher, Alexander <alexander.deuc...@amd.com>; daniel.vet...@ffwll.ch;
amd-gfx@lists.freedesktop.org; dri-de...@lists.freedesktop.org
Cc: Koenig, Christian <christian.koe...@amd.com>
Subject: [PATCH 2/5] drm/amdgpu: stop resubmitting jobs for bare metal reset
Re-submitting IBs by the kernel has many problems because pre- requisite state
is not automatically re-created as well. In other words neither binary
semaphores nor things like ring buffer pointers are in the state they should be
when the hardware starts to work on the IBs again.
Additional to that even after more than 5 years of developing this feature it
is still not stable and we have massively problems getting the reference counts
right.
As discussed with user space developers this behavior is not helpful in the
first place. For graphics and multimedia workloads it makes much more sense to
either completely re-create the context or at least re-submitting the IBs from
userspace.
For compute use cases re-submitting is also not very helpful since userspace
must rely on the accuracy of the result.
Because of this we stop this practice and instead just properly note that the
fence submission was canceled. The only use case we keep the re-submission for
now is SRIOV and function level resets.
Signed-off-by: Christian König <christian.koe...@amd.com>
---
drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index d4584e577b51..39e94feba1ac 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -5288,7 +5288,8 @@ int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
continue;
/* No point to resubmit jobs if we didn't HW reset*/
- if (!tmp_adev->asic_reset_res && !job_signaled)
+ if (!tmp_adev->asic_reset_res && !job_signaled &&
+ amdgpu_sriov_vf(tmp_adev))
drm_sched_resubmit_jobs(&ring->sched);
drm_sched_start(&ring->sched,
!tmp_adev->asic_reset_res);
--
2.25.1