On 6/2/25 06:05, Jacek Lawrynowicz wrote:
Hi,
On 5/28/2025 7:53 PM, Lizhi Hou wrote:
On 5/28/25 08:42, Jacek Lawrynowicz wrote:
From: Karol Wachowski <karol.wachow...@intel.com>
Trigger full device recovery when the driver fails to restore device state
via engine reset and resume operations. This is necessary because, even if
submissions from a faulty context are blocked, the NPU may still process
previously submitted faulty jobs if the engine reset fails to abort them.
Such jobs can continue to generate faults and occupy device resources.
When engine reset is ineffective, the only way to recover is to perform
a full device recovery.
Fixes: dad945c27a42 ("accel/ivpu: Add handling of VPU_JSM_STATUS_MVNCI_CONTEXT_VIOLATION_HW")
Cc: <sta...@vger.kernel.org> # v6.15+
Signed-off-by: Karol Wachowski <karol.wachow...@intel.com>
Signed-off-by: Jacek Lawrynowicz <jacek.lawrynow...@linux.intel.com>
---
drivers/accel/ivpu/ivpu_job.c | 6 ++++--
drivers/accel/ivpu/ivpu_jsm_msg.c | 9 +++++++--
2 files changed, 11 insertions(+), 4 deletions(-)
diff --git a/drivers/accel/ivpu/ivpu_job.c b/drivers/accel/ivpu/ivpu_job.c
index 1c8e283ad9854..fae8351aa3309 100644
--- a/drivers/accel/ivpu/ivpu_job.c
+++ b/drivers/accel/ivpu/ivpu_job.c
@@ -986,7 +986,8 @@ void ivpu_context_abort_work_fn(struct work_struct *work)
 		return;
 	if (vdev->fw->sched_mode == VPU_SCHEDULING_MODE_HW)
-		ivpu_jsm_reset_engine(vdev, 0);
+		if (ivpu_jsm_reset_engine(vdev, 0))
+			return;
Is it possible for the context abort work to be entered again before the full
device recovery work is executed?
This is a good point, but ivpu_context_abort_work_fn() is triggered by an IRQ,
and the first thing we do when triggering recovery is disable IRQs.
The recovery work also flushes context_abort_work before starting to tear
everything down, so we should be safe.
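The ordering argument above can be sketched as a small userspace model in C.
This is a toy, not driver code: every toy_* name and field is hypothetical,
the work queue is modeled as a plain flag, and having the failed engine reset
trigger recovery directly is an assumption based on the commit message. It
only encodes the two properties described above: triggering recovery disables
IRQs first, and the recovery work flushes any pending abort work before
teardown.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy userspace model; all names are hypothetical, not the ivpu driver API. */
struct toy_dev {
	bool irq_enabled;        /* abort work is only queued from the IRQ path */
	bool abort_work_pending; /* models a queued context abort work item */
	bool recovery_pending;   /* models a scheduled full device recovery */
	bool engine_reset_fails; /* fault injection for the engine reset */
	int abort_runs;          /* number of times the abort handler ran */
	int recoveries;          /* number of completed full recoveries */
};

/* IRQ path: queue abort work only while IRQs are enabled. */
static bool toy_irq_context_violation(struct toy_dev *d)
{
	if (!d->irq_enabled)
		return false;
	d->abort_work_pending = true;
	return true;
}

/* Triggering recovery: disable IRQs first, then schedule the recovery work. */
static void toy_trigger_recovery(struct toy_dev *d)
{
	d->irq_enabled = false;
	d->recovery_pending = true;
}

/* Engine reset: nonzero on failure (assumed to escalate to full recovery). */
static int toy_reset_engine(struct toy_dev *d)
{
	if (d->engine_reset_fails) {
		toy_trigger_recovery(d);
		return -1;
	}
	return 0;
}

/* Abort work: stop early when the engine reset failed. */
static void toy_context_abort_work(struct toy_dev *d)
{
	d->abort_work_pending = false;
	d->abort_runs++;
	if (toy_reset_engine(d))
		return;
	/* Engine resume and job cleanup would follow here. */
}

/* Recovery work: flush pending abort work before tearing anything down. */
static void toy_recovery_work(struct toy_dev *d)
{
	if (d->abort_work_pending)
		toy_context_abort_work(d); /* models flushing the work item */
	assert(!d->abort_work_pending);
	d->recovery_pending = false;
	d->recoveries++;
	d->engine_reset_fails = false; /* device is back in a clean state */
	d->irq_enabled = true;
}
```

In this model, once a failed engine reset has triggered recovery, a further
toy_irq_context_violation() call returns false, so the abort handler cannot be
queued again until toy_recovery_work() has finished and re-enabled IRQs.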
Reviewed-by: Lizhi Hou <lizhi....@amd.com>
Regards,
Jacek