Hi,

On 5/28/2025 7:53 PM, Lizhi Hou wrote:
>
> On 5/28/25 08:42, Jacek Lawrynowicz wrote:
>> From: Karol Wachowski <karol.wachow...@intel.com>
>>
>> Trigger full device recovery when the driver fails to restore device state
>> via engine reset and resume operations. This is necessary because, even if
>> submissions from a faulty context are blocked, the NPU may still process
>> previously submitted faulty jobs if the engine reset fails to abort them.
>> Such jobs can continue to generate faults and occupy device resources.
>> When engine reset is ineffective, the only way to recover is to perform
>> a full device recovery.
>>
>> Fixes: dad945c27a42 ("accel/ivpu: Add handling of
>> VPU_JSM_STATUS_MVNCI_CONTEXT_VIOLATION_HW")
>> Cc: <sta...@vger.kernel.org> # v6.15+
>> Signed-off-by: Karol Wachowski <karol.wachow...@intel.com>
>> Signed-off-by: Jacek Lawrynowicz <jacek.lawrynow...@linux.intel.com>
>> ---
>>  drivers/accel/ivpu/ivpu_job.c     | 6 ++++--
>>  drivers/accel/ivpu/ivpu_jsm_msg.c | 9 +++++++--
>>  2 files changed, 11 insertions(+), 4 deletions(-)
>>
>> diff --git a/drivers/accel/ivpu/ivpu_job.c b/drivers/accel/ivpu/ivpu_job.c
>> index 1c8e283ad9854..fae8351aa3309 100644
>> --- a/drivers/accel/ivpu/ivpu_job.c
>> +++ b/drivers/accel/ivpu/ivpu_job.c
>> @@ -986,7 +986,8 @@ void ivpu_context_abort_work_fn(struct work_struct *work)
>>           return;
>>       if (vdev->fw->sched_mode == VPU_SCHEDULING_MODE_HW)
>> -        ivpu_jsm_reset_engine(vdev, 0);
>> +        if (ivpu_jsm_reset_engine(vdev, 0))
>> +            return;
>
> Is it possible the context aborting is entered again before the full device
> recovery work is executed?

This is a good point, but ivpu_context_abort_work_fn() is triggered by an IRQ,
and the first thing we do when triggering recovery is disable IRQs.
The recovery work also flushes context_abort_work before starting to tear
everything down, so we should be safe.

Regards,
Jacek
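
The ordering argument in the reply above can be illustrated with a small
userspace model. This is only a sketch under stated assumptions: every name in
it (model_dev, model_irq, model_trigger_recovery, model_recovery_work, and so
on) is hypothetical and none of it is ivpu driver code. It assumes, as the
reply states, that triggering recovery disables IRQs first and that the
recovery work flushes any pending abort work before teardown.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical userspace model; these are NOT ivpu driver symbols. */
struct model_dev {
	bool irq_enabled;              /* may the IRQ handler queue abort work? */
	bool abort_work_pending;       /* abort work queued but not yet run */
	bool torn_down;                /* device state has been torn down */
	int abort_runs;                /* times the abort work has run */
	int abort_runs_after_teardown; /* must stay 0 for the ordering to be safe */
};

/* Models the IRQ path: abort work is queued only while IRQs are enabled. */
bool model_irq(struct model_dev *d)
{
	if (!d->irq_enabled)
		return false;
	d->abort_work_pending = true;
	return true;
}

/* Models flushing the abort work: run it now if it is pending. */
void model_flush_abort_work(struct model_dev *d)
{
	if (!d->abort_work_pending)
		return;
	d->abort_work_pending = false;
	d->abort_runs++;
	if (d->torn_down)
		d->abort_runs_after_teardown++;
}

/* Models triggering recovery: the first step is to disable IRQs. */
void model_trigger_recovery(struct model_dev *d)
{
	d->irq_enabled = false;
}

/* Models the recovery work: flush pending abort work, then tear down. */
void model_recovery_work(struct model_dev *d)
{
	model_flush_abort_work(d);
	assert(!d->abort_work_pending); /* nothing can re-queue: IRQs are off */
	d->torn_down = true;
}
```

In this model, calling model_irq() after model_trigger_recovery() returns false
and queues nothing, and abort work that was queued earlier runs inside
model_recovery_work() before torn_down is set, so abort_runs_after_teardown
stays at zero.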