Re: [PATCH] accel/ivpu: Trigger device recovery on engine reset/resume failure

Jacek Lawrynowicz Mon, 02 Jun 2025 06:05:55 -0700

Hi,

On 5/28/2025 7:53 PM, Lizhi Hou wrote:
> 
> On 5/28/25 08:42, Jacek Lawrynowicz wrote:
>> From: Karol Wachowski <karol.wachow...@intel.com>
>>
>> Trigger full device recovery when the driver fails to restore device state
>> via engine reset and resume operations. This is necessary because, even if
>> submissions from a faulty context are blocked, the NPU may still process
>> previously submitted faulty jobs if the engine reset fails to abort them.
>> Such jobs can continue to generate faults and occupy device resources.
>> When engine reset is ineffective, the only way to recover is to perform
>> a full device recovery.
>>
>> Fixes: dad945c27a42 ("accel/ivpu: Add handling of VPU_JSM_STATUS_MVNCI_CONTEXT_VIOLATION_HW")
>> Cc: <sta...@vger.kernel.org> # v6.15+
>> Signed-off-by: Karol Wachowski <karol.wachow...@intel.com>
>> Signed-off-by: Jacek Lawrynowicz <jacek.lawrynow...@linux.intel.com>
>> ---
>>   drivers/accel/ivpu/ivpu_job.c     | 6 ++++--
>>   drivers/accel/ivpu/ivpu_jsm_msg.c | 9 +++++++--
>>   2 files changed, 11 insertions(+), 4 deletions(-)
>>
>> diff --git a/drivers/accel/ivpu/ivpu_job.c b/drivers/accel/ivpu/ivpu_job.c
>> index 1c8e283ad9854..fae8351aa3309 100644
>> --- a/drivers/accel/ivpu/ivpu_job.c
>> +++ b/drivers/accel/ivpu/ivpu_job.c
>> @@ -986,7 +986,8 @@ void ivpu_context_abort_work_fn(struct work_struct *work)
>>           return;
>>         if (vdev->fw->sched_mode == VPU_SCHEDULING_MODE_HW)
>> -        ivpu_jsm_reset_engine(vdev, 0);
>> +        if (ivpu_jsm_reset_engine(vdev, 0))
>> +            return;
> 
> Is it possible the context aborting is entered again before the full device 
> recovery work is executed?


This is a good point, but ivpu_context_abort_work_fn() is triggered by an IRQ,
and the first thing we do when triggering recovery is disable IRQs.
The recovery work also flushes context_abort_work before starting to tear
everything down, so we should be safe.
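
To make the ordering argument above concrete, here is a minimal userspace
sketch in C. It is not driver code: every name in it (model_dev,
model_irq_context_abort(), and so on) is hypothetical, and it only models the
three steps described above (disable IRQs, flush the pending abort work, then
tear down) with plain flags instead of real IRQs and workqueues.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of the device state; not the ivpu driver's structures. */
struct model_dev {
	bool irq_enabled;           /* the context-abort IRQ can still fire */
	bool abort_pending;         /* abort work is queued but has not run */
	bool torn_down;             /* recovery has started tearing down state */
	int aborts_run;             /* number of times the abort work ran */
	int aborts_after_teardown;  /* abort work that ran after teardown */
};

/* Models the IRQ path: abort work is queued only while IRQs are enabled. */
static bool model_irq_context_abort(struct model_dev *d)
{
	if (!d->irq_enabled)
		return false;
	d->abort_pending = true;
	return true;
}

/* Models the abort work item: runs at most once per queued request. */
static void model_run_abort_work(struct model_dev *d)
{
	if (!d->abort_pending)
		return;
	d->abort_pending = false;
	d->aborts_run++;
	if (d->torn_down)
		d->aborts_after_teardown++;
}

/* Models triggering recovery: the first step is disabling IRQs. */
static void model_trigger_recovery(struct model_dev *d)
{
	d->irq_enabled = false;
}

/* Models the recovery work: flush pending abort work, then tear down. */
static void model_recovery_work(struct model_dev *d)
{
	/* Recovery is assumed to have been triggered, so IRQs are already off. */
	assert(!d->irq_enabled);
	model_run_abort_work(d);  /* stands in for flushing context_abort_work */
	d->torn_down = true;
}
```

In this model, an abort queued before recovery is triggered still runs, but it
runs during the flush and therefore before teardown. Once IRQs are disabled no
new abort can be queued, so aborts_after_teardown stays at zero.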

Regards,
Jacek

Re: [PATCH] accel/ivpu: Trigger device recovery on engine reset/resume failure
