Re: [PATCH] accel/ivpu: Implement heartbeat-based TDR mechanism

Jeff Hugo Wed, 23 Apr 2025 07:23:59 -0700

On 4/23/2025 1:23 AM, Jacek Lawrynowicz wrote:

Hi,


On 4/18/2025 5:27 PM, Jeffrey Hugo wrote:

On 4/16/2025 4:25 AM, Maciej Falkowski wrote:

From: Karol Wachowski <karol.wachow...@intel.com>

Introduce a heartbeat-based Timeout Detection and Recovery (TDR) mechanism.
The enhancement aims to improve the reliability of device hang detection by
monitoring heartbeat updates.

Each progressing inference will update heartbeat counter allowing driver to
monitor its progression. Limit maximum number of reschedules when heartbeat
indicates progression to 30.


Code looks good.  However, why 30?  This would artificially limit how long a 
job could run, no?


Yes, we still need a time based limit. There may be workloads that are stuck in 
infinite loop for example.
With this patch the max time the job can run is extended from 2 to 60 seconds.
We are not aware of any workloads that exceed this timeout at the moment.

Infinite loop vs something that just happens to be running long bydesign is a difficult problem. 60 seconds does not seem all that longto me. Perhaps consider some kind of override so that if/when aworkload comes along that needs more than 60 seconds, the user doesn'tneed to recompile their kernel to make it work? I suspect that would beoutside the scope of this change.

For this change, I think it would be good to add some info from yourresponse. I think the commit text would be improved by stating thisincreases the max runtime from 2 seconds to 60, and that this covers allknown workloads. Also, I think a comment on PM_TDR_HEARTBEAT_LIMIT thattells how long the limit is (60 seconds) would be helpful to futurereaders, instead of needing to parse through multiple functions and howthey all interact.


With the commit text update -

Reviewed-by: Jeff Hugo <jeff.h...@oss.qualcomm.com>

The PM_TDR_HEARTBEAT_LIMIT comment is optional to me.

Re: [PATCH] accel/ivpu: Implement heartbeat-based TDR mechanism

Reply via email to