On 4/23/2025 1:23 AM, Jacek Lawrynowicz wrote:
Hi,

On 4/18/2025 5:27 PM, Jeffrey Hugo wrote:
On 4/16/2025 4:25 AM, Maciej Falkowski wrote:
From: Karol Wachowski <karol.wachow...@intel.com>

Introduce a heartbeat-based Timeout Detection and Recovery (TDR) mechanism.
The enhancement aims to improve the reliability of device hang detection by
monitoring heartbeat updates.

Each progressing inference will update heartbeat counter allowing driver to
monitor its progression. Limit maximum number of reschedules when heartbeat
indicates progression to 30.

Code looks good.  However, why 30?  This would artificially limit how long a 
job could run, no?

Yes, we still need a time based limit. There may be workloads that are stuck in 
infinite loop for example.
With this patch the max time the job can run is extended from 2 to 60 seconds.
We are not aware of any workloads that exceed this timeout at the moment.

Infinite loop vs something that just happens to be running long by design is a difficult problem. 60 seconds does not seem all that long to me. Perhaps consider some kind of override so that if/when a workload comes along that needs more than 60 seconds, the user doesn't need to recompile their kernel to make it work? I suspect that would be outside the scope of this change.

For this change, I think it would be good to add some info from your response. I think the commit text would be improved by stating this increases the max runtime from 2 seconds to 60, and that this covers all known workloads. Also, I think a comment on PM_TDR_HEARTBEAT_LIMIT that tells how long the limit is (60 seconds) would be helpful to future readers, instead of needing to parse through multiple functions and how they all interact.

With the commit text update -

Reviewed-by: Jeff Hugo <jeff.h...@oss.qualcomm.com>

The PM_TDR_HEARTBEAT_LIMIT comment is optional to me.

Reply via email to