On 4/23/2025 1:23 AM, Jacek Lawrynowicz wrote:
Hi,
On 4/18/2025 5:27 PM, Jeffrey Hugo wrote:
On 4/16/2025 4:25 AM, Maciej Falkowski wrote:
From: Karol Wachowski <karol.wachow...@intel.com>
Introduce a heartbeat-based Timeout Detection and Recovery (TDR) mechanism.
The enhancement aims to improve the reliability of device hang detection by
monitoring heartbeat updates.
Each progressing inference will update heartbeat counter allowing driver to
monitor its progression. Limit maximum number of reschedules when heartbeat
indicates progression to 30.
Code looks good. However, why 30? This would artificially limit how long a
job could run, no?
Yes, we still need a time based limit. There may be workloads that are stuck in
infinite loop for example.
With this patch the max time the job can run is extended from 2 to 60 seconds.
We are not aware of any workloads that exceed this timeout at the moment.
Infinite loop vs something that just happens to be running long by
design is a difficult problem. 60 seconds does not seem all that long
to me. Perhaps consider some kind of override so that if/when a
workload comes along that needs more than 60 seconds, the user doesn't
need to recompile their kernel to make it work? I suspect that would be
outside the scope of this change.
For this change, I think it would be good to add some info from your
response. I think the commit text would be improved by stating this
increases the max runtime from 2 seconds to 60, and that this covers all
known workloads. Also, I think a comment on PM_TDR_HEARTBEAT_LIMIT that
tells how long the limit is (60 seconds) would be helpful to future
readers, instead of needing to parse through multiple functions and how
they all interact.
With the commit text update -
Reviewed-by: Jeff Hugo <jeff.h...@oss.qualcomm.com>
The PM_TDR_HEARTBEAT_LIMIT comment is optional to me.