On 4/16/2025 4:25 AM, Maciej Falkowski wrote:
From: Karol Wachowski <karol.wachow...@intel.com>
Introduce a heartbeat-based Timeout Detection and Recovery (TDR) mechanism.
The enhancement aims to improve the reliability of device hang detection by
monitoring heartbeat updates.
Each progressing inference updates a heartbeat counter, allowing the driver
to monitor its progress. Limit the maximum number of reschedules to 30 when
the heartbeat indicates progress.
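
For readers without the patch at hand, the check described above might look roughly like the following. This is a minimal sketch, not the driver's code: the struct, enum, and function names are hypothetical, and it assumes the heartbeat is sampled once per timeout expiry and that the reschedule count is consecutive. Only the cap of 30 comes from the description.

```c
#include <stdbool.h>
#include <stdint.h>

/* Cap from the patch description; every other name here is hypothetical. */
#define TDR_MAX_RESCHEDULES 30

struct tdr_state {
	uint64_t last_heartbeat;  /* heartbeat value seen at the previous expiry */
	unsigned int reschedules; /* consecutive timer re-arms granted so far */
};

enum tdr_action { TDR_RESCHEDULE, TDR_RECOVER };

/*
 * Assumed to be called each time the timeout timer expires while a job is
 * still in flight. Re-arms the timer only if the heartbeat moved since the
 * last expiry and the cap has not been reached; otherwise asks for recovery.
 */
static enum tdr_action tdr_on_timeout(struct tdr_state *s, uint64_t heartbeat)
{
	bool progressed = heartbeat != s->last_heartbeat;

	s->last_heartbeat = heartbeat;
	if (progressed && s->reschedules < TDR_MAX_RESCHEDULES) {
		s->reschedules++;
		return TDR_RESCHEDULE;
	}

	s->reschedules = 0;
	return TDR_RECOVER;
}

/*
 * Upper bound on job runtime under this sketch: the initial timeout period
 * plus one more period per granted reschedule.
 */
static uint64_t tdr_max_runtime_ms(uint64_t timeout_ms)
{
	return (uint64_t)(TDR_MAX_RESCHEDULES + 1) * timeout_ms;
}
```

Under these assumptions, a job that keeps making progress is still recovered after 31 timeout periods, so the effective runtime ceiling scales with whatever timeout value is configured.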
Code looks good. However, why 30? This would artificially limit how
long a job could run, no?
-Jeff