Hi,

On 4/18/2025 5:27 PM, Jeffrey Hugo wrote:
> On 4/16/2025 4:25 AM, Maciej Falkowski wrote:
>> From: Karol Wachowski <karol.wachow...@intel.com>
>>
>> Introduce a heartbeat-based Timeout Detection and Recovery (TDR) mechanism.
>> The enhancement aims to improve the reliability of device hang detection by
>> monitoring heartbeat updates.
>>
>> Each progressing inference will update heartbeat counter allowing driver to
>> monitor its progression. Limit maximum number of reschedules when heartbeat
>> indicates progression to 30.
>
> Code looks good. However, why 30? This would artificially limit how long a
> job could run, no?

Yes, we still need a time-based limit. There may be workloads that are stuck
in an infinite loop, for example. With this patch, the maximum time a job can
run is extended from 2 to 60 seconds. We are not aware of any workloads that
exceed this timeout at the moment.

Regards,
Jacek
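
P.S. To make the 2 to 60 second figure concrete, here is a minimal sketch of the bound described above. It is not the driver code: the names (tdr_state, tdr_should_recover, TDR_TIMEOUT_MS, TDR_MAX_PERIODS) are hypothetical, and it assumes the 30 limit is counted as 30 timer periods of 2 seconds each, which is simply the counting that matches the 60 second total.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical constants: 2 s per timer period, at most 30 periods. */
#define TDR_TIMEOUT_MS  2000u
#define TDR_MAX_PERIODS 30u

/* Hypothetical per-device state for this sketch. */
struct tdr_state {
	uint64_t last_heartbeat;  /* heartbeat value seen at the previous expiry */
	unsigned int expirations; /* timer periods elapsed for the current job */
};

/*
 * Called each time the TDR timer expires. Returns true when recovery
 * should be triggered, false when the timer should be rescheduled.
 */
static bool tdr_should_recover(struct tdr_state *s, uint64_t heartbeat)
{
	bool progressed = heartbeat != s->last_heartbeat;

	s->last_heartbeat = heartbeat;
	s->expirations++;

	/* No heartbeat update during the last period: treat as a hang. */
	if (!progressed)
		return true;

	/* Still progressing, but the time-based budget is used up. */
	return s->expirations >= TDR_MAX_PERIODS;
}

/* Worst-case run time of a job that keeps updating its heartbeat. */
static unsigned int tdr_max_runtime_ms(void)
{
	return TDR_TIMEOUT_MS * TDR_MAX_PERIODS;
}
```

Under these assumptions, a job whose heartbeat stops advancing is still caught after one 2 second period, while a job that keeps making progress (or keeps looping while updating its heartbeat) is stopped after 30 periods, i.e. 60 seconds.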