Hi,

On 4/18/2025 5:27 PM, Jeffrey Hugo wrote:
> On 4/16/2025 4:25 AM, Maciej Falkowski wrote:
>> From: Karol Wachowski <karol.wachow...@intel.com>
>>
>> Introduce a heartbeat-based Timeout Detection and Recovery (TDR) mechanism.
>> The enhancement aims to improve the reliability of device hang detection by
>> monitoring heartbeat updates.
>>
>> Each progressing inference will update heartbeat counter allowing driver to
>> monitor its progression. Limit maximum number of reschedules when heartbeat
>> indicates progression to 30.
> 
> Code looks good.  However, why 30?  This would artificially limit how long a 
> job could run, no?

Yes, we still need a time-based limit. There may be workloads that are stuck in 
an infinite loop, for example.
With this patch, the maximum time a job can run is extended from 2 to 60 seconds.
We are not aware of any workloads that exceed this timeout at the moment.
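
For illustration only, here is a minimal sketch of how the 60 second bound
could arise. It is not the code from the patch. It assumes the TDR timer
period is 2 seconds and that the limit of 30 counts timer periods per job.
All names (tdr_state, tdr_should_rearm, TDR_PERIOD_MS, TDR_MAX_PERIODS) are
hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed values: 2 s timer period, at most 30 periods per job (30 * 2 s = 60 s). */
#define TDR_PERIOD_MS   2000u
#define TDR_MAX_PERIODS 30u

/* Hypothetical per-job TDR bookkeeping, not the driver's real structure. */
struct tdr_state {
	uint64_t last_heartbeat; /* heartbeat value seen at the previous expiry */
	unsigned int periods;    /* timer periods consumed by the current job */
};

/*
 * Called on every TDR timer expiry with the current heartbeat counter.
 * Returns true if the timer should be re-armed, false if recovery should
 * start (no progress, or the time budget is used up).
 */
static bool tdr_should_rearm(struct tdr_state *tdr, uint64_t heartbeat)
{
	assert(tdr != NULL);

	bool progressed = heartbeat != tdr->last_heartbeat;

	tdr->last_heartbeat = heartbeat;
	tdr->periods++;

	if (!progressed)
		return false; /* heartbeat did not move: treat as a hang */
	if (tdr->periods >= TDR_MAX_PERIODS)
		return false; /* still progressing, but out of budget */
	return true;
}

/* Upper bound on job run time under the assumptions above. */
static unsigned int tdr_max_runtime_ms(void)
{
	return TDR_PERIOD_MS * TDR_MAX_PERIODS;
}
```

In this sketch a job whose heartbeat never changes is stopped at the first
expiry (2 seconds), while a job that keeps progressing is stopped at the
30th expiry (60 seconds).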

Regards,
Jacek
