TM heartbeat timeout due to ResourceManager being busy

Paul Lam Sun, 11 Oct 2020 21:43:48 -0700

Hi,

Even with FLINK-13184 implemented (and even on Flink 1.11), jobs with high
parallelism still occasionally get TM-RM heartbeat timeouts when the RM is
busy creating TM contexts during cluster initialization and HDFS is slow at
that moment.


Apart from increasing the TM heartbeat timeout, is there any recommended
out-of-the-box approach that can reduce the chance of getting these timeouts?

In the long run, would it be possible to limit the number of TaskManager
contexts that the RM creates at a time, so that heartbeats can be processed
in between?

Thanks!

Best,
Paul Lam
