Hi,

Even after FLINK-13184 was implemented (and even on Flink 1.11), jobs with high parallelism still occasionally hit TM-RM heartbeat timeouts when the RM is busy creating TM contexts during cluster initialization and HDFS happens to be slow at that moment.
Apart from increasing the TM heartbeat timeout, is there a recommended out-of-the-box approach that reduces the chance of hitting these timeouts? In the long run, would it be possible to limit the number of TaskManager contexts that the RM creates at a time, so that the heartbeat triggers get a chance to run in between?

Thanks!

Best,
Paul Lam