Thanks @Till Rohrmann for starting this discussion.

Firstly, I am trying to understand the benefit of a shorter heartbeat timeout.
IIUC, it will make the JobManager aware of a lost
TaskManager faster. However, it seems that only the standalone cluster
could benefit from this. For Yarn and
native Kubernetes deployments, the Flink ResourceManager should get the
TaskManager lost event in a very short time:

* Yarn: about 8 seconds (3s for Yarn NM -> Yarn RM, 5s for Yarn RM -> Flink RM)
* Native Kubernetes: less than 1 second, since the Flink RM has a watch on all the TaskManager pods
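
As a rough illustration of the comparison above, here is a back-of-the-envelope sketch in Python. It uses the figures quoted in this mail together with the current (50s) and proposed (15s) heartbeat.timeout values. The function and constant names are made up for illustration. The model itself is an assumption: the Flink RM is taken to notice a lost TaskManager through whichever signal arrives first, the platform notification or the heartbeat timeout.

```python
# Back-of-the-envelope detection latency, in seconds.
# Assumption: the Flink RM learns of a lost TaskManager from whichever comes
# first, the platform notification or the heartbeat timeout.

YARN_NOTIFICATION = 3 + 5   # Yarn NM -> Yarn RM, then Yarn RM -> Flink RM
K8S_NOTIFICATION = 1        # upper bound for the pod watch ("less than 1 second")


def detection_latency(heartbeat_timeout, platform_notification=None):
    """Return the assumed time until a lost TaskManager is noticed."""
    if platform_notification is None:
        # Standalone: no external signal is assumed, only the heartbeat.
        return heartbeat_timeout
    return min(heartbeat_timeout, platform_notification)


for timeout in (50, 15):  # current default vs. proposed default
    print(
        f"heartbeat.timeout={timeout}s:",
        f"standalone={detection_latency(timeout)}s,",
        f"yarn={detection_latency(timeout, YARN_NOTIFICATION)}s,",
        f"k8s<={detection_latency(timeout, K8S_NOTIFICATION)}s",
    )
```

Under this simplified model, only the standalone number changes when the timeout is lowered; the Yarn and Kubernetes numbers are already bounded by the platform notification.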

Secondly, I am not very confident about decreasing the timeout to 15s. I have
quickly checked the TaskManager GC logs
of our internal Flink workloads from the past week and found more than 100
Full GC pauses of around 10 seconds, but none longer than 15s.
We are using CMS GC for the old generation.
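
To show how little margin a 10-second Full GC leaves under the proposed values, here is a small Python sketch. It rests on a deliberately simplified assumption that is not taken from the Flink code: the longest silence seen by the heartbeat monitor is one heartbeat interval plus the GC pause, ignoring network delay and scheduling jitter. The function name is hypothetical.

```python
def heartbeat_headroom(gc_pause, heartbeat_timeout, heartbeat_interval):
    """Seconds left before the timeout fires, under the simplified model
    worst-case silence = one heartbeat interval + the GC pause."""
    worst_case_silence = heartbeat_interval + gc_pause
    return heartbeat_timeout - worst_case_silence


GC_PAUSE = 10  # seconds, roughly the Full GC pauses seen in the logs

# (heartbeat.timeout, heartbeat.interval): current defaults vs. proposal
for timeout, interval in ((50, 10), (15, 3)):
    headroom = heartbeat_headroom(GC_PAUSE, timeout, interval)
    print(f"timeout={timeout}s interval={interval}s -> headroom={headroom}s")
```

With these assumptions the current defaults leave a wide margin, while the proposed ones leave only a couple of seconds, and a pause slightly longer than those observed would give a negative headroom, that is, a false TaskManager timeout.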


Till Rohrmann wrote on Sat, Jul 17, 2021 at 1:05 AM:

> Hi everyone,
> Since Flink 1.5 we have the same heartbeat timeout and interval default
> values that are defined as heartbeat.timeout: 50s and heartbeat.interval:
> 10s. These values were mainly chosen to compensate for lengthy GC pauses
> and blocking operations that were executed in the main threads of Flink's
> components. Since then, there were quite some advancements wrt the JVM's
> GCs and we also got rid of a lot of blocking calls that were executed in
> the main thread. Moreover, a long heartbeat.timeout causes long recovery
> times in case of a TaskManager loss because the system can only properly
> recover after the dead TaskManager has been removed from the scheduler.
> Hence, I wanted to propose to change the timeout and interval to:
> heartbeat.timeout: 15s
> heartbeat.interval: 3s
> Since there is no perfect solution that fits all use cases, I would really
> like to hear from you what you think about it and how you configure these
> heartbeat options. Based on your experience we might actually come up with
> better default values that allow us to be resilient but also to detect
> failed components fast. FLIP-185 can be found here [1].
> [1]
> Cheers,
> Till
