tillrohrmann commented on pull request #16357: URL: https://github.com/apache/flink/pull/16357#issuecomment-877261051
Thanks for sharing your thoughts @zentol and @pltbkd. I think this is a very important discussion to have. In general we agree that we should try to speed up Flink's detection of failed/dead TaskManagers.

Currently, we only have the heartbeat mechanism for that. The idea of the current heartbeat mechanism is that the components send a heartbeat every `heartbeat.interval` ms. If one side does not receive a heartbeat for `heartbeat.timeout` ms, it marks the other side as dead. The way this is currently implemented, every heartbeat signal schedules a new timeout event `heartbeat.timeout` ms in the future and cancels the old one.

One idea for speeding up the failure detection is to decrease the default values for `interval` and `timeout` (note that `interval < timeout`). However, it is currently unclear how low we can go because this strongly depends on the stability of our users' network environments.

Another idea is to use additional signals. One example could be a signal from the underlying resource manager telling us that a `TaskExecutor` is dead (https://issues.apache.org/jira/browse/FLINK-23216). Another signal could be whether we could actually deliver our heartbeat to the target (information from the TCP stack about whether there is a connection and whether the message was lost). The connection/lost-message signal is the one this PR tries to make use of. Note that we cannot rely solely on this signal, because a remote `TaskExecutor` might block its main thread, which makes it unresponsive even though we can still deliver heartbeat requests to it (from the TCP perspective).

The question is whether this signal is more predictive than others. The connection/lost-message signal can also generate false positives because there could be a transient network split. This strongly depends on the actual user environment. Therefore, my suggestion would be to introduce a threshold for lost heartbeat messages before triggering a timeout and marking the `TaskExecutor` as dead.

Note that `TaskExecutors` will fail their tasks if there is a connection loss between `TaskExecutors`, and there is no retry for lost connections. Hence, if the user runs Flink in an unstable environment, she will sooner or later experience restarts because of the unstable network. One could therefore argue that the heartbeat mechanism could mark its target as dead after a single lost heartbeat message, because this would not produce more false positives than Flink's data layer already produces.

Of course, we can also think about more complex models where all the different signals are integrated to produce a confidence value, and once this value crosses some threshold we declare the target dead. We can also think about different retry strategies (e.g. changing the request interval based on the signal), but I think that simply having a threshold for lost heartbeat messages could already improve the situation for some expert users.
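
For reference, here is a minimal sketch of the timeout scheduling described above: every received heartbeat cancels the pending timeout and arms a fresh one `heartbeat.timeout` ms in the future. The class and method names are illustrative only, not Flink's actual `HeartbeatManager` API.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.TimeUnit;

/**
 * Sketch of the current behavior: each heartbeat reschedules the timeout.
 * If no heartbeat arrives within heartbeatTimeoutMs, the target is marked dead.
 */
public class HeartbeatMonitorSketch {

    private final long heartbeatTimeoutMs;
    private final Runnable onTimeout;
    private final ScheduledExecutorService scheduler =
            Executors.newSingleThreadScheduledExecutor();

    private ScheduledFuture<?> pendingTimeout;

    public HeartbeatMonitorSketch(long heartbeatTimeoutMs, Runnable onTimeout) {
        this.heartbeatTimeoutMs = heartbeatTimeoutMs;
        this.onTimeout = onTimeout;
        rescheduleTimeout(); // arm the initial timeout
    }

    /** Called whenever a heartbeat from the monitored target arrives. */
    public synchronized void reportHeartbeat() {
        rescheduleTimeout();
    }

    private synchronized void rescheduleTimeout() {
        if (pendingTimeout != null) {
            pendingTimeout.cancel(false); // cancel the old timeout event
        }
        // mark the target as dead if no further heartbeat arrives in time
        pendingTimeout = scheduler.schedule(onTimeout, heartbeatTimeoutMs, TimeUnit.MILLISECONDS);
    }
}
```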
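
And a rough sketch of the lost-heartbeat threshold suggested above (not the code in this PR): the RPC/transport layer reports each heartbeat request that could not be delivered, and only after a configurable number of consecutive failures is the target marked as dead; a successfully delivered heartbeat resets the counter. The method names and the threshold option are assumptions for illustration, not an existing Flink configuration.

```java
/**
 * Sketch of a threshold on undeliverable heartbeat requests. The regular
 * heartbeat.timeout would still apply independently; this only adds a faster
 * path based on the connection/lost-message signal. All names are illustrative.
 */
public class RpcFailureAwareMonitorSketch {

    private final int failureThreshold;     // hypothetical "lost heartbeat" threshold
    private final Runnable markTargetDead;  // callback triggering the usual unreachable handling

    private int consecutiveRpcFailures;

    public RpcFailureAwareMonitorSketch(int failureThreshold, Runnable markTargetDead) {
        this.failureThreshold = failureThreshold;
        this.markTargetDead = markTargetDead;
    }

    /** Heartbeat request was delivered; reset the failure counter. */
    public synchronized void reportHeartbeatRpcSuccess() {
        consecutiveRpcFailures = 0;
    }

    /** Heartbeat request could not be delivered (connection loss / lost message on the TCP level). */
    public synchronized void reportHeartbeatRpcFailure() {
        consecutiveRpcFailures++;
        if (consecutiveRpcFailures >= failureThreshold) {
            // enough consecutive delivery failures: treat the target as dead
            // without waiting for the full heartbeat.timeout to elapse
            markTargetDead.run();
        }
    }
}
```

A threshold of 1 would correspond to the "mark dead after a single lost heartbeat" variant discussed above, while higher values trade detection latency for robustness against transient network splits.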