tillrohrmann commented on pull request #16357:
URL: https://github.com/apache/flink/pull/16357#issuecomment-877261051


   Thanks for sharing your thoughts @zentol and @pltbkd. I think this is a very 
important discussion to have. In general we do agree that we should try to 
speed up Flink's detection of failed/dead TaskManagers. 
   
   Currently, we only have the heartbeat mechanism for that. The idea behind 
the heartbeat mechanism is that the components send a heartbeat every 
`heartbeat.interval` ms. If one side does not receive a heartbeat for 
`heartbeat.timeout` ms, then it marks the other side as dead.
   
   The way it is currently implemented is that every received heartbeat 
schedules a new timeout event `heartbeat.timeout` ms in the future and cancels 
the old one.
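
   The reschedule-on-heartbeat pattern described above can be sketched roughly 
like this (an illustrative sketch only; the class and method names are made up 
for the example and are not Flink's actual `HeartbeatManager` API):

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.TimeUnit;

// Hypothetical sketch of the timeout rescheduling described above:
// each received heartbeat cancels the pending timeout and arms a new one.
class HeartbeatMonitor {
    private final ScheduledExecutorService scheduler =
        Executors.newSingleThreadScheduledExecutor(r -> {
            Thread t = new Thread(r, "heartbeat-timeout");
            t.setDaemon(true);
            return t;
        });
    private final long timeoutMs;
    private final Runnable onTimeout;
    private ScheduledFuture<?> timeoutFuture;

    HeartbeatMonitor(long timeoutMs, Runnable onTimeout) {
        this.timeoutMs = timeoutMs;
        this.onTimeout = onTimeout;
        resetTimeout(); // arm the initial timeout
    }

    // Called whenever a heartbeat from the remote side arrives.
    synchronized void reportHeartbeat() {
        resetTimeout();
    }

    private synchronized void resetTimeout() {
        if (timeoutFuture != null) {
            timeoutFuture.cancel(false);
        }
        timeoutFuture =
            scheduler.schedule(onTimeout, timeoutMs, TimeUnit.MILLISECONDS);
    }
}
```

   The consequence of this design is that the target is only marked dead after 
a full `heartbeat.timeout` of silence, no matter what other signals are 
available in the meantime.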
   
   One idea for speeding up the failure detection is to decrease the default 
values for `interval` and `timeout`. Note that `interval < timeout` must hold. 
However, it is currently unclear how low we can go because this strongly 
depends on the stability of our users' network environments.
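
   For concreteness, this is how the two options look in `flink-conf.yaml` 
(values are in milliseconds; the defaults are 10000 for the interval and 50000 
for the timeout). The lowered values below are purely illustrative; how far 
one can actually go depends on the deployment's network:

```yaml
# Lower values detect dead TaskManagers faster, but risk
# false positives on unstable networks.
heartbeat.interval: 2000
heartbeat.timeout: 10000
```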
   
   Another idea would be to use additional signals. One example could be a 
signal from the underlying resource manager to tell us that a `TaskExecutor` is 
dead (https://issues.apache.org/jira/browse/FLINK-23216). Another signal could 
be whether we could actually deliver our heartbeat to the target (information 
from the TCP stack whether there is a connection and whether the message was 
lost).
   
   The connection/lost message signal is the one this PR tries to make use of. 
Note that we cannot solely rely on this signal because a remote `TaskExecutor` 
might block its main thread which makes it unresponsive but we could still 
deliver heartbeat requests to it (from the TCP perspective). The question is 
whether this signal is more predictive than others.
   
   The connection/lost message signal can also generate false positives 
because there could be a transient network split. This strongly depends on the 
actual user environment. Therefore, my suggestion would be to introduce a 
threshold for lost heartbeat messages before triggering a timeout/marking the 
`TaskExecutor` as dead.
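
   The threshold idea boils down to something like the following sketch (names 
are made up for illustration, not Flink code): count consecutive undeliverable 
heartbeats and only declare the target dead once the count reaches the 
threshold, so a single transient loss is tolerated.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Illustrative sketch: tolerate up to (threshold - 1) consecutive
// lost heartbeats before marking the target as dead.
class LostHeartbeatTracker {
    private final int threshold;
    private final AtomicInteger consecutiveLosses = new AtomicInteger();

    LostHeartbeatTracker(int threshold) {
        this.threshold = threshold;
    }

    // A heartbeat was delivered/acknowledged; reset the loss counter.
    void reportDelivered() {
        consecutiveLosses.set(0);
    }

    // The transport reported the heartbeat as undeliverable.
    // Returns true once the losses reach the threshold, i.e. the
    // target should now be marked as dead.
    boolean reportLost() {
        return consecutiveLosses.incrementAndGet() >= threshold;
    }
}
```

   With a threshold of 1 this degenerates to the "single lost message kills 
the target" behavior discussed below.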
   
   Note that `TaskExecutors` will fail their tasks if there is a connection 
loss between `TaskExecutors`. Moreover, there is no retry for lost 
connections. Hence, if the user runs Flink in an unstable environment, then she 
will sooner or later experience restarts because of the unstable network. 
Consequently, one could argue that the heartbeat mechanism could mark its 
target as dead after a single lost heartbeat message, because this won't 
produce more false positives than Flink's data layer already produces.
   
   Of course, we can also think about more complex models where all the 
different signals are integrated to produce a confidence value, and once that 
value crosses some threshold we say that the target is "dead". We can also 
think about different retry methods (e.g. changing the request interval based 
on the signal), but I think that simply having a threshold for lost heartbeat 
messages could already improve the situation for some expert users.
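
   To make the confidence-value idea slightly more concrete, here is a 
deliberately naive sketch; the signals, weights, and threshold are entirely 
made up, and a real model would need careful calibration:

```java
// Purely illustrative: combine several suspicion signals (each in [0, 1])
// into one confidence value and compare it against a death threshold.
class LivenessEstimator {
    private final double deathThreshold;

    LivenessEstimator(double deathThreshold) {
        this.deathThreshold = deathThreshold;
    }

    boolean isDead(double heartbeatTimeoutScore,
                   double resourceManagerScore,
                   double connectionLossScore) {
        // Weighted combination; the weights here are arbitrary.
        double suspicion = 0.5 * heartbeatTimeoutScore
                         + 0.3 * resourceManagerScore
                         + 0.2 * connectionLossScore;
        return suspicion >= deathThreshold;
    }
}
```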

