Till Rohrmann created FLINK-1581:
------------------------------------

             Summary: Configure DeathWatch parameters properly
                 Key: FLINK-1581
                 URL: https://issues.apache.org/jira/browse/FLINK-1581
             Project: Flink
          Issue Type: Bug
            Reporter: Till Rohrmann

We are using Akka's DeathWatch mechanism to detect failed components. However, the interval until an {{Instance}} is marked dead is currently very long. Especially in conjunction with the job restart mechanism, we should either devise a mechanism that detects dead {{Instance}}s quickly, or set the interval, pause and threshold values such that detection does not take longer than the Akka ask timeout. Otherwise, all retries might be consumed before an {{Instance}} is recognized as dead. Further investigation of the correct failure behavior is necessary.
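
To make the timing constraint concrete, here is a rough Python sketch of how the interval, pause and threshold values could be checked against the ask timeout. It is not Akka or Flink code. It uses a generic phi-accrual model (heartbeat gaps treated as normally distributed, with a logistic approximation of the normal tail), and whether that matches Akka's detector exactly is an assumption. The names {{phi}}, {{detection_time_ms}} and {{fits_ask_timeout}}, the standard deviation of 100 ms, and all numeric values in the usage part are hypothetical.

```python
import math


def phi(silence_ms, mean_ms, std_ms):
    """Suspicion level after `silence_ms` without a heartbeat.

    Models heartbeat gaps as normally distributed and uses a logistic
    approximation of the normal tail: phi = log10(1 + exp(g(y))).
    """
    y = (silence_ms - mean_ms) / std_ms
    g = y * (1.5976 + 0.070566 * y * y)
    if g > 700.0:
        # exp(g) would overflow; log10(1 + exp(g)) equals g / ln(10) here
        # to double precision.
        return g / math.log(10.0)
    return math.log1p(math.exp(g)) / math.log(10.0)


def detection_time_ms(interval_ms, pause_ms, threshold, std_ms=100.0, step_ms=10):
    """Smallest silence (a multiple of step_ms) at which phi reaches `threshold`."""
    # Assumption: the expected gap is the heartbeat interval plus the
    # tolerated pause; a real detector would estimate it from observed heartbeats.
    mean_ms = interval_ms + pause_ms
    silence_ms = 0
    while phi(silence_ms, mean_ms, std_ms) < threshold:
        silence_ms += step_ms
    return silence_ms


def fits_ask_timeout(interval_ms, pause_ms, threshold, ask_timeout_ms):
    """True if a dead peer would be flagged no later than the ask timeout."""
    return detection_time_ms(interval_ms, pause_ms, threshold) <= ask_timeout_ms


if __name__ == "__main__":
    # Hypothetical values for illustration, not Flink or Akka defaults.
    print(detection_time_ms(interval_ms=1000, pause_ms=10000, threshold=10.0))
    print(fits_ask_timeout(1000, 10000, 10.0, ask_timeout_ms=10000))
    print(fits_ask_timeout(1000, 4000, 10.0, ask_timeout_ms=10000))
```

Under this model the detection time is dominated by the sum of interval and pause, so a pause at or above the ask timeout can never satisfy the constraint, whatever the threshold. Any real tuning should be validated against the actual detector behavior.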