[ https://issues.apache.org/jira/browse/FLINK-17075?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17083964#comment-17083964 ]
Till Rohrmann commented on FLINK-17075: --------------------------------------- I agree that retries will decrease the likelihood of missed messages. However, I don't fully understand why this will guarantee that the JM and TM won't go out of sync. Can't it still happen that all retries (assuming finite number of retries) for a final state update will fail and, hence, the JM will never learn about the terminal state? Also failing the {{Task}} on the {{TaskExecutor}} won't guarantee that the {{FAILING}} state transition will be sent to the JM (it will simply increase the number of retries of {{updateTaskExecutionState}}). Note that if the heartbeats fail, then the JM will disconnect from the TM which has the effect to fail all {{Executions}}. I think the situation we are discussing here is that there is a short-lived network problem which can be recovered by Akka and does not trigger the heartbeat to fail. > Add task status reconciliation between TM and JM > ------------------------------------------------ > > Key: FLINK-17075 > URL: https://issues.apache.org/jira/browse/FLINK-17075 > Project: Flink > Issue Type: Improvement > Components: Runtime / Coordination > Affects Versions: 1.10.0, 1.11.0 > Reporter: Till Rohrmann > Priority: Critical > Fix For: 1.11.0 > > > In order to harden the TM and JM communication I suggest to let the > {{TaskExecutor}} send the task statuses back to the {{JobMaster}} as part of > the heartbeat payload (similar to FLINK-11059). This would allow to reconcile > the states of both components in case that a status update message was lost > as described by a user on the ML. > https://lists.apache.org/thread.html/ra9ed70866381f0ef0f4779633346722ccab3dc0d6dbacce04080b74e%40%3Cuser.flink.apache.org%3E -- This message was sent by Atlassian Jira (v8.3.4#803005)