[ https://issues.apache.org/jira/browse/FLINK-17075?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17083989#comment-17083989 ]
Zhu Zhu commented on FLINK-17075:
---------------------------------

I mean to not add a limit to the retry count. If the retry keeps failing, then the heartbeat notification should fail as well. So heartbeat timeout handling at JM side would also help to trigger a failover in this case, and the TM would not keep retrying indefinitely.

> Add task status reconciliation between TM and JM
> ------------------------------------------------
>
>                 Key: FLINK-17075
>                 URL: https://issues.apache.org/jira/browse/FLINK-17075
>             Project: Flink
>          Issue Type: Improvement
>          Components: Runtime / Coordination
>    Affects Versions: 1.10.0, 1.11.0
>            Reporter: Till Rohrmann
>            Priority: Critical
>             Fix For: 1.11.0
>
>
> In order to harden the TM and JM communication I suggest to let the
> {{TaskExecutor}} send the task statuses back to the {{JobMaster}} as part of
> the heartbeat payload (similar to FLINK-11059). This would allow to reconcile
> the states of both components in case that a status update message was lost
> as described by a user on the ML.
> https://lists.apache.org/thread.html/ra9ed70866381f0ef0f4779633346722ccab3dc0d6dbacce04080b74e%40%3Cuser.flink.apache.org%3E

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
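
For illustration only, the reconciliation idea in the issue description could be sketched roughly as follows. This is not Flink code: `TaskState`, `HeartbeatPayload`, `JobMasterView` and `reconcile` are hypothetical names, and the sketch assumes the JM simply adopts whatever state the TM reports in the heartbeat when the two views disagree.

```python
from dataclasses import dataclass, field
from enum import Enum


class TaskState(Enum):
    # Hypothetical subset of task states, for illustration only.
    RUNNING = "RUNNING"
    FINISHED = "FINISHED"
    FAILED = "FAILED"


@dataclass
class HeartbeatPayload:
    """Hypothetical heartbeat payload: the TM's current view of its tasks."""
    task_states: dict  # task id -> TaskState


@dataclass
class JobMasterView:
    """Hypothetical stand-in for the JM's bookkeeping of task states."""
    task_states: dict = field(default_factory=dict)

    def reconcile(self, payload):
        """Adopt the TM-reported state wherever it differs from the JM's view.

        Returns the sorted ids of the tasks whose state was corrected.
        """
        corrected = []
        for task_id, reported in payload.task_states.items():
            if self.task_states.get(task_id) != reported:
                self.task_states[task_id] = reported
                corrected.append(task_id)
        return sorted(corrected)


# The JM missed the status update saying that task-2 failed.
jm = JobMasterView({"task-1": TaskState.RUNNING, "task-2": TaskState.RUNNING})
payload = HeartbeatPayload(
    {"task-1": TaskState.RUNNING, "task-2": TaskState.FAILED}
)
corrected = jm.reconcile(payload)
```

In this sketch a lost status update is repaired by the next heartbeat, so the TM does not have to rely solely on retrying the original update message.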