[ https://issues.apache.org/jira/browse/FLINK-17075?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17084188#comment-17084188 ]
Till Rohrmann commented on FLINK-17075:
---------------------------------------

I think this could work. The downside I see is that it would complicate the {{updateTaskExecutionState}} method: depending on the reported state, one either retries or not. Additionally, we would have to manage the retry operations, because in case of a heartbeat failure we should stop them so that they do not continue indefinitely.

> Add task status reconciliation between TM and JM
> ------------------------------------------------
>
>                 Key: FLINK-17075
>                 URL: https://issues.apache.org/jira/browse/FLINK-17075
>             Project: Flink
>          Issue Type: Improvement
>          Components: Runtime / Coordination
>    Affects Versions: 1.10.0, 1.11.0
>            Reporter: Till Rohrmann
>            Priority: Critical
>             Fix For: 1.11.0
>
>
> In order to harden the TM and JM communication, I suggest letting the
> {{TaskExecutor}} send the task statuses back to the {{JobMaster}} as part of
> the heartbeat payload (similar to FLINK-11059). This would make it possible to
> reconcile the states of both components in case a status update message was
> lost, as described by a user on the ML.
> https://lists.apache.org/thread.html/ra9ed70866381f0ef0f4779633346722ccab3dc0d6dbacce04080b74e%40%3Cuser.flink.apache.org%3E

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
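
To make the reconciliation idea from the issue description concrete, here is a minimal sketch, not Flink's actual implementation. It assumes the heartbeat payload can be modelled as a map from an execution id to a state; {{TaskStatusReconciler}}, {{State}}, {{Correction}} and {{reconcile}} are hypothetical names invented for this illustration, and the real runtime types are considerably richer. The sketch only detects divergence between the two views; how the {{JobMaster}} reacts to each divergence is left open.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch: compares the JobMaster's view with statuses reported in a heartbeat. */
public class TaskStatusReconciler {

    /** Simplified stand-in for task execution states (assumption, not Flink's enum). */
    public enum State { DEPLOYING, RUNNING, FINISHED, FAILED, CANCELED }

    /** One divergence between the two views; a null state means "unknown on that side". */
    public static final class Correction {
        public final String executionId;
        public final State jobMasterState;
        public final State reportedState;

        Correction(String executionId, State jobMasterState, State reportedState) {
            this.executionId = executionId;
            this.jobMasterState = jobMasterState;
            this.reportedState = reportedState;
        }
    }

    private static boolean isTerminal(State s) {
        return s == State.FINISHED || s == State.FAILED || s == State.CANCELED;
    }

    /**
     * Returns every task whose state differs between the JobMaster's bookkeeping
     * and the statuses the TaskExecutor attached to its heartbeat.
     */
    public static List<Correction> reconcile(
            Map<String, State> jobMasterView, Map<String, State> heartbeatPayload) {
        List<Correction> corrections = new ArrayList<>();

        // Tasks the TaskExecutor reported: flag them if the JobMaster disagrees
        // or does not know them at all (for example, a lost status update).
        for (Map.Entry<String, State> reported : heartbeatPayload.entrySet()) {
            State known = jobMasterView.get(reported.getKey());
            if (known != reported.getValue()) {
                corrections.add(new Correction(reported.getKey(), known, reported.getValue()));
            }
        }

        // Tasks the JobMaster still considers active but the TaskExecutor no longer reports.
        for (Map.Entry<String, State> expected : jobMasterView.entrySet()) {
            if (!heartbeatPayload.containsKey(expected.getKey())
                    && !isTerminal(expected.getValue())) {
                corrections.add(new Correction(expected.getKey(), expected.getValue(), null));
            }
        }
        return corrections;
    }
}
```

Because the comparison runs on every heartbeat, a lost status update would be detected on the next heartbeat without a dedicated retry loop in {{updateTaskExecutionState}}, which is the trade-off discussed in the comment above.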