[ https://issues.apache.org/jira/browse/FLINK-17075?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17085677#comment-17085677 ]
Till Rohrmann commented on FLINK-17075:
---------------------------------------

1. If {{updateTaskExecutionState}} is triggered from the Actor's main thread, then there should be no inconsistency, because the heartbeats are also sent from the main thread.

2. I think it is fine to simply send the states of all currently running {{Tasks}}. If the JM sees that it has an {{Execution}} which is not contained in this set, then it can fail it. I think we don't have to ensure that the right state is transmitted via the heartbeats, since the heartbeat mainly acts as a safety net that keeps the job from never finishing and the cluster from deadlocking.

I actually think that a combination of both approaches would be best. As a first simple fix, we could add a limited number of retries. Next, we could establish the safety net which sends the current states of all running {{Tasks}}. If the JM sees that one of its tasks is not contained in this set, then it will fail it to trigger recovery.

> Add task status reconciliation between TM and JM
> ------------------------------------------------
>
>                 Key: FLINK-17075
>                 URL: https://issues.apache.org/jira/browse/FLINK-17075
>             Project: Flink
>          Issue Type: Improvement
>          Components: Runtime / Coordination
>    Affects Versions: 1.10.0, 1.11.0
>            Reporter: Till Rohrmann
>            Priority: Critical
>             Fix For: 1.11.0
>
>
> In order to harden the TM and JM communication, I suggest letting the
> {{TaskExecutor}} send the task statuses back to the {{JobMaster}} as part of
> the heartbeat payload (similar to FLINK-11059). This would make it possible
> to reconcile the states of both components in case a status update message
> was lost, as described by a user on the ML.
> https://lists.apache.org/thread.html/ra9ed70866381f0ef0f4779633346722ccab3dc0d6dbacce04080b74e%40%3Cuser.flink.apache.org%3E

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
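
For illustration only, the two ideas in the comment above (a limited number of retries for the status update, plus a heartbeat safety net that lets the JM fail executions missing from the reported set) could be sketched roughly as follows. This is a minimal sketch under stated assumptions, not Flink's implementation: the class {{TaskReconciliation}}, the enum {{TaskState}}, and the methods {{findMissingExecutions}} and {{sendWithRetries}} are hypothetical names, and execution ids are modelled as plain strings.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;
import java.util.function.BooleanSupplier;

/** Hypothetical sketch; none of these names are actual Flink APIs. */
final class TaskReconciliation {

    /** Simplified stand-in for the states a TaskExecutor might report. */
    enum TaskState { DEPLOYING, RUNNING }

    private TaskReconciliation() {}

    /**
     * Safety net: returns, in sorted order, the ids of executions that the
     * JobMaster expects on a TaskExecutor but that are absent from the task
     * states carried in that TaskExecutor's heartbeat. The caller would fail
     * these executions to trigger recovery.
     */
    static List<String> findMissingExecutions(
            Set<String> expectedOnTaskExecutor,
            Map<String, TaskState> reportedByHeartbeat) {
        // Copy into a sorted set so the input is not mutated and the
        // result order is deterministic.
        Set<String> missing = new TreeSet<>(expectedOnTaskExecutor);
        missing.removeAll(reportedByHeartbeat.keySet());
        return new ArrayList<>(missing);
    }

    /**
     * First simple fix: tries to send a status update up to maxAttempts times.
     * The supplier is assumed to return true once the update was acknowledged.
     * Returns false if every attempt failed (or if maxAttempts is not positive).
     */
    static boolean sendWithRetries(BooleanSupplier sendAttempt, int maxAttempts) {
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            if (sendAttempt.getAsBoolean()) {
                return true;
            }
        }
        return false;
    }
}
```

In this sketch the reported state itself is never compared; only membership in the reported set matters, which matches the point that the heartbeat does not have to carry the exact right state to serve as a safety net.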