[ https://issues.apache.org/jira/browse/FLINK-16986?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Jiangjie Qin updated FLINK-16986: --------------------------------- Parent: FLINK-10740 Issue Type: Sub-task (was: Bug) > Enhance the OperatorEvent handling guarantee during checkpointing. > ------------------------------------------------------------------ > > Key: FLINK-16986 > URL: https://issues.apache.org/jira/browse/FLINK-16986 > Project: Flink > Issue Type: Sub-task > Components: Runtime / Checkpointing > Reporter: Jiangjie Qin > Priority: Major > > When the {{CheckpointCoordinator}} takes a checkpoint, the checkpointing > order is following: > # {{CheckpointCoordinator}} triggers checkpoint on each > {{OperatorCoordinator}} > # Each {{OperatorCoordinator}} takes a snapshot. > # Right after taking the snapshot, the {{CheckpointCoordinator}} sends a > {{CHECKPOINT_FIN}} marker through the {{OperatorContext}}. > # Once the {{OperatorContext}} sees {{CHECKPOINT_FIN}} marker, it will wait > for all the previous events are acked and suspend the event gateway to the > operators by buffering the future {{OperatorEvents}} sent from the > {{OperatorCoordinator}} to the operators without actually sending them out. > # The {{CheckpointCoordinator}} waits until all the {{OperatorCoordinator}}s > finish step 2-4 and then triggers the task snapshots. > # The suspension of an event gateway to an operator can be lifted after all > the subtasks of that operator has finished their task checkpoint. > The mechanism above guarantees all the {{OperatorEvents}} sent before taking > the operator coordinator snapshot are handled by the operator before the task > snapshots are taken. > An operator can use this mechanism to know whether an {{OperatorEvent}} it > sent to the coordinator is included in the upcoming checkpoint or not. What > it has to do is to ask the operator coordinator to ACK that OperatorEvent. If > the ACK is received before the operator takes the next snapshot, that > OperatorEvent must have been handled and checkpointed by the > OperatorCoordinator. -- This message was sent by Atlassian Jira (v8.3.4#803005)