Jiangjie Qin created FLINK-26029:
------------------------------------

             Summary: Generalize the checkpoint protocol of OperatorCoordinator.
                 Key: FLINK-26029
                 URL: https://issues.apache.org/jira/browse/FLINK-26029
             Project: Flink
          Issue Type: Improvement
          Components: Runtime / Checkpointing
    Affects Versions: 1.14.3
            Reporter: Jiangjie Qin


Currently the JobManager (JM) opens all the event valves from the 
OperatorCoordinator to the subtasks after the checkpoint barriers are sent to 
the Source subtasks. While this works for Source operators, it unnecessarily 
limits the use of the OperatorCoordinator for other operators.

To generalize the protocol, we can change the JM to open the event valve of 
each subtask only once that subtask has finished its local checkpoint. The 
protocol would then become the following:
 # Let the OC finish processing all the incoming OperatorEvents before the 
snapshot.
 # Wait until all the outgoing OperatorEvents before the snapshot are sent and 
acked.
 # Shut the event valve so no outgoing events can be sent to the subtasks.
 # Send checkpoint barriers to the Source operators.
 # Open the corresponding event valve of a subtask when the 
AcknowledgeCheckpoint message from that subtask is received.
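
The valve handling in steps 3 and 5 can be illustrated with a minimal sketch. 
It is a toy model, not Flink's actual implementation: the class 
PerSubtaskValves and all of its methods are hypothetical names, and it assumes 
that events sent while a valve is shut are buffered and released in order once 
the valve reopens. Steps 1, 2 and 4 (draining incoming events, waiting for 
acks, sending barriers) are outside the model.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Queue;

/** Hypothetical model of per-subtask event valves; not a Flink class. */
final class PerSubtaskValves {
    private final boolean[] shut;
    private final List<Queue<String>> buffered = new ArrayList<>();
    private final List<List<String>> delivered = new ArrayList<>();

    PerSubtaskValves(int parallelism) {
        shut = new boolean[parallelism]; // all valves start open
        for (int i = 0; i < parallelism; i++) {
            buffered.add(new ArrayDeque<>());
            delivered.add(new ArrayList<>());
        }
    }

    /** Step 3: shut every valve before the barriers are sent. */
    void shutAll() {
        Arrays.fill(shut, true);
    }

    /** Sends an event; assumption: it is buffered while the valve is shut. */
    void send(int subtask, String event) {
        if (shut[subtask]) {
            buffered.get(subtask).add(event);
        } else {
            delivered.get(subtask).add(event);
        }
    }

    /** Step 5: reopen only the valve of the subtask that acknowledged. */
    void onAcknowledgeCheckpoint(int subtask) {
        shut[subtask] = false;
        Queue<String> pending = buffered.get(subtask);
        while (!pending.isEmpty()) {
            delivered.get(subtask).add(pending.poll());
        }
    }

    boolean isShut(int subtask) {
        return shut[subtask];
    }

    List<String> deliveredTo(int subtask) {
        return delivered.get(subtask);
    }
}
```

In this model, an acknowledgement from subtask 0 releases only the events 
buffered for subtask 0. Subtask 1 keeps its valve shut until its own 
acknowledgement arrives, which is the per-subtask behaviour the proposal asks 
for.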



--
This message was sent by Atlassian Jira
(v8.20.1#820001)
