Stephan Ewen created FLINK-16357:
------------------------------------

             Summary: Extend Checkpoint Coordinator to differentiate between 
"regional restore" and "full restore".
                 Key: FLINK-16357
                 URL: https://issues.apache.org/jira/browse/FLINK-16357
             Project: Flink
          Issue Type: Sub-task
          Components: Runtime / Checkpointing
            Reporter: Stephan Ewen
             Fix For: 1.11.0


The {{ExecutionGraph}} distinguishes between a "global failure" (the entire 
execution graph is failed and restarted) and a "regional failure" (only a region 
of tasks connected by transient pipelined data exchanges is recovered).
Regional failover is the common recovery path. Global failover is a safety net 
for unexpected failures or inconsistencies, because a full reset of the 
{{ExecutionGraph}} resolves most inconsistencies.

The {{OperatorCoordinator}}s should be reset to a checkpoint only in the "global 
failover" case. In the "regional failover" case, they are merely notified of the 
tasks that are reset; they keep their internal state and adjust it for the 
failed tasks.

To implement that, the {{ExecutionGraph}} needs to tell the Checkpoint 
Coordinator whether it is restoring from a "regional failure" or from a 
"global failure".
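
The distinction described above could be sketched roughly as follows. This is 
only an illustration, not Flink's actual API: the {{RestoreScope}} enum, the 
simplified {{Coordinator}} interface, and the {{restore}} method are all 
hypothetical names chosen for this example. The sketch assumes the restore 
path receives a flag from the execution graph and uses it to decide whether a 
coordinator is reset to checkpointed state or only notified of restarted tasks.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class RestoreScopeSketch {

    /** Hypothetical flag the execution graph would pass along with a restore request. */
    enum RestoreScope { REGIONAL, GLOBAL }

    /** Hypothetical, simplified stand-in for an operator coordinator. */
    interface Coordinator {
        void resetToCheckpoint(byte[] checkpointData);
        void subtaskFailed(int subtaskIndex);
    }

    /** Records the calls it receives so the dispatch decision is observable. */
    static final class RecordingCoordinator implements Coordinator {
        final List<String> log = new ArrayList<>();

        @Override
        public void resetToCheckpoint(byte[] checkpointData) {
            log.add("resetToCheckpoint:" + checkpointData.length);
        }

        @Override
        public void subtaskFailed(int subtaskIndex) {
            log.add("subtaskFailed:" + subtaskIndex);
        }
    }

    /**
     * Hypothetical restore path: a global restore resets the coordinator to the
     * checkpoint, a regional restore only reports the restarted subtasks so the
     * coordinator can keep and adjust its internal state.
     */
    static void restore(
            RestoreScope scope,
            List<Integer> restartedSubtasks,
            byte[] checkpointData,
            Coordinator coordinator) {
        if (scope == RestoreScope.GLOBAL) {
            coordinator.resetToCheckpoint(checkpointData);
        } else {
            for (int subtask : restartedSubtasks) {
                coordinator.subtaskFailed(subtask);
            }
        }
    }

    public static void main(String[] args) {
        byte[] checkpointData = new byte[] {1, 2, 3};

        RecordingCoordinator regional = new RecordingCoordinator();
        restore(RestoreScope.REGIONAL, Arrays.asList(1, 2), checkpointData, regional);
        System.out.println(regional.log); // [subtaskFailed:1, subtaskFailed:2]

        RecordingCoordinator global = new RecordingCoordinator();
        restore(RestoreScope.GLOBAL, Arrays.asList(0, 1, 2), checkpointData, global);
        System.out.println(global.log); // [resetToCheckpoint:3]
    }
}
```

Passing the scope as an explicit argument, rather than inferring it from the 
set of restarted tasks, keeps the two cases apart even when a regional failover 
happens to cover every task.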



--
This message was sent by Atlassian Jira
(v8.3.4#803005)