Till Rohrmann created FLINK-5667:
------------------------------------

             Summary: Possible state data loss when task fails while checkpointing
                 Key: FLINK-5667
                 URL: https://issues.apache.org/jira/browse/FLINK-5667
             Project: Flink
          Issue Type: Bug
          Components: State Backends, Checkpointing
    Affects Versions: 1.2.0, 1.3.0
            Reporter: Till Rohrmann
            Assignee: Till Rohrmann
            Priority: Critical
             Fix For: 1.2.0, 1.3.0


It is possible that Flink loses state data when a {{Task}} fails while a 
checkpoint is being drawn. The scenario is the following:

Flink has finished the synchronous part of the checkpoint and starts the 
asynchronous part by creating an {{AsyncCheckpointRunnable}} and submitting it 
to an {{Executor}}. This runnable is also registered at the closeable registry. 
If the {{Task}} now fails before the {{AsyncCheckpointRunnable}} has completed, 
the runnable is closed because it is registered in the closeable registry. 
Closing it discards all state handles and cancels all runnable state futures. 
However, it does not stop the runnable from sending an acknowledgement message 
to the {{CheckpointCoordinator}}.

If this message completes the pending checkpoint, the checkpoint is 
transformed into a {{CompletedCheckpoint}} that is faulty, because some of its 
data has already been deleted. Depending on Flink's configuration (for example, 
how many completed checkpoints are retained), registering the faulty checkpoint 
discards older completed checkpoints, and thus state data is lost.
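
One way to picture a guard against this race is sketched below. It is only an 
illustration and not necessarily how the issue is fixed in Flink: the names 
{{CheckpointAckSketch}}, {{GuardedAsyncCheckpoint}}, {{State}} and {{ackSent}} 
are hypothetical, and the acknowledgement and the discarding of state handles 
are reduced to plain callbacks. The idea is that closing and completing compete 
for a single atomic state transition, so a runnable that has been closed can no 
longer acknowledge.

```java
import java.io.Closeable;
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical sketch; none of these types are Flink classes.
class CheckpointAckSketch {

    enum State { RUNNING, DISCARDED, COMPLETED }

    /** Stand-in for the asynchronous part of a checkpoint. */
    static final class GuardedAsyncCheckpoint implements Runnable, Closeable {
        private final AtomicReference<State> state =
                new AtomicReference<>(State.RUNNING);
        private final Runnable acknowledge;   // assumed: notifies the coordinator
        private final Runnable discardState;  // assumed: deletes the state handles

        GuardedAsyncCheckpoint(Runnable acknowledge, Runnable discardState) {
            this.acknowledge = acknowledge;
            this.discardState = discardState;
        }

        @Override
        public void run() {
            // ... the asynchronous state would be materialized here ...
            // Acknowledge only if nobody closed this runnable in the meantime.
            if (state.compareAndSet(State.RUNNING, State.COMPLETED)) {
                acknowledge.run();
            }
        }

        @Override
        public void close() {
            // Discard only if the checkpoint has not been acknowledged yet.
            if (state.compareAndSet(State.RUNNING, State.DISCARDED)) {
                discardState.run();
            }
        }
    }

    /** Replays one ordering of events and reports whether an ack was sent. */
    static boolean ackSent(boolean taskFailsFirst) {
        AtomicBoolean acked = new AtomicBoolean(false);
        GuardedAsyncCheckpoint checkpoint =
                new GuardedAsyncCheckpoint(() -> acked.set(true), () -> { });
        if (taskFailsFirst) {
            checkpoint.close(); // the failing task closes the registry entry
        }
        checkpoint.run();
        return acked.get();
    }

    public static void main(String[] args) {
        System.out.println(ackSent(true));  // false: closed before completion
        System.out.println(ackSent(false)); // true: completed normally
    }
}
```

With such a guard, whichever of {{close()}} and the completion step wins the 
transition out of {{RUNNING}} decides the outcome, so the coordinator would 
never receive an acknowledgement for state that has already been discarded.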


