[ https://issues.apache.org/jira/browse/FLINK-4120?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Gyula Fora updated FLINK-4120: ------------------------------ Component/s: (was: Core) State Backends, Checkpointing > Lightweight fault tolerance through recomputing lost state > ---------------------------------------------------------- > > Key: FLINK-4120 > URL: https://issues.apache.org/jira/browse/FLINK-4120 > Project: Flink > Issue Type: New Feature > Components: State Backends, Checkpointing > Reporter: Dénes Vadász > > The current fault tolerance mechanism requires that stateful operators write > their internal state to stable storage during a checkpoint. > This proposal aims at optimizing out this operation in the cases where the > operator state can be recomputed from a finite (practically: small) set of > source records, and those records are already on checkpoint-aware persistent > storage (e.g. in Kafka). > The rationale behind the proposal is that the cost of reprocessing is paid > only on recovery from (presumably rare) failures, whereas the cost of > persisting the state is paid on every checkpoint. Eliminating the need for > persistent storage will also simplify system setup and operation. > In the cases where this optimisation is applicable, the state of the > operators can be restored by restarting the pipeline from a checkpoint taken > before the pipeline ingested any of the records required for the state > re-computation of the operators (call this the "null-state checkpoint"), as > opposed to a restart from the "latest checkpoint". > The "latest checkpoint" is still relevant for the recovery: the barriers > belonging to that checkpoint must be inserted into the source streams in the > position they were originally inserted. Sinks must discard all records until > this barrier reaches them. > Note the inherent relationship between the "latest" and the "null-state" > checkpoints: the pipeline must be restarted from the latter to restore the > state at the former. > For the stateful operators for which this optimization is applicable we can > define the notion of "current null-state watermark" as the watermark such > that the operator can correctly (re)compute its current state merely from > records after this watermark. > > For the checkpoint-coordinator to be able to compute the null-state > checkpoint, each stateful operator should report its "current null-state > watermark" as part of acknowledging the ongoing checkpoint. The null-state > checkpoint of the ongoing checkpoint is the most recent checkpoint preceding > all the received null-state watermarks (assuming the pipeline preserves the > relative order of barriers and watermarks). -- This message was sent by Atlassian JIRA (v6.3.4#6332)