[jira] [Updated] (FLINK-4120) Lightweight fault tolerance through recomputing lost state

Gyula Fora (JIRA) Mon, 27 Jun 2016 01:19:37 -0700

     [ 
https://issues.apache.org/jira/browse/FLINK-4120?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Gyula Fora updated FLINK-4120:
------------------------------
    Component/s:     (was: Core)
                 State Backends, Checkpointing

> Lightweight fault tolerance through recomputing lost state
> ----------------------------------------------------------
>
>                 Key: FLINK-4120
>                 URL: https://issues.apache.org/jira/browse/FLINK-4120
>             Project: Flink
>          Issue Type: New Feature
>          Components: State Backends, Checkpointing
>            Reporter: Dénes Vadász
>
> The current fault tolerance mechanism requires that stateful operators write 
> their internal state to stable storage during a checkpoint. 
> This proposal aims at optimizing out this operation in the cases where the 
> operator state can be recomputed from a finite (practically: small) set of 
> source records, and those records are already on checkpoint-aware persistent 
> storage (e.g. in Kafka). 
> The rationale behind the proposal is that the cost of reprocessing is paid 
> only on recovery from (presumably rare) failures, whereas the cost of 
> persisting the state is paid on every checkpoint. Eliminating the need for 
> persistent storage will also simplify system setup and operation.
> In the cases where this optimisation is applicable, the state of the 
> operators can be restored by restarting the pipeline from a checkpoint taken 
> before the pipeline ingested any of the records required for the state 
> re-computation of the operators (call this the "null-state checkpoint"), as 
> opposed to a restart from the "latest checkpoint". 
> The "latest checkpoint" is still relevant for the recovery: the barriers 
> belonging to that checkpoint must be inserted into the source streams in the 
> position they were originally inserted. Sinks must discard all records until 
> this barrier reaches them.
> Note the inherent relationship between the "latest" and the "null-state" 
> checkpoints: the pipeline must be restarted from the latter to restore the 
> state at the former.
> For the stateful operators for which this optimization is applicable we can 
> define the notion of "current null-state watermark" as the watermark such 
> that the operator can correctly (re)compute its current state merely from 
> records after this watermark. 
>  
> For the checkpoint-coordinator to be able to compute the null-state 
> checkpoint, each stateful operator should report its "current null-state 
> watermark" as part of acknowledging the ongoing checkpoint. The null-state 
> checkpoint of the ongoing checkpoint is the most recent checkpoint preceding 
> all the received null-state watermarks (assuming the pipeline preserves the 
> relative order of barriers and watermarks).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[jira] [Updated] (FLINK-4120) Lightweight fault tolerance through recomputing lost state

Reply via email to