Hi Stravos, I haven't implemented our checkpointing mechanism and I didn't participate in the design decisions while implementing it, so I can not compare it in detail to other approaches.
>From a "does it work perspective": Checkpoints are only confirmed if all parallel subtasks successfully created a valid snapshot of the state. So if there is a failure in the checkpointing mechanism, no valid checkpoint will be created. The system will recover from the last valid checkpoint. There is a timeout for checkpoints. So if a barrier doesn't pass through the system for a certain period of time, the checkpoint is cancelled. The default timeout is 10 minutes. Regards, Robert On Mon, May 16, 2016 at 1:22 PM, Stavros Kontopoulos < st.kontopou...@gmail.com> wrote: > Hi, > > I was looking into the flink snapshotting algorithm details also mentioned > here: > > http://data-artisans.com/high-throughput-low-latency-and-exactly-once-stream-processing-with-apache-flink/ > > https://blog.acolyer.org/2015/08/19/asynchronous-distributed-snapshots-for-distributed-dataflows/ > > http://mail-archives.apache.org/mod_mbox/flink-user/201601.mbox/%3CCANC1h_s6MCWSuDf2zSnEeD66LszDoLx0jt64++0kBOKTjkAv7w%40mail.gmail.com%3E > > http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/About-exactly-once-question-td2545.html > > From other sources i understand that it assumes no failures to work for > message delivery or for example a process hanging for ever: > https://en.wikipedia.org/wiki/Snapshot_algorithm > > https://blog.acolyer.org/2015/04/22/distributed-snapshots-determining-global-states-of-distributed-systems/ > > So my understanding (maybe wrong) is that this is a solution which seems > not to address the fault tolerance issue in a strong manner like for > example if it was to use a 3pc protocol for local state propagation and > global agreement. I know the latter is not efficient just mentioning it for > comparison. > > How the algorithm behaves in practical terms under the presence of its own > failures (this is a background process collecting partial states)? Are > there timeouts for reaching a barrier? > > PS. have not looked deep into the code details yet, planning to. > > Best, > Stavros > >