flink snapshotting fault-tolerance

Hi,

I was looking into the flink snapshotting algorithm details also mentioned
here:
http://data-artisans.com/high-throughput-low-latency-and-exactly-once-stream-processing-with-apache-flink/
https://blog.acolyer.org/2015/08/19/asynchronous-distributed-snapshots-for-distributed-dataflows/
http://mail-archives.apache.org/mod_mbox/flink-user/201601.mbox/%3CCANC1h_s6MCWSuDf2zSnEeD66LszDoLx0jt64++0kBOKTjkAv7w%40mail.gmail.com%3E
http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/About-exactly-once-question-td2545.html


>From other sources i understand that it assumes no failures to work for
message delivery or for example a process hanging for ever:
https://en.wikipedia.org/wiki/Snapshot_algorithm
https://blog.acolyer.org/2015/04/22/distributed-snapshots-determining-global-states-of-distributed-systems/

So my understanding (maybe wrong) is that this is a solution which seems
not to address the fault tolerance issue in a strong manner like for
example if it was to use a 3pc protocol for local state propagation and
global agreement. I know the latter is not efficient just mentioning it for
comparison.

How the algorithm behaves in practical terms under the presence of its own
failures (this is a background process collecting partial states)? Are
there timeouts for reaching a barrier?

PS. have not looked deep into the code details yet, planning to.

Best,
Stavros

flink snapshotting fault-tolerance

Reply via email to