Hi, I was looking into the flink snapshotting algorithm details also mentioned here: http://data-artisans.com/high-throughput-low-latency-and-exactly-once-stream-processing-with-apache-flink/ https://blog.acolyer.org/2015/08/19/asynchronous-distributed-snapshots-for-distributed-dataflows/ http://mail-archives.apache.org/mod_mbox/flink-user/201601.mbox/%3CCANC1h_s6MCWSuDf2zSnEeD66LszDoLx0jt64++0kBOKTjkAv7w%40mail.gmail.com%3E http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/About-exactly-once-question-td2545.html
>From other sources i understand that it assumes no failures to work for message delivery or for example a process hanging for ever: https://en.wikipedia.org/wiki/Snapshot_algorithm https://blog.acolyer.org/2015/04/22/distributed-snapshots-determining-global-states-of-distributed-systems/ So my understanding (maybe wrong) is that this is a solution which seems not to address the fault tolerance issue in a strong manner like for example if it was to use a 3pc protocol for local state propagation and global agreement. I know the latter is not efficient just mentioning it for comparison. How the algorithm behaves in practical terms under the presence of its own failures (this is a background process collecting partial states)? Are there timeouts for reaching a barrier? PS. have not looked deep into the code details yet, planning to. Best, Stavros