Hi all! We just published a blog post about how streaming fault tolerance mechanisms evolved, and what kind of performance Flink gets with its checkpointing mechanism.
I think it is a pretty interesting read for people who are interested in Flink or data streaming in general.

The blog post talks about:

- Fault tolerance techniques, starting from acknowledgements, moving through micro-batches, to transactional updates and distributed snapshots.
- Performance of Flink: throughput, latency, and tradeoffs.
- A "chaos monkey" experiment in which the computation remains strongly consistent even while workers are periodically killed.

Comments welcome!

Greetings,
Stephan