Flink offers a fault-tolerance mechanism to guarantee at-least-once and exactly-once message delivery in case of failures. The mechanism works well in practice and makes Flink stand out among stream processing systems.
But the guarantee on at-least-once and exactly-once delivery does not come without price. It typically requires to restart multiple tasks and fall back to the place where the last checkpoint is taken. (Fined-grained recovery can help alleviate the cost, but it still needs certain efforts to recover jobs.) In some senarios, users perfer quick recovery and will trade correctness off. For example, in some online recommendation systems, timeliness is far more important than consistency. In such cases, we can restart only those failed tasks individually, and do not need to perform any rollback. Though some messages delivered to failed tasks may be lost, other tasks can continuously provide service to users. Many of our users are demanding for at-most-once delivery in Flink. What do you think of the proposal? Any feedback is appreciated. Regards, Xiaogang Shi