Flink offers a fault-tolerance mechanism to guarantee at-least-once and
exactly-once message delivery in case of failures. The mechanism works well
in practice and makes Flink stand out among stream processing systems.

But the guarantee on at-least-once and exactly-once delivery does not come
without price. It typically requires to restart multiple tasks and fall
back to the place where the last checkpoint is taken. (Fined-grained
recovery can help alleviate the cost, but it still needs certain efforts to
recover jobs.)

In some senarios, users perfer quick recovery and will trade correctness
off. For example, in some online recommendation systems, timeliness is far
more important than consistency. In such cases, we can restart only those
failed tasks individually, and do not need to perform any rollback. Though
some messages delivered to failed tasks may be lost, other tasks can
continuously provide service to users.

Many of our users are demanding for at-most-once delivery in Flink. What do
you think of the proposal? Any feedback is appreciated.

Regards,
Xiaogang Shi

Reply via email to