Hi Xiaogang, It sounds interesting and definitely a useful feature, however the questions for me would be how useful, how much effort would it require and is it worth it? We simply can not do all things at once, and currently people that could review/drive/mentor this effort are pretty much strained :( For me one would have to investigate answers to those questions and prioritise it compared to other ongoing efforts, before I could vote +1 for this.
Couple of things to consider: - would it be only a job manager/failure region recovery feature? - would it require changes in CheckpointBarrierHandler, CheckpointCoordinator classes? - with `at-most-once` semantic theoretically speaking we could just drop the current `CheckpointBarrier` handling/injecting code and avoid all of the checkpoint alignment issues - we could just checkpoint all of the tasks independently of one another. However maybe that could be a follow up optimisation step? Piotrek > On 11 Jun 2019, at 10:53, Zili Chen <wander4...@gmail.com> wrote: > > Hi Xiaogang, > > It is an interesting topic. > > Notice that there is some effort to build a mature mllib of flink these > days, it could be also possible for some ml cases trade off correctness for > timeliness or throughput. Excatly-once delivery excatly makes flink stand > out but an at-most-once option would adapt flink to more scenarios. > > Best, > tison. > > > SHI Xiaogang <shixiaoga...@gmail.com> 于2019年6月11日周二 下午4:33写道: > >> Flink offers a fault-tolerance mechanism to guarantee at-least-once and >> exactly-once message delivery in case of failures. The mechanism works well >> in practice and makes Flink stand out among stream processing systems. >> >> But the guarantee on at-least-once and exactly-once delivery does not come >> without price. It typically requires to restart multiple tasks and fall >> back to the place where the last checkpoint is taken. (Fined-grained >> recovery can help alleviate the cost, but it still needs certain efforts to >> recover jobs.) >> >> In some senarios, users perfer quick recovery and will trade correctness >> off. For example, in some online recommendation systems, timeliness is far >> more important than consistency. In such cases, we can restart only those >> failed tasks individually, and do not need to perform any rollback. Though >> some messages delivered to failed tasks may be lost, other tasks can >> continuously provide service to users. >> >> Many of our users are demanding for at-most-once delivery in Flink. What do >> you think of the proposal? Any feedback is appreciated. >> >> Regards, >> Xiaogang Shi >>