+1 from my side to support this feature in Flink. Best, Vino
Biao Liu <mmyy1...@gmail.com> 于2019年6月11日周二 下午6:14写道: > Hi Piotrek, > I agree with you that there are strained resources of community to support > such a feature. I was planing to start a similar discussion after 1.9 > released. Anyway we don't have enough time to support this feature now, but > I think a discussion is fine. > It's very interesting of your checkpoint semantic question. I think it is > worth to support however it might not be a small modification. > > There is also a big gap need to discuss. Currently the network error > handling is tightly coupled with task failover strategy. There is a typical > scenario, if a TM is crashed, all the tasks of TMs connected with the > failed TM would fail automatically. In our internal implementation, this is > the biggest part to support Best-effort failover strategy. > > > Piotr Nowojski <pi...@ververica.com> 于2019年6月11日周二 下午5:31写道: > > > Hi Xiaogang, > > > > It sounds interesting and definitely a useful feature, however the > > questions for me would be how useful, how much effort would it require > and > > is it worth it? We simply can not do all things at once, and currently > > people that could review/drive/mentor this effort are pretty much > strained > > :( For me one would have to investigate answers to those questions and > > prioritise it compared to other ongoing efforts, before I could vote +1 > for > > this. > > > > Couple of things to consider: > > - would it be only a job manager/failure region recovery feature? > > - would it require changes in CheckpointBarrierHandler, > > CheckpointCoordinator classes? > > - with `at-most-once` semantic theoretically speaking we could just drop > > the current `CheckpointBarrier` handling/injecting code and avoid all of > > the checkpoint alignment issues - we could just checkpoint all of the > tasks > > independently of one another. However maybe that could be a follow up > > optimisation step? > > > > Piotrek > > > > > On 11 Jun 2019, at 10:53, Zili Chen <wander4...@gmail.com> wrote: > > > > > > Hi Xiaogang, > > > > > > It is an interesting topic. > > > > > > Notice that there is some effort to build a mature mllib of flink these > > > days, it could be also possible for some ml cases trade off correctness > > for > > > timeliness or throughput. Excatly-once delivery excatly makes flink > stand > > > out but an at-most-once option would adapt flink to more scenarios. > > > > > > Best, > > > tison. > > > > > > > > > SHI Xiaogang <shixiaoga...@gmail.com> 于2019年6月11日周二 下午4:33写道: > > > > > >> Flink offers a fault-tolerance mechanism to guarantee at-least-once > and > > >> exactly-once message delivery in case of failures. The mechanism works > > well > > >> in practice and makes Flink stand out among stream processing systems. > > >> > > >> But the guarantee on at-least-once and exactly-once delivery does not > > come > > >> without price. It typically requires to restart multiple tasks and > fall > > >> back to the place where the last checkpoint is taken. (Fined-grained > > >> recovery can help alleviate the cost, but it still needs certain > > efforts to > > >> recover jobs.) > > >> > > >> In some senarios, users perfer quick recovery and will trade > correctness > > >> off. For example, in some online recommendation systems, timeliness is > > far > > >> more important than consistency. In such cases, we can restart only > > those > > >> failed tasks individually, and do not need to perform any rollback. > > Though > > >> some messages delivered to failed tasks may be lost, other tasks can > > >> continuously provide service to users. > > >> > > >> Many of our users are demanding for at-most-once delivery in Flink. > > What do > > >> you think of the proposal? Any feedback is appreciated. > > >> > > >> Regards, > > >> Xiaogang Shi > > >> > > > > >