Re: [DISCUSS] Allow at-most-once delivery in case of failures

Biao Liu Tue, 11 Jun 2019 03:15:03 -0700

Hi Piotrek,
I agree with you that there are strained resources of community to support
such a feature. I was planing to start a similar discussion after 1.9
released. Anyway we don't have enough time to support this feature now, but
I think a discussion is fine.
It's very interesting of your checkpoint semantic question. I think it is
worth to support however it might not be a small modification.


There is also a big gap need to discuss. Currently the network error
handling is tightly coupled with task failover strategy. There is a typical
scenario, if a TM is crashed, all the tasks of TMs connected with the
failed TM would fail automatically. In our internal implementation, this is
the biggest part to support Best-effort failover strategy.


Piotr Nowojski <pi...@ververica.com> 于2019年6月11日周二 下午5:31写道：

> Hi Xiaogang,
>
> It sounds interesting and definitely a useful feature, however the
> questions for me would be how useful, how much effort would it require and
> is it worth it? We simply can not do all things at once, and currently
> people that could review/drive/mentor this effort are pretty much strained
> :( For me one would have to investigate answers to those questions and
> prioritise it compared to other ongoing efforts, before I could vote +1 for
> this.
>
> Couple of things to consider:
> - would it be only a job manager/failure region recovery feature?
> - would it require changes in CheckpointBarrierHandler,
> CheckpointCoordinator classes?
> - with `at-most-once` semantic theoretically speaking we could just drop
> the current `CheckpointBarrier` handling/injecting code and avoid all of
> the checkpoint alignment issues - we could just checkpoint all of the tasks
> independently of one another. However maybe that could be a follow up
> optimisation step?
>
> Piotrek
>
> > On 11 Jun 2019, at 10:53, Zili Chen <wander4...@gmail.com> wrote:
> >
> > Hi Xiaogang,
> >
> > It is an interesting topic.
> >
> > Notice that there is some effort to build a mature mllib of flink these
> > days, it could be also possible for some ml cases trade off correctness
> for
> > timeliness or throughput. Excatly-once delivery excatly makes flink stand
> > out but an at-most-once option would adapt flink to more scenarios.
> >
> > Best,
> > tison.
> >
> >
> > SHI Xiaogang <shixiaoga...@gmail.com> 于2019年6月11日周二 下午4:33写道：
> >
> >> Flink offers a fault-tolerance mechanism to guarantee at-least-once and
> >> exactly-once message delivery in case of failures. The mechanism works
> well
> >> in practice and makes Flink stand out among stream processing systems.
> >>
> >> But the guarantee on at-least-once and exactly-once delivery does not
> come
> >> without price. It typically requires to restart multiple tasks and fall
> >> back to the place where the last checkpoint is taken. (Fined-grained
> >> recovery can help alleviate the cost, but it still needs certain
> efforts to
> >> recover jobs.)
> >>
> >> In some senarios, users perfer quick recovery and will trade correctness
> >> off. For example, in some online recommendation systems, timeliness is
> far
> >> more important than consistency. In such cases, we can restart only
> those
> >> failed tasks individually, and do not need to perform any rollback.
> Though
> >> some messages delivered to failed tasks may be lost, other tasks can
> >> continuously provide service to users.
> >>
> >> Many of our users are demanding for at-most-once delivery in Flink.
> What do
> >> you think of the proposal? Any feedback is appreciated.
> >>
> >> Regards,
> >> Xiaogang Shi
> >>
>
>

Re: [DISCUSS] Allow at-most-once delivery in case of failures

Reply via email to