+1 from my side to support this feature in Flink.

Best,
Vino

Biao Liu <mmyy1...@gmail.com> 于2019年6月11日周二 下午6:14写道:

> Hi Piotrek,
> I agree with you that there are strained resources of community to support
> such a feature. I was planing to start a similar discussion after 1.9
> released. Anyway we don't have enough time to support this feature now, but
> I think a discussion is fine.
> It's very interesting of your checkpoint semantic question. I think it is
> worth to support however it might not be a small modification.
>
> There is also a big gap need to discuss. Currently the network error
> handling is tightly coupled with task failover strategy. There is a typical
> scenario, if a TM is crashed, all the tasks of TMs connected with the
> failed TM would fail automatically. In our internal implementation, this is
> the biggest part to support Best-effort failover strategy.
>
>
> Piotr Nowojski <pi...@ververica.com> 于2019年6月11日周二 下午5:31写道:
>
> > Hi Xiaogang,
> >
> > It sounds interesting and definitely a useful feature, however the
> > questions for me would be how useful, how much effort would it require
> and
> > is it worth it? We simply can not do all things at once, and currently
> > people that could review/drive/mentor this effort are pretty much
> strained
> > :( For me one would have to investigate answers to those questions and
> > prioritise it compared to other ongoing efforts, before I could vote +1
> for
> > this.
> >
> > Couple of things to consider:
> > - would it be only a job manager/failure region recovery feature?
> > - would it require changes in CheckpointBarrierHandler,
> > CheckpointCoordinator classes?
> > - with `at-most-once` semantic theoretically speaking we could just drop
> > the current `CheckpointBarrier` handling/injecting code and avoid all of
> > the checkpoint alignment issues - we could just checkpoint all of the
> tasks
> > independently of one another. However maybe that could be a follow up
> > optimisation step?
> >
> > Piotrek
> >
> > > On 11 Jun 2019, at 10:53, Zili Chen <wander4...@gmail.com> wrote:
> > >
> > > Hi Xiaogang,
> > >
> > > It is an interesting topic.
> > >
> > > Notice that there is some effort to build a mature mllib of flink these
> > > days, it could be also possible for some ml cases trade off correctness
> > for
> > > timeliness or throughput. Excatly-once delivery excatly makes flink
> stand
> > > out but an at-most-once option would adapt flink to more scenarios.
> > >
> > > Best,
> > > tison.
> > >
> > >
> > > SHI Xiaogang <shixiaoga...@gmail.com> 于2019年6月11日周二 下午4:33写道:
> > >
> > >> Flink offers a fault-tolerance mechanism to guarantee at-least-once
> and
> > >> exactly-once message delivery in case of failures. The mechanism works
> > well
> > >> in practice and makes Flink stand out among stream processing systems.
> > >>
> > >> But the guarantee on at-least-once and exactly-once delivery does not
> > come
> > >> without price. It typically requires to restart multiple tasks and
> fall
> > >> back to the place where the last checkpoint is taken. (Fined-grained
> > >> recovery can help alleviate the cost, but it still needs certain
> > efforts to
> > >> recover jobs.)
> > >>
> > >> In some senarios, users perfer quick recovery and will trade
> correctness
> > >> off. For example, in some online recommendation systems, timeliness is
> > far
> > >> more important than consistency. In such cases, we can restart only
> > those
> > >> failed tasks individually, and do not need to perform any rollback.
> > Though
> > >> some messages delivered to failed tasks may be lost, other tasks can
> > >> continuously provide service to users.
> > >>
> > >> Many of our users are demanding for at-most-once delivery in Flink.
> > What do
> > >> you think of the proposal? Any feedback is appreciated.
> > >>
> > >> Regards,
> > >> Xiaogang Shi
> > >>
> >
> >
>

Reply via email to