Re: [DISCUSS] Task speculative execution for Flink batch

Tao Yangyu Wed, 07 Nov 2018 00:50:19 -0800

Thanks so much for all your feedback!

Yes, as Jin Sun mentioned above, the design currently targets batch in order
to explore the general framework and basic modules. The strategy could also
be applied to streaming with some additional code, for example for result
commitment.


On Wed, Nov 7, 2018 at 8:38 AM, Jin Sun <isun...@gmail.com> wrote:

> I think this targets batch at the very beginning; the idea should also
> work for both cases, with a different algorithm/strategy.
>
> Ryan, since you are working on this, I will assign FLINK-10644 <
> https://issues.apache.org/jira/browse/FLINK-10644> to you.
>
> Jin
>
> > On Nov 6, 2018, at 4:45 AM, Till Rohrmann <trohrm...@apache.org> wrote:
> >
> > Thanks for starting this discussion Ryan. I'm looking forward to your
> > design document about this feature. Quick question: Will it be a
> > batch-only feature? If not, then it needs to take checkpointing into
> > account as well.
> >
> > Cheers,
> > Till
> >
> > On Tue, Nov 6, 2018 at 4:29 AM zhijiang <wangzhijiang...@aliyun.com.invalid>
> > wrote:
> >
> >> Thanks yangyu for launching this discussion.
> >>
> >> I really like this proposal. In production we have frequently seen a
> >> few long tail tasks delay the total execution time of a batch job.
> >> We also have some thoughts on introducing this mechanism. Looking
> >> forward to your detailed design doc, then we can discuss further.
> >>
> >> Best,
> >> Zhijiang
> >> ------------------------------------------------------------------
> >> From: Tao Yangyu <ryantao...@gmail.com>
> >> Sent: Tuesday, November 6, 2018, 11:01
> >> To: dev <dev@flink.apache.org>
> >> Subject: [DISCUSS] Task speculative execution for Flink batch
> >>
> >> Hi everyone,
> >>
> >> We propose task speculative execution for Flink batch, as described
> >> below.
> >>
> >> In batch mode, a job is usually divided into multiple parallel tasks
> >> executed across many nodes in the cluster. It is common to encounter
> >> performance degradation on some nodes due to hardware problems or
> >> unexpectedly busy I/O and high CPU load. This kind of degradation can
> >> make the tasks running on those nodes quite slow; these are the
> >> so-called long tail tasks. Although long tail tasks do not fail, they
> >> can severely affect the total job running time. The Flink task
> >> scheduler does not currently take this long tail problem into account.
> >>
> >>
> >>
> >> Here we propose a speculative execution strategy to handle the problem.
> >> The basic idea is to run a copy of a task on another node when the
> >> original task is identified as a long tail task. In more detail, a
> >> speculative task is triggered when the scheduler detects that the data
> >> processing throughput of a task is much lower than that of the others.
> >> The speculative task is executed in parallel with the original one and
> >> shares the same failure retry mechanism. Once either task completes,
> >> the scheduler accepts its output as the final result and cancels the
> >> other one, which is still running. Preliminary experiments have
> >> demonstrated the effectiveness of this approach.
> >>
> >>
> >> The detailed design doc will be ready soon. Your reviews and comments
> >> will be much appreciated.
> >>
> >>
> >> Thanks!
> >>
> >> Ryan
> >>
> >>
>
>
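
The strategy described in the original proposal (detect a task whose throughput is much lower than its peers, run a copy, accept whichever attempt finishes first, cancel the other) could be sketched roughly as below. This is an illustrative Python sketch only, not Flink code: the names `find_long_tail_tasks`, `SpeculativeTask`, and the `slow_ratio` threshold are hypothetical, and comparing against the median is an assumption, since the proposal only says "much slower than others".

```python
from statistics import median


def find_long_tail_tasks(throughputs, slow_ratio=0.5):
    """Return ids of tasks whose throughput is below slow_ratio * median.

    throughputs: dict mapping task id -> records processed per second.
    slow_ratio: hypothetical threshold, not taken from the proposal.
    """
    if len(throughputs) < 2:
        return []  # nothing to compare against
    typical = median(throughputs.values())
    return sorted(t for t, rate in throughputs.items()
                  if rate < slow_ratio * typical)


class SpeculativeTask:
    """Tracks the original attempt and at most one speculative copy."""

    def __init__(self, task_id):
        self.task_id = task_id
        self.attempts = {"original": "RUNNING"}
        self.result = None

    def launch_speculative(self):
        # Start a copy (conceptually on another node); no-op if one exists.
        self.attempts.setdefault("speculative", "RUNNING")

    def on_attempt_finished(self, attempt, output):
        """Accept the first finished attempt and cancel the other one."""
        if self.result is not None:
            return False  # a result was already accepted; ignore late finisher
        self.attempts[attempt] = "FINISHED"
        self.result = output
        for other, state in self.attempts.items():
            if other != attempt and state == "RUNNING":
                self.attempts[other] = "CANCELED"
        return True
```

For example, with throughputs of 100, 95, 110 and 20 records per second, only the last task falls below half the median and would get a speculative copy. Whichever attempt reports completion first has its output accepted, and the other attempt is marked canceled.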
