Re: [DISCUSS] Task speculative execution for Flink batch

Tao Yangyu Sun, 18 Nov 2018 20:08:07 -0800

Thanks Xiaowei for the inspiring comments!
Yes, we could increase the granularity of speculation from a single task to
a bundle of successive tasks, especially when the tasks are connected by
pipelined channels.
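
As a rough illustration of that idea (a sketch only, not taken from the
design doc): tasks connected by pipelined channels could be grouped into one
speculation unit, so that a slow task is re-run together with the tasks it
streams data to or from. The function name `speculation_units`, the edge
tuples, and the "pipelined"/"blocking" labels are hypothetical and are not
Flink APIs. The sketch assumes that a blocking edge can separate units because
its output is fully materialized before the consumer starts.

```python
from collections import defaultdict


def speculation_units(tasks, edges):
    """Group tasks into units that would be speculated together.

    tasks: iterable of task ids.
    edges: iterable of (producer, consumer, mode) tuples, where mode is
           "pipelined" or "blocking" (hypothetical labels).
    Tasks linked by pipelined edges end up in the same unit.
    """
    # Union-find: every task starts in its own unit.
    parent = {t: t for t in tasks}

    def find(t):
        while parent[t] != t:
            parent[t] = parent[parent[t]]  # path halving
            t = parent[t]
        return t

    # Merge the units on both sides of every pipelined edge.
    for producer, consumer, mode in edges:
        if mode == "pipelined":
            parent[find(producer)] = find(consumer)

    units = defaultdict(set)
    for t in parent:
        units[find(t)].add(t)
    # Sorted output keeps the result deterministic.
    return sorted(sorted(unit) for unit in units.values())
```

With no pipelined edges at all, every unit contains exactly one task, which
is the traditional single-task speculation as a special case.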


On Sun, Nov 18, 2018 at 2:24 PM Xiaowei Jiang <xiaow...@gmail.com> wrote:

> Thanks Yangyu for the nice design doc! One thing to consider is the
> granularity of speculation. Multiple tasks may exchange data in pipelined
> mode. In that case, speculating on a single task may not be enough. But
> you might be able to fix this problem by increasing the granularity of
> speculation. The traditional case of a single speculative task can be
> considered a special case of this.
>
> Xiaowei
>
> On Sat, Nov 17, 2018 at 10:27 PM Tao Yangyu <ryantao...@gmail.com> wrote:
>
> > Hi all,
> >
> > After some refinement, the detailed design doc is available here:
> >
> > https://docs.google.com/document/d/1X_Pfo4WcO-TEZmmVTTYNn44LQg5gnFeeaeqM7ZNLQ7M/edit?usp=sharing
> >
> > Your reviews and comments are much appreciated and will greatly help
> > in completing the feature.
> >
> > Best,
> > Ryan
> >
> >
> > On Wed, Nov 7, 2018 at 4:49 PM Tao Yangyu <ryantao...@gmail.com> wrote:
> >
> > > Thanks so much for all your feedback!
> > >
> > > Yes, as Jin Sun mentioned above, the design currently targets batch in
> > > order to explore the general framework and basic modules. The strategy
> > > could also be applied to streaming with some additional code, for
> > > example for result commitment.
> > >
> > > On Wed, Nov 7, 2018 at 8:38 AM Jin Sun <isun...@gmail.com> wrote:
> > >
> > >> I think this targets batch at the very beginning. The idea should
> > >> also work for both cases, with a different algorithm/strategy.
> > >>
> > >> Ryan, since you are working on this, I will assign FLINK-10644
> > >> <https://issues.apache.org/jira/browse/FLINK-10644> to you.
> > >>
> > >> Jin
> > >>
> > >> > On Nov 6, 2018, at 4:45 AM, Till Rohrmann <trohrm...@apache.org>
> > >> > wrote:
> > >> >
> > >> > Thanks for starting this discussion, Ryan. I'm looking forward to
> > >> > your design document about this feature. Quick question: will it be
> > >> > a batch-only feature? If not, then it needs to take checkpointing
> > >> > into account as well.
> > >> >
> > >> > Cheers,
> > >> > Till
> > >> >
> > >> > On Tue, Nov 6, 2018 at 4:29 AM zhijiang
> > >> > <wangzhijiang...@aliyun.com.invalid> wrote:
> > >> >
> > >> >> Thanks Yangyu for launching this discussion.
> > >> >>
> > >> >> I really like this proposal. In production we have frequently seen
> > >> >> a few long-tail tasks delay the total execution time of a batch
> > >> >> job. We also have some thoughts on introducing this mechanism.
> > >> >> Looking forward to your detailed design doc, so that we can
> > >> >> discuss further.
> > >> >>
> > >> >> Best,
> > >> >> Zhijiang
> > >> >> ------------------------------------------------------------------
> > >> >> From: Tao Yangyu <ryantao...@gmail.com>
> > >> >> Sent: Tuesday, November 6, 2018 11:01
> > >> >> To: dev <dev@flink.apache.org>
> > >> >> Subject: [DISCUSS] Task speculative execution for Flink batch
> > >> >>
> > >> >> Hi everyone,
> > >> >>
> > >> >> In this message we propose task speculative execution for Flink
> > >> >> batch, as follows.
> > >> >>
> > >> >> In batch mode, a job is usually divided into multiple parallel
> > >> >> tasks executed across many nodes in the cluster. It is common for
> > >> >> some nodes to suffer performance degradation due to hardware
> > >> >> problems, unexpectedly busy I/O, or high CPU load. Such
> > >> >> degradation can make the tasks running on the node quite slow;
> > >> >> these are the so-called long-tail tasks. Although long-tail tasks
> > >> >> do not fail, they can severely affect the total job running time.
> > >> >> The Flink task scheduler currently does not take this long-tail
> > >> >> problem into account.
> > >> >>
> > >> >>
> > >> >>
> > >> >> Here we propose a speculative execution strategy to handle the
> > >> >> problem. The basic idea is to run a copy of a task on another node
> > >> >> when the original task is identified as a long-tail task. In more
> > >> >> detail, a speculative task is triggered when the scheduler detects
> > >> >> that the data processing throughput of a task is much lower than
> > >> >> that of the others. The speculative task is executed in parallel
> > >> >> with the original one and shares the same failure retry mechanism.
> > >> >> Once either task completes, the scheduler accepts its output as
> > >> >> the final result and cancels the other running one. Preliminary
> > >> >> experiments have demonstrated the effectiveness of this approach.
> > >> >>
> > >> >>
> > >> >> The detailed design doc will be ready soon. Your reviews and
> > >> >> comments will be much appreciated.
> > >> >>
> > >> >>
> > >> >> Thanks!
> > >> >>
> > >> >> Ryan
> > >> >>
> > >> >>
> > >>
> > >>
> >
>
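
For readers of this thread, the mechanism described in the original proposal
above (trigger a copy when a task's throughput is much lower than the others,
accept whichever attempt finishes first, cancel the other) can be sketched
roughly as follows. This is only an illustration under stated assumptions,
not the design from the doc and not Flink code: the names `find_stragglers`
and `SpeculativeTask`, the median baseline, and the `slow_ratio` threshold of
0.5 are all hypothetical choices made for the example.

```python
from statistics import median


def find_stragglers(throughput, slow_ratio=0.5):
    """Return ids of tasks whose throughput is below slow_ratio * median.

    throughput: dict mapping task id -> records per second.
    slow_ratio: hypothetical threshold; the thread does not specify one.
    """
    if len(throughput) < 2:
        return []  # nothing to compare against
    baseline = median(throughput.values())
    return sorted(t for t, rate in throughput.items()
                  if rate < slow_ratio * baseline)


class SpeculativeTask:
    """Tracks the attempts of one task; the first attempt to finish wins."""

    def __init__(self, task_id):
        self.task_id = task_id
        self.attempts = {"original": "RUNNING"}
        self.result = None

    def speculate(self):
        # Launch at most one copy, and only while no attempt has finished.
        if self.result is None and "speculative" not in self.attempts:
            self.attempts["speculative"] = "RUNNING"

    def on_finished(self, attempt, output):
        # Ignore completions from attempts that are unknown or cancelled.
        if self.attempts.get(attempt) != "RUNNING":
            return False
        self.attempts[attempt] = "FINISHED"
        self.result = output
        # Cancel every other attempt that is still running.
        for other, state in self.attempts.items():
            if other != attempt and state == "RUNNING":
                self.attempts[other] = "CANCELED"
        return True
```

A scheduler loop would call `find_stragglers` periodically, call `speculate()`
on each reported task, and route completion events to `on_finished`. Failure
retry and the commitment of results are left out of this sketch.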
