+1 for speculative execution in Flink batch. Speculative execution is used
in many batch execution engines such as MapReduce, Tez, and Spark. This would
be a great improvement for Flink in batch scenarios.

Jin Sun <isun...@gmail.com> wrote on Wed, Nov 7, 2018 at 8:38 AM:

> I think this targets batch at the very beginning; the idea should also
> work for both cases, with a different algorithm/strategy.
>
> Ryan, since you are working on this, I will assign FLINK-10644 <
> https://issues.apache.org/jira/browse/FLINK-10644> to you.
>
> Jin
>
> > On Nov 6, 2018, at 4:45 AM, Till Rohrmann <trohrm...@apache.org> wrote:
> >
> > Thanks for starting this discussion Ryan. I'm looking forward to your
> > design document about this feature. Quick question: Will it be a
> > batch-only feature? If not, then it needs to take checkpointing into
> > account as well.
> >
> > Cheers,
> > Till
> >
> > On Tue, Nov 6, 2018 at 4:29 AM zhijiang <wangzhijiang...@aliyun.com.invalid>
> > wrote:
> >
> >> Thanks yangyu for launching this discussion.
> >>
> >> I really like this proposal. In production we have frequently seen a
> >> few long-tail tasks delay the total execution time of a batch job.
> >> We also have some thoughts on introducing this mechanism. Looking
> >> forward to your detailed design doc, so we can discuss further.
> >>
> >> Best,
> >> Zhijiang
> >> ------------------------------------------------------------------
> >> From: Tao Yangyu <ryantao...@gmail.com>
> >> Sent: Tuesday, November 6, 2018 11:01
> >> To: dev <dev@flink.apache.org>
> >> Subject: [DISCUSS] Task speculative execution for Flink batch
> >>
> >> Hi everyone,
> >>
> >> We propose task speculative execution for Flink batch in this message as
> >> follows.
> >>
> >> In batch mode, a job is usually divided into multiple parallel tasks
> >> executed across many nodes in the cluster. It is common to encounter
> >> performance degradation on some nodes due to hardware problems, or to
> >> incidental busy I/O and high CPU load. This kind of degradation can
> >> make the tasks running on the node quite slow; these are the so-called
> >> long-tail tasks. Although the long-tail tasks will not fail, they can
> >> severely affect the total job running time. The Flink task scheduler
> >> does not currently take this long-tail problem into account.
> >>
> >>
> >>
> >> Here we propose a speculative execution strategy to handle the problem.
> >> The basic idea is to run a copy of a task on another node when the
> >> original task is identified as a long-tail task. In more detail, the
> >> speculative task is triggered when the scheduler detects that the data
> >> processing throughput of a task is much lower than that of the others.
> >> The speculative task is executed in parallel with the original one and
> >> shares the same failure retry mechanism. Once either task completes,
> >> the scheduler accepts its output as the final result and cancels the
> >> other running one. Preliminary experiments have demonstrated the
> >> effectiveness of this approach.
> >>
> >>
> >> The detailed design doc will be ready soon. Your reviews and comments
> >> will be much appreciated.
> >>
> >>
> >> Thanks!
> >>
> >> Ryan
> >>
> >>
>
>
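
For readers following the thread, the mechanism described in the original
proposal (flag a task whose throughput lags far behind its peers, run a copy,
keep whichever finishes first, cancel the other) can be sketched roughly as
below. This is only an illustration in plain Python, not Flink code and not
the design from the promised document: the names find_slow_tasks, attempt and
race, the comparison against the median, and the slow_ratio value of 0.5 are
all assumptions made for the example. The shared failure retry mechanism
mentioned in the proposal is not modeled.

```python
import asyncio
from statistics import median


def find_slow_tasks(throughput, slow_ratio=0.5):
    """Return ids of tasks whose throughput is below slow_ratio * median.

    `throughput` maps a task id to records/second. Using the median and
    slow_ratio as the "much slower than others" test is an assumption.
    """
    threshold = slow_ratio * median(throughput.values())
    return sorted(t for t, rate in throughput.items() if rate < threshold)


async def attempt(name, seconds):
    """Hypothetical stand-in for one execution attempt of a task."""
    await asyncio.sleep(seconds)
    return name


async def race(original, speculative):
    """Run both attempts; keep the first result and cancel the other."""
    tasks = [asyncio.create_task(original), asyncio.create_task(speculative)]
    done, pending = await asyncio.wait(
        tasks, return_when=asyncio.FIRST_COMPLETED
    )
    for task in pending:
        task.cancel()
    # Wait for the cancelled attempt to unwind before returning.
    await asyncio.gather(*pending, return_exceptions=True)
    # If both happen to be done, prefer the original attempt.
    winner = next(task for task in tasks if task in done)
    return winner.result()


if __name__ == "__main__":
    rates = {"t1": 100.0, "t2": 95.0, "t3": 110.0, "t4": 20.0}
    print(find_slow_tasks(rates))
    print(asyncio.run(race(attempt("original", 0.5),
                           attempt("speculative", 0.01))))
```

A real scheduler would also have to decide when throughput samples are
reliable enough to compare and how many speculative copies to allow; the
thread leaves those points to the design doc.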
