Re: [DISCUSS] Task speculative execution for Flink batch

SHI Xiaogang Tue, 06 Nov 2018 18:16:08 -0800

Hi,

+1 for the speculative execution.


It would be even better if it worked well with the existing checkpointing and
pipelined execution. That way, we can take a further step towards the
unification of batch and stream processing.

Regards,
Xiaogang

On Wed, Nov 7, 2018 at 9:40 AM, Jeff Zhang <[email protected]> wrote:

> +1 for speculative execution for Flink batch. Speculative execution is
> used in many batch execution engines, such as MapReduce, Tez, and Spark.
> This would be a great improvement for Flink in batch scenarios.
>
> On Wed, Nov 7, 2018 at 8:38 AM, Jin Sun <[email protected]> wrote:
>
> > I think this targets batch at the very beginning, but the idea should
> > also work for both cases, with a different algorithm/strategy.
> >
> > Ryan, since you are working on this, I will assign FLINK-10644 <
> > https://issues.apache.org/jira/browse/FLINK-10644> to you.
> >
> > Jin
> >
> > > On Nov 6, 2018, at 4:45 AM, Till Rohrmann <[email protected]> wrote:
> > >
> > > Thanks for starting this discussion, Ryan. I'm looking forward to your
> > > design document about this feature. Quick question: Will it be a
> > > batch-only feature? If not, then it needs to take checkpointing into
> > > account as well.
> > >
> > > Cheers,
> > > Till
> > >
> > > On Tue, Nov 6, 2018 at 4:29 AM zhijiang <[email protected]> wrote:
> > >
> > >> Thanks, Yangyu, for launching this discussion.
> > >>
> > >> I really like this proposal. In production, we have frequently seen a
> > >> few long-tail tasks delay the total execution time of a batch job.
> > >> We also have some thoughts on introducing this mechanism. Looking
> > >> forward to your detailed design doc; then we can discuss further.
> > >>
> > >> Best,
> > >> Zhijiang
> > >> ------------------------------------------------------------------
> > >> From: Tao Yangyu <[email protected]>
> > >> Sent: Tuesday, November 6, 2018, 11:01
> > >> To: dev <[email protected]>
> > >> Subject: [DISCUSS] Task speculative execution for Flink batch
> > >>
> > >> Hi everyone,
> > >>
> > >> In this message, we propose task speculative execution for Flink
> > >> batch, as follows.
> > >>
> > >> In batch mode, a job is usually divided into multiple parallel tasks
> > >> executed across many nodes in the cluster. It is common for some nodes
> > >> to suffer performance degradation due to hardware problems, or to
> > >> incidental heavy I/O or high CPU load. Such degradation can make the
> > >> tasks running on the affected node quite slow; these are the so-called
> > >> long-tail tasks. Although long-tail tasks do not fail, they can
> > >> severely affect the total job running time. The Flink task scheduler
> > >> does not currently take this long-tail problem into account.
> > >>
> > >>
> > >>
> > >> Here we propose a speculative execution strategy to handle the
> > >> problem. The basic idea is to run a copy of a task on another node when
> > >> the original task is identified as a long-tail task. In more detail,
> > >> the speculative task is triggered when the scheduler detects that the
> > >> data processing throughput of a task is much lower than that of the
> > >> others. The speculative task is executed in parallel with the original
> > >> one and shares the same failure retry mechanism. Once either task
> > >> completes, the scheduler accepts its output as the final result and
> > >> cancels the other running one. Preliminary experiments have
> > >> demonstrated the effectiveness of this approach.
> > >>
> > >>
> > >> The detailed design doc will be ready soon. Your reviews and comments
> > >> will be much appreciated.
> > >>
> > >>
> > >> Thanks!
> > >>
> > >> Ryan
> > >>
> > >>
> >
> >
>
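
As a rough illustration of the mechanism described in the original proposal
(flag a task whose throughput is much lower than that of its peers, run a copy
of it, and keep whichever attempt finishes first), here is a minimal Python
sketch. It is not Flink code. The function names, the use of the median as the
baseline, and the 0.5 threshold are all assumptions made for illustration; the
thread does not say how "much slower" would be measured.

```python
from statistics import median


def find_slow_tasks(throughputs, slow_ratio=0.5):
    """Return the ids of tasks that look like long-tail tasks.

    throughputs: dict mapping task id -> records processed per second.
    slow_ratio:  hypothetical threshold; a task is flagged when its
                 throughput is below slow_ratio * median of all tasks.
    """
    if len(throughputs) < 2:
        # With a single task there is nothing to compare against.
        return []
    baseline = median(throughputs.values())
    return sorted(
        task for task, rate in throughputs.items()
        if rate < slow_ratio * baseline
    )


def pick_winner(finish_times):
    """Choose which attempt's output to keep.

    finish_times: dict mapping attempt name -> completion time in seconds.
    Returns (winner, attempts_to_cancel): the attempt that finished first,
    and the remaining attempts, which the scheduler would cancel.
    """
    winner = min(finish_times, key=finish_times.get)
    to_cancel = [name for name in finish_times if name != winner]
    return winner, to_cancel


if __name__ == "__main__":
    rates = {"task-0": 100.0, "task-1": 95.0, "task-2": 20.0, "task-3": 105.0}
    print(find_slow_tasks(rates))
    print(pick_winner({"original": 120.0, "speculative": 45.0}))
```

A real scheduler would also have to decide when throughput samples are
reliable enough to act on and how many speculative copies to allow; those
questions are left to the design doc mentioned in the proposal.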
