Re: [DISCUSS] Task speculative execution for Flink batch

Tao Yangyu Sat, 17 Nov 2018 06:27:14 -0800

Hi all,

After some refinement, the detailed design doc is available here:
https://docs.google.com/document/d/1X_Pfo4WcO-TEZmmVTTYNn44LQg5gnFeeaeqM7ZNLQ7M/edit?usp=sharing


Your reviews and comments are much appreciated and will greatly help in
completing the feature.

Best,
Ryan


On Wed, Nov 7, 2018 at 4:49 PM, Tao Yangyu <[email protected]> wrote:

> Thanks so much for all your feedback!
>
> Yes, as Jin Sun mentioned above, the design currently targets batch in order
> to explore the general framework and basic modules. The strategy could also
> be applied to streaming with some additional code, for example for result
> commitment.
>
> On Wed, Nov 7, 2018 at 8:38 AM, Jin Sun <[email protected]> wrote:
>
>> I think this targets batch at the very beginning; the idea should
>> also work for both cases, with different algorithms/strategies.
>>
>> Ryan, since you are working on this, I will assign FLINK-10644 <
>> https://issues.apache.org/jira/browse/FLINK-10644> to you.
>>
>> Jin
>>
>> > On Nov 6, 2018, at 4:45 AM, Till Rohrmann <[email protected]> wrote:
>> >
>> > Thanks for starting this discussion Ryan. I'm looking forward to your
>> > design document about this feature. Quick question: Will it be a
>> > batch-only feature? If not, then it needs to take checkpointing into
>> > account as well.
>> >
>> > Cheers,
>> > Till
>> >
>> > On Tue, Nov 6, 2018 at 4:29 AM zhijiang <[email protected]>
>> > wrote:
>> >
>> >> Thanks, Yangyu, for launching this discussion.
>> >>
>> >> I really like this proposal. In production we have frequently seen
>> >> some long-tail tasks delay the total execution time of a batch job.
>> >> We also have some thoughts on introducing this mechanism. Looking
>> >> forward to your detailed design doc, so that we can discuss further.
>> >>
>> >> Best,
>> >> Zhijiang
>> >> ------------------------------------------------------------------
>> >> From: Tao Yangyu <[email protected]>
>> >> Sent: Tuesday, November 6, 2018 11:01
>> >> To: dev <[email protected]>
>> >> Subject: [DISCUSS] Task speculative execution for Flink batch
>> >>
>> >> Hi everyone,
>> >>
>> >> We propose task speculative execution for Flink batch, as described
>> >> below.
>> >>
>> >> In batch mode, a job is usually divided into multiple parallel tasks
>> >> executed across many nodes in the cluster. It is common to encounter
>> >> performance degradation on some nodes due to hardware problems, or
>> >> occasional busy I/O and high CPU load. This kind of degradation can
>> >> make the tasks running on the node quite slow; these are the so-called
>> >> long-tail tasks. Although long-tail tasks do not fail, they can severely
>> >> affect the total job running time. The Flink task scheduler does not
>> >> currently take this long-tail problem into account.
>> >>
>> >>
>> >>
>> >> Here we propose a speculative execution strategy to handle the problem.
>> >> The basic idea is to run a copy of a task on another node when the
>> >> original task is identified as a long-tail task. In more detail, the
>> >> speculative task is triggered when the scheduler detects that the data
>> >> processing throughput of a task is much lower than that of the others.
>> >> The speculative task is executed in parallel with the original one and
>> >> shares the same failure retry mechanism. Once either task completes, the
>> >> scheduler accepts its output as the final result and cancels the other
>> >> running one. Preliminary experiments have demonstrated the effectiveness
>> >> of this approach.
>> >>
>> >>
>> >> The detailed design doc will be ready soon. Your reviews and comments
>> >> will be much appreciated.
>> >>
>> >>
>> >> Thanks!
>> >>
>> >> Ryan
>> >>
>> >>
>>
>>
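
For readers following the thread, the strategy described in the original
proposal (flag a task whose throughput is much lower than the others, run a
copy in parallel, accept whichever attempt completes first and cancel the
other) can be illustrated roughly as below. This is only a minimal sketch in
plain Python, not Flink code and not the design from the linked doc. The
function names, the median-based comparison, and the slow_ratio threshold are
all assumptions made for illustration.

```python
from statistics import median


def find_long_tail_tasks(throughputs, slow_ratio=0.5):
    """Return ids of tasks whose throughput is below slow_ratio * median.

    throughputs: dict mapping task id -> records processed per second.
    slow_ratio is a hypothetical threshold, not a value from the proposal.
    """
    if len(throughputs) < 2:
        return []  # nothing to compare against
    typical = median(throughputs.values())
    return sorted(t for t, rate in throughputs.items()
                  if rate < slow_ratio * typical)


def resolve_attempts(finish_times):
    """Accept the attempt that completes first; the others are cancelled.

    finish_times: dict mapping attempt name -> completion time in seconds.
    Returns (winner, cancelled_attempts).
    """
    winner = min(finish_times, key=finish_times.get)
    cancelled = sorted(a for a in finish_times if a != winner)
    return winner, cancelled


# Made-up numbers: task-2 runs on a degraded node.
throughputs = {"task-0": 1000.0, "task-1": 950.0,
               "task-2": 120.0, "task-3": 1020.0}
slow = find_long_tail_tasks(throughputs)

# The speculative copy of the slow task finishes before the original.
winner, cancelled = resolve_attempts({"original": 90.0, "speculative": 35.0})
print(slow, winner, cancelled)
```

Comparing against the median rather than the mean is one possible choice; it
keeps a single very slow task from dragging down the baseline it is measured
against.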
