Re: [DISCUSS] Task speculative execution for Flink batch

Jin Sun Tue, 06 Nov 2018 16:38:55 -0800

I think this targets batch at the very beginning, but the idea should also 
work for both cases, with different algorithms/strategies.


Ryan, since you are working on this, I will assign FLINK-10644 
<https://issues.apache.org/jira/browse/FLINK-10644> to you.

Jin

> On Nov 6, 2018, at 4:45 AM, Till Rohrmann <[email protected]> wrote:
> 
> Thanks for starting this discussion Ryan. I'm looking forward to your
> design document about this feature. Quick question: Will it be a batch only
> feature? If not, then it needs to take checkpointing into account as well.
> 
> Cheers,
> Till
> 
> On Tue, Nov 6, 2018 at 4:29 AM zhijiang <[email protected]>
> wrote:
> 
>> Thanks yangyu for launching this discussion.
>> 
>> I really like this proposal. In production we have frequently seen a few
>> long tail tasks delay the total execution time of a batch job. We also
>> have some thoughts on introducing this mechanism. Looking forward to your
>> detailed design doc, so that we can discuss it further.
>> 
>> Best,
>> Zhijiang
>> ------------------------------------------------------------------
>> From: Tao Yangyu <[email protected]>
>> Sent: Tuesday, November 6, 2018 11:01
>> To: dev <[email protected]>
>> Subject: [DISCUSS] Task speculative execution for Flink batch
>> 
>> Hi everyone,
>> 
>> In this message we propose speculative task execution for Flink batch, as
>> follows.
>> 
>> In batch mode, a job is usually divided into multiple parallel tasks
>> executed across many nodes in the cluster. It is common for performance to
>> degrade on some nodes due to hardware problems, occasional heavy I/O, or
>> high CPU load. Such degradation can make the tasks running on those nodes
>> very slow; these are the so-called long tail tasks. Although long tail
>> tasks do not fail, they can severely affect the total job running time.
>> The Flink task scheduler does not currently take this long tail problem
>> into account.
>> 
>> 
>> 
>> Here we propose a speculative execution strategy to handle the problem.
>> The basic idea is to run a copy of a task on another node when the original
>> task is identified as a long tail task. In more detail, the speculative
>> task is triggered when the scheduler detects that the data processing
>> throughput of a task is much lower than that of the others. The speculative
>> task is executed in parallel with the original one and shares the same
>> failure retry mechanism. Once either task completes, the scheduler accepts
>> its output as the final result and cancels the other one, which is still
>> running. Preliminary experiments have demonstrated the effectiveness of
>> this approach.
>> 
>> 
>> The detailed design doc will be ready soon.  Your reviews and comments will
>> be much appreciated.
>> 
>> 
>> Thanks!
>> 
>> Ryan
>> 
>>
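
The strategy in the quoted proposal (detect a task whose throughput is much
lower than the others, run a copy, accept whichever attempt finishes first)
can be sketched roughly as below. This is only an illustration in Python, not
Flink code and not the design from the promised doc. The function names, the
median baseline, and the slow_ratio threshold are all assumptions made for
the sketch; the proposal does not say how "much lower" is measured.

```python
from statistics import median


def find_long_tail_tasks(throughputs, slow_ratio=0.5):
    """Return ids of tasks whose throughput is well below that of their peers.

    throughputs: dict mapping task id -> records processed per second.
    slow_ratio:  hypothetical threshold; a task is flagged when its
                 throughput is below slow_ratio * median of all tasks.
    """
    if not throughputs:
        return []
    baseline = median(throughputs.values())
    return sorted(
        task for task, rate in throughputs.items()
        if rate < slow_ratio * baseline
    )


def resolve_attempts(finish_times):
    """Accept the attempt that finishes first and cancel the others.

    finish_times: dict mapping attempt name -> completion time in seconds.
    Returns (accepted attempt, list of attempts to cancel).
    """
    winner = min(finish_times, key=finish_times.get)
    cancelled = sorted(a for a in finish_times if a != winner)
    return winner, cancelled


if __name__ == "__main__":
    rates = {"task-0": 100.0, "task-1": 95.0, "task-2": 20.0, "task-3": 105.0}
    print(find_long_tail_tasks(rates))
    print(resolve_attempts({"original": 42.0, "speculative": 17.5}))
```

A relative baseline such as the median keeps the check independent of the
absolute speed of the job; a real scheduler would also need to decide how
long to observe a task before judging it and how many copies to allow.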
