Thanks for starting this discussion Ryan. I'm looking forward to your
design document about this feature. Quick question: Will it be a batch only
feature? If no, then it needs to take checkpointing into account as well.

Cheers,
Till

On Tue, Nov 6, 2018 at 4:29 AM zhijiang <wangzhijiang...@aliyun.com.invalid>
wrote:

> Thanks yangyu for launching this discussion.
>
> I really like this proposal. We ever found this scene frequently that some
> long tail tasks to delay the total batch job execution time in production.
> We also have some thoughts for bringing this mechanism. Looking forward to
> your detail design doc, then we can discussion further.
>
> Best,
> Zhijiang
> ------------------------------------------------------------------
> 发件人:Tao Yangyu <ryantao...@gmail.com>
> 发送时间:2018年11月6日(星期二) 11:01
> 收件人:dev <dev@flink.apache.org>
> 主 题:[DISCUSS] Task speculative execution for Flink batch
>
> Hi everyone,
>
> We propose task speculative execution for Flink batch in this message as
> follows.
>
> In the batch mode, the job is usually divided into multiple parallel tasks
> executed cross many nodes in the cluster. It is common to encounter the
> performance degradation on some nodes due to hardware problems or accident
> I/O busy and high CPU load. This kind of degradation can probably cause the
> running tasks on the node to be quite slow that is so called long tail
> tasks. Although the long tail tasks will not fail, they can severely affect
> the total job running time. Flink task scheduler does not take this long
> tail problem into account currently.
>
>
>
> Here we propose the speculative execution strategy to handle the problem.
> The basic idea is to run a copy of task on another node when the original
> task is identified to be long tail. In more details, the speculative task
> will be triggered when the scheduler detects that the data processing
> throughput of a task is much slower than others. The speculative task is
> executed in parallel with the original one and share the same failure retry
> mechanism. Once either task complete, the scheduler admits its output as
> the final result and cancel the other running one. The preliminary
> experiments has demonstrated the effectiveness.
>
>
> The detailed design doc will be ready soon.  Your reviews and comments will
> be much appreciated.
>
>
> Thanks!
>
> Ryan
>
>

Reply via email to