I think this targets batch at the very beginning, but the idea should also work for both cases (batch and streaming), with different algorithms/strategies.
Ryan, since you are working on this, I will assign FLINK-10644 <https://issues.apache.org/jira/browse/FLINK-10644> to you.

Jin

> On Nov 6, 2018, at 4:45 AM, Till Rohrmann <trohrm...@apache.org> wrote:
>
> Thanks for starting this discussion, Ryan. I'm looking forward to your
> design document about this feature. Quick question: will it be a batch-only
> feature? If not, then it needs to take checkpointing into account as well.
>
> Cheers,
> Till
>
> On Tue, Nov 6, 2018 at 4:29 AM zhijiang <wangzhijiang...@aliyun.com.invalid>
> wrote:
>
>> Thanks, yangyu, for launching this discussion.
>>
>> I really like this proposal. In production we have frequently seen a few
>> long-tail tasks delay the total execution time of a batch job.
>> We also have some thoughts on introducing this mechanism. Looking forward
>> to your detailed design doc; then we can discuss further.
>>
>> Best,
>> Zhijiang
>> ------------------------------------------------------------------
>> From: Tao Yangyu <ryantao...@gmail.com>
>> Sent: Tuesday, November 6, 2018, 11:01
>> To: dev <dev@flink.apache.org>
>> Subject: [DISCUSS] Task speculative execution for Flink batch
>>
>> Hi everyone,
>>
>> We propose task speculative execution for Flink batch in this message, as
>> follows.
>>
>> In batch mode, a job is usually divided into multiple parallel tasks
>> executed across many nodes in the cluster. It is common to encounter
>> performance degradation on some nodes due to hardware problems, incidental
>> I/O contention, or high CPU load. This kind of degradation can cause the
>> tasks running on such a node to become very slow; these are the so-called
>> long-tail tasks. Although long-tail tasks do not fail, they can severely
>> affect the total job running time. The Flink task scheduler does not
>> currently take this long-tail problem into account.
>>
>> Here we propose a speculative execution strategy to handle the problem.
>> The basic idea is to run a copy of a task on another node when the original
>> task is identified as a long-tail task. In more detail, the speculative
>> task is triggered when the scheduler detects that the data processing
>> throughput of a task is much lower than that of the others. The speculative
>> task is executed in parallel with the original one and shares the same
>> failure retry mechanism. Once either task completes, the scheduler accepts
>> its output as the final result and cancels the other running one.
>> Preliminary experiments have demonstrated the effectiveness of this
>> approach.
>>
>> The detailed design doc will be ready soon. Your reviews and comments will
>> be much appreciated.
>>
>> Thanks!
>>
>> Ryan
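
To make the idea in the quoted proposal concrete, here is a minimal sketch of the two decisions it describes: spotting a long-tail task by comparing its throughput with its peers, and keeping whichever attempt finishes first. This is only an illustration, not Flink code and not the design from the pending doc. The function names, the median-based rule, and the `slow_ratio` threshold are all assumptions made for this example.

```python
from statistics import median


def find_stragglers(throughputs, slow_ratio=0.5):
    """Return the ids of tasks whose throughput is far below their peers.

    throughputs: dict mapping task id -> records processed per second.
    slow_ratio:  hypothetical threshold; a task is flagged when its
                 throughput is below slow_ratio * median throughput.
    """
    if len(throughputs) < 2:
        # With fewer than two tasks there is nothing to compare against.
        return []
    typical = median(throughputs.values())
    return sorted(t for t, v in throughputs.items() if v < slow_ratio * typical)


def resolve_attempts(finish_times):
    """Pick the attempt that completed first and list the ones to cancel.

    finish_times: dict mapping attempt name -> completion time in seconds.
    Returns (winner, attempts_to_cancel).
    """
    winner = min(finish_times, key=finish_times.get)
    to_cancel = [a for a in finish_times if a != winner]
    return winner, to_cancel


if __name__ == "__main__":
    rates = {"task-1": 100, "task-2": 95, "task-3": 20, "task-4": 105}
    print(find_stragglers(rates))
    print(resolve_attempts({"original": 120.0, "speculative": 45.0}))
```

A real scheduler would also need to decide how long to observe a task before flagging it and how many speculative copies to allow; those questions are left to the design doc.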