+1 for speculative execution for Flink batch. Speculative execution is used in many batch execution engines, such as MapReduce, Tez, and Spark. This would be a great improvement for Flink in batch scenarios.

Jin Sun <isun...@gmail.com> wrote on Wed, Nov 7, 2018 at 8:38 AM:

> I think this is targeted at batch at the very beginning; the idea should
> also work for both cases, with different algorithms/strategies.
>
> Ryan, since you are working on this, I will assign FLINK-10644
> <https://issues.apache.org/jira/browse/FLINK-10644> to you.
>
> Jin
>
> > On Nov 6, 2018, at 4:45 AM, Till Rohrmann <trohrm...@apache.org> wrote:
> >
> > Thanks for starting this discussion, Ryan. I'm looking forward to your
> > design document about this feature. Quick question: Will it be a
> > batch-only feature? If not, then it needs to take checkpointing into
> > account as well.
> >
> > Cheers,
> > Till
> >
> > On Tue, Nov 6, 2018 at 4:29 AM zhijiang
> > <wangzhijiang...@aliyun.com.invalid> wrote:
> >
> >> Thanks, Yangyu, for launching this discussion.
> >>
> >> I really like this proposal. In production we have frequently seen
> >> long-tail tasks delay the total execution time of a batch job. We also
> >> have some thoughts on bringing in this mechanism. Looking forward to
> >> your detailed design doc; then we can discuss further.
> >>
> >> Best,
> >> Zhijiang
> >> ------------------------------------------------------------------
> >> From: Tao Yangyu <ryantao...@gmail.com>
> >> Sent: Tuesday, November 6, 2018, 11:01
> >> To: dev <dev@flink.apache.org>
> >> Subject: [DISCUSS] Task speculative execution for Flink batch
> >>
> >> Hi everyone,
> >>
> >> In this message we propose task speculative execution for Flink batch,
> >> as follows.
> >>
> >> In batch mode, a job is usually divided into multiple parallel tasks
> >> executed across many nodes in the cluster. It is common to encounter
> >> performance degradation on some nodes due to hardware problems, or to
> >> accidentally busy I/O and high CPU load. This kind of degradation can
> >> make the tasks running on the node quite slow; these are the so-called
> >> long-tail tasks. Although long-tail tasks will not fail, they can
> >> severely affect the total job running time. The Flink task scheduler
> >> does not currently take this long-tail problem into account.
> >>
> >> Here we propose a speculative execution strategy to handle the problem.
> >> The basic idea is to run a copy of a task on another node when the
> >> original task is identified as a long-tail task. In more detail, the
> >> speculative task is triggered when the scheduler detects that the data
> >> processing throughput of a task is much lower than that of the others.
> >> The speculative task is executed in parallel with the original one and
> >> shares the same failure retry mechanism. Once either task completes,
> >> the scheduler accepts its output as the final result and cancels the
> >> other running one. Preliminary experiments have demonstrated the
> >> effectiveness of this approach.
> >>
> >> The detailed design doc will be ready soon. Your reviews and comments
> >> will be much appreciated.
> >>
> >> Thanks!
> >>
> >> Ryan
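
To make the quoted proposal a bit more concrete, here is a minimal sketch of its two steps: flagging a task whose throughput is much lower than that of its peers, and letting whichever attempt finishes first win while the other is cancelled. This is plain Python, not Flink code. The names `find_long_tail_tasks` and `first_finisher_wins`, the `slow_ratio` and `min_tasks` knobs, and the median baseline are all assumptions made for illustration; the proposal does not specify a detection rule, and the design doc may choose something quite different.

```python
import threading
from concurrent.futures import FIRST_COMPLETED, ThreadPoolExecutor, wait
from statistics import median


def find_long_tail_tasks(throughputs, slow_ratio=0.5, min_tasks=3):
    """Return the ids of tasks that look like long-tail candidates.

    throughputs maps a task id to its observed records per second.
    slow_ratio and min_tasks are hypothetical tuning knobs, not Flink options.
    """
    if len(throughputs) < min_tasks:
        return []  # too few peers to compare against
    baseline = median(throughputs.values())
    # A task is flagged when it runs below slow_ratio times the peer median.
    return sorted(
        task_id
        for task_id, rate in throughputs.items()
        if rate < slow_ratio * baseline
    )


def first_finisher_wins(attempts):
    """Run all attempts in parallel and return the first result to arrive.

    Each attempt is a callable taking a threading.Event. The event is set
    once a winner exists, so the losing attempts can stop cooperatively
    (Python threads cannot be killed from the outside).
    Simplification: all attempts start together here, whereas the proposal
    starts the speculative copy only after the original is flagged as slow.
    """
    cancel = threading.Event()
    with ThreadPoolExecutor(max_workers=len(attempts)) as pool:
        futures = [pool.submit(attempt, cancel) for attempt in attempts]
        done, _ = wait(futures, return_when=FIRST_COMPLETED)
        cancel.set()  # ask the remaining attempts to stop
        return next(iter(done)).result()
```

A scheduler loop could call `find_long_tail_tasks` periodically with fresh throughput metrics and start a second attempt for each flagged task. The sketch leaves out the shared failure retry mechanism mentioned in the proposal, and it does not address checkpointing, which Till raised for the non-batch case.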