Hi,

+1 for the speculative execution.
It would be even better if it worked well with the existing checkpointing and
pipelined execution. That way, we can move a step further towards the
unification of batch and stream processing.

Regards,
Xiaogang

Jeff Zhang <zjf...@gmail.com> wrote on Wed, Nov 7, 2018 at 9:40 AM:

> +1 for speculative execution for Flink batch. Speculative execution is
> used in lots of batch execution engines like MR, Tez and Spark. This would
> be a great improvement for Flink in the batch scenario.
>
> Jin Sun <isun...@gmail.com> wrote on Wed, Nov 7, 2018 at 8:38 AM:
>
> > I think this targets batch at the very beginning; the idea should
> > also work for both cases, with different algorithms/strategies.
> >
> > Ryan, since you are working on this, I will assign FLINK-10644
> > <https://issues.apache.org/jira/browse/FLINK-10644> to you.
> >
> > Jin
> >
> > > On Nov 6, 2018, at 4:45 AM, Till Rohrmann <trohrm...@apache.org> wrote:
> > >
> > > Thanks for starting this discussion Ryan. I'm looking forward to your
> > > design document about this feature. Quick question: Will it be a
> > > batch-only feature? If not, then it needs to take checkpointing into
> > > account as well.
> > >
> > > Cheers,
> > > Till
> > >
> > > On Tue, Nov 6, 2018 at 4:29 AM zhijiang
> > > <wangzhijiang...@aliyun.com.invalid> wrote:
> > >
> > >> Thanks yangyu for launching this discussion.
> > >>
> > >> I really like this proposal. In production we have frequently seen
> > >> some long tail tasks delay the total batch job execution time.
> > >> We also have some thoughts on bringing in this mechanism. Looking
> > >> forward to your detailed design doc, then we can discuss further.
> > >>
> > >> Best,
> > >> Zhijiang
> > >> ------------------------------------------------------------------
> > >> From: Tao Yangyu <ryantao...@gmail.com>
> > >> Sent: Tuesday, November 6, 2018, 11:01
> > >> To: dev <dev@flink.apache.org>
> > >> Subject: [DISCUSS] Task speculative execution for Flink batch
> > >>
> > >> Hi everyone,
> > >>
> > >> We propose task speculative execution for Flink batch in this message,
> > >> as follows.
> > >>
> > >> In batch mode, a job is usually divided into multiple parallel tasks
> > >> executed across many nodes in the cluster. It is common to encounter
> > >> performance degradation on some nodes due to hardware problems or
> > >> occasionally busy I/O and high CPU load. This kind of degradation can
> > >> make the tasks running on the node quite slow; these are the so-called
> > >> long tail tasks. Although the long tail tasks will not fail, they can
> > >> severely affect the total job running time. The Flink task scheduler
> > >> does not currently take this long tail problem into account.
> > >>
> > >> Here we propose the speculative execution strategy to handle the
> > >> problem. The basic idea is to run a copy of a task on another node when
> > >> the original task is identified as a long tail task. In more detail,
> > >> the speculative task will be triggered when the scheduler detects that
> > >> the data processing throughput of a task is much lower than that of the
> > >> others. The speculative task is executed in parallel with the original
> > >> one and shares the same failure retry mechanism. Once either task
> > >> completes, the scheduler accepts its output as the final result and
> > >> cancels the other running one. Preliminary experiments have
> > >> demonstrated the effectiveness.
> > >>
> > >> The detailed design doc will be ready soon. Your reviews and comments
> > >> will be much appreciated.
> > >>
> > >> Thanks!
> > >>
> > >> Ryan
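
For readers following the thread, the strategy in Ryan's proposal (detect a task whose throughput is much lower than the others, run a copy in parallel, keep whichever finishes first and cancel the other) can be illustrated with a small Python sketch. This is not Flink code and not the design from the pending design doc. The function names, the median-based rule and the slow_ratio threshold are all assumptions made for illustration, and failure retry is left out.

```python
import statistics
import threading
from concurrent.futures import FIRST_COMPLETED, ThreadPoolExecutor, wait


def find_long_tail_tasks(throughputs, slow_ratio=0.5):
    """Return the ids of tasks whose throughput is below slow_ratio * median.

    throughputs maps a task id to records processed per second.
    The median rule and slow_ratio are hypothetical, not from the proposal.
    """
    if len(throughputs) < 2:
        return []  # nothing to compare a single task against
    median = statistics.median(throughputs.values())
    return sorted(t for t, rate in throughputs.items()
                  if rate < slow_ratio * median)


def run_with_speculation(attempts):
    """Run every attempt of one task in parallel and keep the first result.

    Each attempt is a callable that receives a threading.Event and is
    expected to stop early once that event is set (cooperative cancel).
    Failure handling is omitted: if the first finisher raised, so do we.
    """
    cancel = threading.Event()
    with ThreadPoolExecutor(max_workers=len(attempts)) as pool:
        futures = [pool.submit(attempt, cancel) for attempt in attempts]
        done, _ = wait(futures, return_when=FIRST_COMPLETED)
        cancel.set()  # ask the attempts still running to stop
        return next(iter(done)).result()
```

In this sketch a scheduler would call find_long_tail_tasks periodically on the measured throughputs and, for each task id returned, start a second attempt on another node; run_with_speculation stands in for the "first to complete wins, cancel the other" step.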