+1, Thanks Yangyu for proposing this very useful feature. Looking forward to the design doc.
On Wed, Nov 7, 2018 at 10:15 AM SHI Xiaogang <shixiaoga...@gmail.com> wrote:

> Hi,
>
> +1 for the speculative execution.
>
> It would be even better if it worked well with the existing
> checkpointing and pipelined execution. That way, we can move a step
> further towards the unification of batch and stream processing.
>
> Regards,
> Xiaogang
>
> Jeff Zhang <zjf...@gmail.com> wrote on Wed, Nov 7, 2018 at 9:40 AM:
>
> > +1 for the speculative execution for Flink batch. Speculative
> > execution is used in lots of batch execution engines such as
> > MapReduce, Tez, and Spark. This would be a great improvement for
> > Flink in batch scenarios.
> >
> > Jin Sun <isun...@gmail.com> wrote on Wed, Nov 7, 2018 at 8:38 AM:
> >
> > > I think this targets batch at the very beginning; the idea should
> > > also work for both cases, with a different algorithm/strategy.
> > >
> > > Ryan, since you are working on this, I will assign FLINK-10644
> > > <https://issues.apache.org/jira/browse/FLINK-10644> to you.
> > >
> > > Jin
> > >
> > > > On Nov 6, 2018, at 4:45 AM, Till Rohrmann <trohrm...@apache.org> wrote:
> > > >
> > > > Thanks for starting this discussion, Ryan. I'm looking forward to
> > > > your design document about this feature. Quick question: will it
> > > > be a batch-only feature? If not, then it needs to take
> > > > checkpointing into account as well.
> > > >
> > > > Cheers,
> > > > Till
> > > >
> > > > On Tue, Nov 6, 2018 at 4:29 AM zhijiang
> > > > <wangzhijiang...@aliyun.com.invalid> wrote:
> > > >
> > > >> Thanks, Yangyu, for launching this discussion.
> > > >>
> > > >> I really like this proposal. In production, we have frequently
> > > >> seen a few long-tail tasks delay the total execution time of a
> > > >> batch job. We also have some thoughts on bringing in this
> > > >> mechanism. Looking forward to your detailed design doc; then we
> > > >> can discuss further.
> > > >>
> > > >> Best,
> > > >> Zhijiang
> > > >> ------------------------------------------------------------------
> > > >> From: Tao Yangyu <ryantao...@gmail.com>
> > > >> Sent: Tuesday, Nov 6, 2018, 11:01
> > > >> To: dev <dev@flink.apache.org>
> > > >> Subject: [DISCUSS] Task speculative execution for Flink batch
> > > >>
> > > >> Hi everyone,
> > > >>
> > > >> In this message we propose task speculative execution for Flink
> > > >> batch, as follows.
> > > >>
> > > >> In batch mode, a job is usually divided into multiple parallel
> > > >> tasks executed across many nodes in the cluster. It is common to
> > > >> encounter performance degradation on some nodes due to hardware
> > > >> problems or occasional I/O contention and high CPU load. This
> > > >> kind of degradation can make the tasks running on the node quite
> > > >> slow; these are the so-called long-tail tasks. Although long-tail
> > > >> tasks will not fail, they can severely affect the total job
> > > >> running time. The Flink task scheduler does not currently take
> > > >> this long-tail problem into account.
> > > >>
> > > >> Here we propose a speculative execution strategy to handle the
> > > >> problem. The basic idea is to run a copy of a task on another
> > > >> node when the original task is identified as a long-tail task. In
> > > >> more detail, the speculative task will be triggered when the
> > > >> scheduler detects that the data processing throughput of a task
> > > >> is much lower than that of the others. The speculative task is
> > > >> executed in parallel with the original one and shares the same
> > > >> failure retry mechanism. Once either task completes, the
> > > >> scheduler accepts its output as the final result and cancels the
> > > >> other running one. Preliminary experiments have demonstrated the
> > > >> effectiveness.
> > > >>
> > > >> The detailed design doc will be ready soon.
> > > >> Your reviews and comments will be much appreciated.
> > > >>
> > > >> Thanks!
> > > >>
> > > >> Ryan
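
For readers of the archive, the mechanism described in Ryan's original proposal (detect a task whose throughput is much lower than its peers, start a copy, keep whichever attempt finishes first and cancel the other) can be illustrated with a small sketch. This is not Flink code and does not reflect the pending design doc. The names `find_long_tail_tasks`, `run_with_speculation`, and `simulated_task`, the median baseline, and the 0.5 ratio are all assumptions made purely for illustration.

```python
import statistics
import threading
import time
from concurrent.futures import FIRST_COMPLETED, ThreadPoolExecutor, wait


def find_long_tail_tasks(throughput_by_task, slow_ratio=0.5):
    """Return ids of tasks whose throughput is below slow_ratio * median.

    Using the median as the baseline and 0.5 as the ratio is an
    illustrative assumption, not something stated in the proposal.
    """
    median = statistics.median(throughput_by_task.values())
    return sorted(
        task_id
        for task_id, throughput in throughput_by_task.items()
        if throughput < slow_ratio * median
    )


def run_with_speculation(work, delays):
    """Run one attempt per entry in `delays` and keep the first to finish.

    `work(attempt, delay, cancelled)` is a hypothetical task body that
    polls the `cancelled` event so the losing attempt can stop early.
    """
    cancelled = threading.Event()
    with ThreadPoolExecutor(max_workers=len(delays)) as pool:
        futures = [
            pool.submit(work, attempt, delay, cancelled)
            for attempt, delay in delays.items()
        ]
        done, _ = wait(futures, return_when=FIRST_COMPLETED)
        cancelled.set()  # ask the slower attempt to stop
        return next(iter(done)).result()


def simulated_task(attempt, delay, cancelled):
    """Pretend to process data for `delay` seconds unless cancelled."""
    deadline = time.monotonic() + delay
    while time.monotonic() < deadline:
        if cancelled.is_set():
            return None  # a cancelled attempt produces no result
        time.sleep(0.01)
    return attempt


if __name__ == "__main__":
    rates = {"t1": 100.0, "t2": 95.0, "t3": 20.0, "t4": 110.0}
    print(find_long_tail_tasks(rates))
    print(run_with_speculation(
        simulated_task, {"original": 1.0, "speculative": 0.05}))
```

In this toy run, `t3` is flagged because its throughput is far below the median of its peers, and the faster "speculative" attempt wins while the slow original is asked to stop. A real scheduler would also have to handle output visibility, retries, and resource limits, which this sketch ignores.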