Hi everyone,

More and more users are running their batch jobs on Flink. One major problem they encounter is slow tasks running on hot or bad nodes, which results in very long and unpredictable execution times for batch jobs. This problem is painful, and sometimes unacceptable, in production. Many users have been asking for a solution.
Therefore, I'd like to revive the discussion of speculative execution to solve this problem. Weijun Wang, Jing Zhang, Lijie Wang and I had some offline discussions to refine the design [1]. We also implemented a PoC [2] and verified it using TPC-DS benchmarks and production jobs.

Looking forward to your feedback!

Thanks,
Zhu

[1] https://cwiki.apache.org/confluence/display/FLINK/FLIP-168%3A+Speculative+execution+for+Batch+Job
[2] https://github.com/zhuzhurk/flink/commits/1.14-speculative-execution-poc

刘建刚 <liujiangangp...@gmail.com> wrote on Mon, Dec 13, 2021 at 11:38:

> Any progress on the feature? We have the same requirement in our company.
> Since the software and hardware environment can be complex, it is normal to
> see a slow task that determines the execution time of the Flink job.
>
> <wangw...@sina.cn> wrote on Sun, Jun 20, 2021 at 22:35:
>
> > Hi everyone,
> >
> > I would like to kick off a discussion on speculative execution for batch
> > jobs.
> > I have created FLIP-168 [1], which clarifies our motivation for doing this
> > and includes some improvement proposals for the new design.
> > It would be great to resolve the problem of long-tail tasks in batch jobs.
> > Please let me know your thoughts. Thanks.
> >
> > Regards,
> > wangwj
> >
> > [1]
> > https://cwiki.apache.org/confluence/display/FLINK/FLIP-168%3A+Speculative+execution+for+Batch+Job