Hi everyone,

More and more users are running batch jobs on Flink nowadays.
One major problem they encounter is slow tasks running on hot or bad
nodes, which results in very long and uncontrollable execution times
for batch jobs. This is painful, or even unacceptable, in
production, and many users have been asking for a solution.

Therefore, I'd like to revive the discussion of speculative
execution to solve this problem.

Weijun Wang, Jing Zhang, Lijie Wang and I had some offline
discussions to refine the design[1]. We also implemented a PoC[2]
and verified it using TPC-DS benchmarks and production jobs.

Looking forward to your feedback!

Thanks,
Zhu

[1]
https://cwiki.apache.org/confluence/display/FLINK/FLIP-168%3A+Speculative+execution+for+Batch+Job
[2] https://github.com/zhuzhurk/flink/commits/1.14-speculative-execution-poc


刘建刚 <liujiangangp...@gmail.com> wrote on Mon, Dec 13, 2021 at 11:38:

> Any progress on this feature? We have the same requirement in our company.
> Since the software and hardware environment can be complex, it is common to
> see a single slow task determine the execution time of a Flink job.
>
> <wangw...@sina.cn> wrote on Sun, Jun 20, 2021 at 22:35:
>
> > Hi everyone,
> >
> > I would like to kick off a discussion on speculative execution for batch
> > jobs.
> > I have created FLIP-168 [1], which explains our motivation and
> > proposes some improvements for the new design.
> > It would be great to resolve the problem of long-tail tasks in batch jobs.
> > Please let me know your thoughts. Thanks.
> > Regards,
> > wangwj
> > [1]
> > https://cwiki.apache.org/confluence/display/FLINK/FLIP-168%3A+Speculative+execution+for+Batch+Job
> >
>
