Hi Zhu,

Many thanks to Zhu Zhu for initiating the FLIP discussion. Overall I think it looks good; I just have three small questions.

1. How to judge whether an Execution Vertex is a slow task.
The current calculation is: the current timestamp minus the deployment timestamp of the execution. If the resulting execution time exceeds the baseline, the task is judged to be a slow task. Normally this is no problem. But if an execution fails, the time may not be accurate, because the measurement starts over from the new execution's deployment. For example, suppose the baseline is 59 minutes and a task fails after 56 minutes of execution. In the worst case, it may take an additional 59 minutes to discover that the task is a slow task.

2. Speculative Scheduler's fault tolerance strategy.
The strategy in the FLIP is: as long as the Execution Vertex can be executed, the fault tolerance strategy is not adopted even if an execution fails. Currently `ExecutionTimeBasedSlowTaskDetector` can restart an execution, but isn't this dependency a bit too strong? To some extent, the fault tolerance strategy and the slow task detection strategy are coupled together.

3. The value of the default configuration.
IMHO, speculative execution should only be required for relatively large-scale, very time-consuming, long-running jobs. If `slow-task-detector.execution-time.baseline-lower-bound` is too small, could the system end up always starting some additional tasks that have little effect? In that case the user would have to reset this default configuration. Would it be possible to consider a larger default value? Of course, for this part it is best to hear suggestions from other community users.

Best,
Guowei

On Thu, Apr 28, 2022 at 3:54 PM Jiangang Liu <liujiangangp...@gmail.com> wrote:

> +1 for the feature.
>
> Mang Zhang <zhangma...@163.com> wrote on Thu, Apr 28, 2022 at 11:36:
>
> > Hi zhu:
> >
> > This sounds like a great job! Thanks for your great job.
> > In our company, there are already some jobs using Flink Batch,
> > but everyone knows that the offline cluster has a lot more load than
> > the online cluster, and the failure rate of the machine is also much
> > higher.
> > If this work is done, we'd love to use it, it's simply awesome for our
> > flink users.
> > thanks again!
> >
> > --
> >
> > Best regards,
> > Mang Zhang
> >
> > At 2022-04-27 10:46:06, "Zhu Zhu" <zh...@apache.org> wrote:
> > >Hi everyone,
> > >
> > >More and more users are running their batch jobs on Flink nowadays.
> > >One major problem they encounter is slow tasks running on hot/bad
> > >nodes, resulting in very long and uncontrollable execution time of
> > >batch jobs. This problem is a pain or even unacceptable in
> > >production. Many users have been asking for a solution for it.
> > >
> > >Therefore, I'd like to revive the discussion of speculative
> > >execution to solve this problem.
> > >
> > >Weijun Wang, Jing Zhang, Lijie Wang and I had some offline
> > >discussions to refine the design[1]. We also implemented a PoC[2]
> > >and verified it using TPC-DS benchmarks and production jobs.
> > >
> > >Looking forward to your feedback!
> > >
> > >Thanks,
> > >Zhu
> > >
> > >[1] https://cwiki.apache.org/confluence/display/FLINK/FLIP-168%3A+Speculative+execution+for+Batch+Job
> > >[2] https://github.com/zhuzhurk/flink/commits/1.14-speculative-execution-poc
> > >
> > >刘建刚 <liujiangangp...@gmail.com> wrote on Mon, Dec 13, 2021 at 11:38:
> > >
> > >> Any progress on the feature? We have the same requirement in our company.
> > >> Since the soft and hard environment can be complex, it is normal to see a
> > >> slow task which determines the execution time of the flink job.
> > >>
> > >> <wangw...@sina.cn> wrote on Sun, Jun 20, 2021 at 22:35:
> > >>
> > >> > Hi everyone,
> > >> >
> > >> > I would like to kick off a discussion on speculative execution for batch
> > >> > job.
> > >> > I have created FLIP-168 [1] that clarifies our motivation to do this and
> > >> > some improvement proposals for the new design.
> > >> > It would be great to resolve the problem of long tail task in batch job.
> > >> > Please let me know your thoughts. Thanks.
> > >> > Regards,
> > >> > wangwj
> > >> > [1] https://cwiki.apache.org/confluence/display/FLINK/FLIP-168%3A+Speculative+execution+for+Batch+Job
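
P.S. To make the timing concern in question 1 concrete, here is a minimal sketch of the detection rule as I understand it (current timestamp minus deployment timestamp, compared against the baseline). This is not Flink code: `Execution`, `is_slow` and `BASELINE_MIN` are hypothetical names, time is counted in whole minutes, and I assume the failed execution is redeployed immediately after it fails.

```python
from dataclasses import dataclass

BASELINE_MIN = 59  # baseline from the example in question 1


@dataclass
class Execution:
    """Hypothetical stand-in for one execution attempt of an Execution Vertex."""
    deployed_at_min: int  # deployment timestamp, in minutes


def is_slow(execution: Execution, now_min: int, baseline_min: int = BASELINE_MIN) -> bool:
    # Rule as described in question 1: time since deployment exceeds the baseline.
    return now_min - execution.deployed_at_min > baseline_min


first_attempt = Execution(deployed_at_min=0)
retry = Execution(deployed_at_min=56)  # assumption: redeployed right after the failure

print(is_slow(first_attempt, now_min=56))  # False: it fails before reaching the baseline
print(is_slow(retry, now_min=115))         # False: the retry itself has only run 59 minutes
print(is_slow(retry, now_min=116))         # True: flagged only once the retry exceeds 59 minutes
```

Under these assumptions the vertex has been worked on for almost two hours before anything is flagged, because each new execution restarts the clock.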