Hi everyone,

According to the discussion and updates on the blocklist mechanism[1] (FLIP-224), I have updated FLIP-168 so that it decides on its own to block identified slow nodes. A new configuration option has also been added to control how long a slow node should be blocked.
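For readers following along, the shape of the configuration being discussed might look like the sketch below. The `slow-task-detector.execution-time.baseline-lower-bound` key is named later in this thread; the block-duration key name is an assumption for illustration, not a confirmed option from FLIP-168/FLIP-224.

```yaml
# flink-conf.yaml -- illustrative sketch only.
# Named later in this thread (default discussed below):
slow-task-detector.execution-time.baseline-lower-bound: 1 min
# Hypothetical name for the new option controlling how long an
# identified slow node stays blocked:
slow-task-detector.block-slow-node-duration: 10 min
```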
[1] https://lists.apache.org/thread/fngkk52kjbc6b6v9nn0lkfq6hhsbgb1h

Thanks,
Zhu

Zhu Zhu <reed...@gmail.com> wrote on Fri, Apr 29, 2022, 14:36:
>
> Thank you for all the feedback!
>
> @Guowei Ma
> Here are my thoughts on your questions:
>
>> 1. How to judge whether the Execution Vertex belongs to a slow task.
> If a slow task fails and gets restarted, it may not be a slow task
> anymore, especially given that the nodes of the slow task may have been
> blacklisted and the new task will be deployed to a new node. I think we
> should again go through the slow task detection process to determine
> whether it is a slow task. I agree that it is not ideal to take another
> 59 minutes to identify a slow task. To solve this problem, one idea is to
> introduce a slow task detection strategy that identifies slow tasks
> according to their throughput. This approach needs more thought and
> experiments, so we are targeting it for a future time.
>
>> 2. The fault tolerance strategy and the slow task detection strategy are
>> coupled
> I don't think fault tolerance and slow task detection are coupled.
> If a task fails while the ExecutionVertex still has a task in progress,
> there is no need to start new executions for the vertex from the perspective
> of fault tolerance. If the remaining task is slow, in the next slow task
> detection round a speculative execution will be created and deployed for it.
> This, however, is a normal speculative execution process rather than a
> failure recovery process. In this way, fault tolerance and slow task
> detection work without knowing about each other, and the job can still recover
> from failures while guaranteeing there are speculative executions for slow tasks.
>
>> 3. Default value of
>> `slow-task-detector.execution-time.baseline-lower-bound` is too small
> From what I see in production and hear from users, there are many
> batch jobs of a relatively small scale (a few terabytes, or hundreds of
> gigabytes).
> Tasks of these jobs can finish in minutes, so a `1 min` lower bound is
> large enough. Besides that, I think the out-of-box experience is more
> important for users running small-scale jobs.
>
> Thanks,
> Zhu
>
> Guowei Ma <guowei....@gmail.com> wrote on Thu, Apr 28, 2022, 17:55:
>>
>> Hi, Zhu
>>
>> Many thanks to Zhu Zhu for initiating the FLIP discussion. Overall I think
>> it's OK; I just have 3 small questions.
>>
>> 1. How to judge whether the Execution Vertex belongs to a slow task.
>> The current calculation method is: the current timestamp minus the
>> timestamp of the execution deployment. If the execution time of an
>> execution exceeds the baseline, it is judged to be a slow task. Normally
>> this is no problem. But if an execution fails, the time may not be
>> accurate. For example, suppose the baseline is 59 minutes and a task fails after
>> 56 minutes of execution. In the worst case, it may take an additional 59
>> minutes to discover that the task is a slow task.
>>
>> 2. The speculative scheduler's fault tolerance strategy.
>> The strategy in the FLIP is: if the Execution Vertex can be executed, even if
>> an execution fails, the fault tolerance strategy will not be adopted.
>> Currently `ExecutionTimeBasedSlowTaskDetector` can restart an
>> execution, but isn't this dependency a bit too strong? To some extent, the
>> fault tolerance strategy and the slow task detection strategy are coupled
>> together.
>>
>> 3. The default configuration values.
>> IMHO, speculative execution should only be required for relatively
>> large-scale, very time-consuming, long-running jobs.
>> If `slow-task-detector.execution-time.baseline-lower-bound` is too small,
>> is it possible for the system to keep starting additional tasks that
>> have little effect? In the end, the user would need to reset this default
>> configuration. Is it possible to consider a larger value? Of
>> course, for this part it is best to listen to the suggestions of other community
>> users.
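The execution-time-based detection Guowei paraphrases above can be modeled as a short sketch. This is an illustrative simplification in Python, not Flink's actual `ExecutionTimeBasedSlowTaskDetector`; the class name, median-based baseline, and field names are assumptions made for the example.

```python
import time


class ExecutionTimeBasedDetector:
    """Toy model of execution-time-based slow task detection.

    An execution is flagged as slow when (now - deploy_timestamp) exceeds
    the baseline, where the baseline is never smaller than the configured
    lower bound. Illustrative sketch only, not Flink's implementation.
    """

    def __init__(self, baseline_lower_bound_sec=60.0):
        self.baseline_lower_bound_sec = baseline_lower_bound_sec

    def baseline(self, finished_durations_sec):
        # Derive a baseline from finished executions (here: the median),
        # clamped from below by the configured lower bound.
        if not finished_durations_sec:
            return float("inf")  # nothing finished yet -> no baseline
        ordered = sorted(finished_durations_sec)
        median = ordered[len(ordered) // 2]
        return max(median, self.baseline_lower_bound_sec)

    def slow_executions(self, running, finished_durations_sec, now=None):
        # running: {execution_id: deploy_timestamp_sec}
        now = time.time() if now is None else now
        base = self.baseline(finished_durations_sec)
        return [eid for eid, deploy_ts in running.items()
                if now - deploy_ts > base]


# Guowei's worst case: with a ~59-minute baseline, a task that fails after
# 56 minutes and is redeployed starts its execution-time clock over, so it
# can take up to another 59 minutes to be flagged again.
detector = ExecutionTimeBasedDetector(baseline_lower_bound_sec=60.0)
finished = [59 * 60.0] * 5              # baseline ~ 59 min
running = {"restarted-task": 0.0}       # redeployed at t=0
print(detector.slow_executions(running, finished, now=56 * 60.0))  # -> []
print(detector.slow_executions(running, finished, now=60 * 60.0))  # -> ['restarted-task']
```

This also shows why Zhu's throughput-based idea is attractive: a throughput signal would not reset to zero when an execution is redeployed.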
>>
>> Best,
>> Guowei
>>
>>
>> On Thu, Apr 28, 2022 at 3:54 PM Jiangang Liu <liujiangangp...@gmail.com>
>> wrote:
>>
>> > +1 for the feature.
>> >
>> > Mang Zhang <zhangma...@163.com> wrote on Thu, Apr 28, 2022, 11:36:
>> >
>> > > Hi Zhu:
>> > >
>> > > This sounds like great work! Thanks for your effort.
>> > > In our company, there are already some jobs using Flink Batch,
>> > > but everyone knows that an offline cluster has a much higher load than
>> > > an online cluster, and the failure rate of the machines is also much
>> > > higher.
>> > > If this work is done, we'd love to use it; it would simply be awesome
>> > > for our Flink users.
>> > > Thanks again!
>> > >
>> > > --
>> > >
>> > > Best regards,
>> > > Mang Zhang
>> > >
>> > > At 2022-04-27 10:46:06, "Zhu Zhu" <zh...@apache.org> wrote:
>> > > >Hi everyone,
>> > > >
>> > > >More and more users are running their batch jobs on Flink nowadays.
>> > > >One major problem they encounter is slow tasks running on hot/bad
>> > > >nodes, resulting in very long and uncontrollable execution times for
>> > > >batch jobs. This problem is a pain point, or even unacceptable, in
>> > > >production. Many users have been asking for a solution to it.
>> > > >
>> > > >Therefore, I'd like to revive the discussion of speculative
>> > > >execution to solve this problem.
>> > > >
>> > > >Weijun Wang, Jing Zhang, Lijie Wang and I had some offline
>> > > >discussions to refine the design[1]. We also implemented a PoC[2]
>> > > >and verified it using TPC-DS benchmarks and production jobs.
>> > > >
>> > > >Looking forward to your feedback!
>> > > >
>> > > >Thanks,
>> > > >Zhu
>> > > >
>> > > >[1]
>> > > >https://cwiki.apache.org/confluence/display/FLINK/FLIP-168%3A+Speculative+execution+for+Batch+Job
>> > > >[2]
>> > > >https://github.com/zhuzhurk/flink/commits/1.14-speculative-execution-poc
>> > > >
>> > > >
>> > > >刘建刚 <liujiangangp...@gmail.com> wrote on Mon, Dec 13, 2021, 11:38:
>> > > >
>> > > >> Any progress on the feature? We have the same requirement in our
>> > > >> company. Since the software and hardware environment can be complex,
>> > > >> it is normal to see a slow task that determines the execution time
>> > > >> of the whole Flink job.
>> > > >>
>> > > >> <wangw...@sina.cn> wrote on Sun, Jun 20, 2021, 22:35:
>> > > >>
>> > > >> > Hi everyone,
>> > > >> >
>> > > >> > I would like to kick off a discussion on speculative execution for
>> > > >> > batch jobs.
>> > > >> > I have created FLIP-168 [1], which clarifies our motivation to do
>> > > >> > this and some improvement proposals for the new design.
>> > > >> > It would be great to resolve the problem of long-tail tasks in
>> > > >> > batch jobs.
>> > > >> > Please let me know your thoughts. Thanks.
>> > > >> >
>> > > >> > Regards,
>> > > >> > wangwj
>> > > >> > [1]
>> > > >> > https://cwiki.apache.org/confluence/display/FLINK/FLIP-168%3A+Speculative+execution+for+Batch+Job
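The decoupling between failure recovery and slow-task detection that Zhu describes near the top of this thread can be sketched as a toy model. All names here are invented for illustration; this is not Flink scheduler code.

```python
class VertexModel:
    """Toy model: an ExecutionVertex that may have concurrent executions.

    Fault tolerance and speculation are separate code paths that only
    observe the shared set of running executions (per Zhu's description).
    """

    def __init__(self):
        self.running = set()

    def on_failure(self, execution_id):
        # Fault tolerance path: restart only if no other execution of this
        # vertex is still in progress.
        self.running.discard(execution_id)
        if not self.running:
            self.running.add(f"restart-of-{execution_id}")
            return "restarted"
        return "no-op"  # another attempt still runs; leave recovery alone

    def on_slow_detected(self):
        # Speculation path: the next detection round simply adds a
        # speculative attempt; it never reasons about past failures.
        new_id = f"speculative-{len(self.running)}"
        self.running.add(new_id)
        return new_id


v = VertexModel()
v.running = {"attempt-0", "attempt-1"}
print(v.on_failure("attempt-1"))   # -> no-op (attempt-0 still running)
print(v.on_slow_detected())        # a later detection round adds a speculation
```

The point of the sketch is that neither method calls the other: recovery stays a pure function of "is anything still running", and speculation stays a pure function of "is the survivor slow".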