Hi everyone,

According to the discussion and updates on the blocklist mechanism[1] (FLIP-224), I have updated FLIP-168 so that it decides on its own to block identified slow nodes. A new configuration option has also been added to control how long a slow node should be blocked.
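For readers following along, the shape of the configuration being discussed might look like the sketch below. The `slow-task-detector.execution-time.baseline-lower-bound` key is named later in this thread; the block-duration key name is an assumption for illustration, not a confirmed option from FLIP-168/FLIP-224.

```yaml
# flink-conf.yaml -- illustrative sketch only.
# Named later in this thread (default discussed below):
slow-task-detector.execution-time.baseline-lower-bound: 1 min
# Hypothetical name for the new option controlling how long an
# identified slow node stays blocked:
slow-task-detector.block-slow-node-duration: 10 min
```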
[1] https://lists.apache.org/thread/fngkk52kjbc6b6v9nn0lkfq6hhsbgb1h

Thanks,
Zhu

Zhu Zhu <reed...@gmail.com> wrote on Fri, Apr 29, 2022, 14:36:
>
> Thank you for all the feedback!
>
> @Guowei Ma
> Here are my thoughts on your questions:
>
>> 1. How to judge whether the Execution Vertex belongs to a slow task.
> If a slow task fails and gets restarted, it may not be a slow task
> anymore, especially given that the nodes of the slow task may have been
> blacklisted and the new task will be deployed to a new node. I think we
> should again go through the slow task detection process to determine
> whether it is a slow task. I agree that it is not ideal to take another
> 59 minutes to identify a slow task. To solve this problem, one idea is to
> introduce a slow task detection strategy that identifies slow tasks
> according to their throughput. This approach needs more thought and
> experiments, so we are targeting it for a future time.
>
>> 2. The fault tolerance strategy and the slow task detection strategy are
>> coupled
> I don't think fault tolerance and slow task detection are coupled.
> If a task fails while the ExecutionVertex still has a task in progress,
> there is no need to start new executions for the vertex from the perspective
> of fault tolerance. If the remaining task is slow, in the next slow task
> detection round a speculative execution will be created and deployed for it.
> This, however, is a normal speculative execution process rather than a
> failure recovery process. In this way, fault tolerance and slow task
> detection work without knowing about each other, and the job can still recover
> from failures while guaranteeing there are speculative executions for slow tasks.
>
>> 3. Default value of
>> `slow-task-detector.execution-time.baseline-lower-bound` is too small
> From what I see in production and hear from users, there are many
> batch jobs of a relatively small scale (a few terabytes, or hundreds of
> gigabytes).
> Tasks of these jobs can finish in minutes, so a `1 min` lower bound is
> large enough. Besides that, I think the out-of-box experience is more
> important for users running small-scale jobs.
>
> Thanks,
> Zhu
>
> Guowei Ma <guowei....@gmail.com> wrote on Thu, Apr 28, 2022, 17:55:
>>
>> Hi, Zhu
>>
>> Many thanks to Zhu Zhu for initiating the FLIP discussion. Overall I think
>> it's OK; I just have 3 small questions.
>>
>> 1. How to judge whether the Execution Vertex belongs to a slow task.
>> The current calculation method is: the current timestamp minus the
>> timestamp of the execution deployment. If the execution time of an
>> execution exceeds the baseline, it is judged to be a slow task. Normally
>> this is no problem. But if an execution fails, the time may not be
>> accurate. For example, suppose the baseline is 59 minutes and a task fails after
>> 56 minutes of execution. In the worst case, it may take an additional 59
>> minutes to discover that the task is a slow task.
>>
>> 2. The speculative scheduler's fault tolerance strategy.
>> The strategy in the FLIP is: if the Execution Vertex can be executed, even if
>> an execution fails, the fault tolerance strategy will not be adopted.
>> Currently `ExecutionTimeBasedSlowTaskDetector` can restart an
>> execution, but isn't this dependency a bit too strong? To some extent, the
>> fault tolerance strategy and the slow task detection strategy are coupled
>> together.
>>
>> 3. The default configuration values.
>> IMHO, speculative execution should only be required for relatively
>> large-scale, very time-consuming, long-running jobs.
>> If `slow-task-detector.execution-time.baseline-lower-bound` is too small,
>> is it possible for the system to keep starting additional tasks that
>> have little effect? In the end, the user would need to reset this default
>> configuration. Is it possible to consider a larger value? Of
>> course, for this part it is best to listen to the suggestions of other community
>> users.
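The execution-time-based detection Guowei paraphrases above can be modeled as a short sketch. This is an illustrative simplification in Python, not Flink's actual `ExecutionTimeBasedSlowTaskDetector`; the class name, median-based baseline, and field names are assumptions made for the example.

```python
import time


class ExecutionTimeBasedDetector:
    """Toy model of execution-time-based slow task detection.

    An execution is flagged as slow when (now - deploy_timestamp) exceeds
    the baseline, where the baseline is never smaller than the configured
    lower bound. Illustrative sketch only, not Flink's implementation.
    """

    def __init__(self, baseline_lower_bound_sec=60.0):
        self.baseline_lower_bound_sec = baseline_lower_bound_sec

    def baseline(self, finished_durations_sec):
        # Derive a baseline from finished executions (here: the median),
        # clamped from below by the configured lower bound.
        if not finished_durations_sec:
            return float("inf")  # nothing finished yet -> no baseline
        ordered = sorted(finished_durations_sec)
        median = ordered[len(ordered) // 2]
        return max(median, self.baseline_lower_bound_sec)

    def slow_executions(self, running, finished_durations_sec, now=None):
        # running: {execution_id: deploy_timestamp_sec}
        now = time.time() if now is None else now
        base = self.baseline(finished_durations_sec)
        return [eid for eid, deploy_ts in running.items()
                if now - deploy_ts > base]


# Guowei's worst case: with a ~59-minute baseline, a task that fails after
# 56 minutes and is redeployed starts its execution-time clock over, so it
# can take up to another 59 minutes to be flagged again.
detector = ExecutionTimeBasedDetector(baseline_lower_bound_sec=60.0)
finished = [59 * 60.0] * 5              # baseline ~ 59 min
running = {"restarted-task": 0.0}       # redeployed at t=0
print(detector.slow_executions(running, finished, now=56 * 60.0))  # -> []
print(detector.slow_executions(running, finished, now=60 * 60.0))  # -> ['restarted-task']
```

This also shows why Zhu's throughput-based idea is attractive: a throughput signal would not reset to zero when an execution is redeployed.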
>>
>> Best,
>> Guowei
>>
>>
>> On Thu, Apr 28, 2022 at 3:54 PM Jiangang Liu <liujiangangp...@gmail.com>
>> wrote:
>>
>> > +1 for the feature.
>> >
>> > Mang Zhang <zhangma...@163.com> wrote on Thu, Apr 28, 2022, 11:36:
>> >
>> > > Hi Zhu:
>> > >
>> > > This sounds like great work! Thanks for your effort.
>> > > In our company, there are already some jobs using Flink Batch,
>> > > but everyone knows that an offline cluster has a much higher load than
>> > > an online cluster, and the failure rate of the machines is also much
>> > > higher.
>> > > If this work is done, we'd love to use it; it would simply be awesome
>> > > for our Flink users.
>> > > Thanks again!
>> > >
>> > > --
>> > >
>> > > Best regards,
>> > > Mang Zhang
>> > >
>> > > At 2022-04-27 10:46:06, "Zhu Zhu" <zh...@apache.org> wrote:
>> > > >Hi everyone,
>> > > >
>> > > >More and more users are running their batch jobs on Flink nowadays.
>> > > >One major problem they encounter is slow tasks running on hot/bad
>> > > >nodes, resulting in very long and uncontrollable execution times for
>> > > >batch jobs. This problem is a pain point, or even unacceptable, in
>> > > >production. Many users have been asking for a solution to it.
>> > > >
>> > > >Therefore, I'd like to revive the discussion of speculative
>> > > >execution to solve this problem.
>> > > >
>> > > >Weijun Wang, Jing Zhang, Lijie Wang and I had some offline
>> > > >discussions to refine the design[1]. We also implemented a PoC[2]
>> > > >and verified it using TPC-DS benchmarks and production jobs.
>> > > >
>> > > >Looking forward to your feedback!
>> > > >
>> > > >Thanks,
>> > > >Zhu
>> > > >
>> > > >[1]
>> > > >https://cwiki.apache.org/confluence/display/FLINK/FLIP-168%3A+Speculative+execution+for+Batch+Job
>> > > >[2]
>> > > >https://github.com/zhuzhurk/flink/commits/1.14-speculative-execution-poc
>> > > >
>> > > >
>> > > >刘建刚 <liujiangangp...@gmail.com> wrote on Mon, Dec 13, 2021, 11:38:
>> > > >
>> > > >> Any progress on the feature? We have the same requirement in our
>> > > >> company. Since the software and hardware environment can be complex,
>> > > >> it is normal to see a slow task that determines the execution time
>> > > >> of the whole Flink job.
>> > > >>
>> > > >> <wangw...@sina.cn> wrote on Sun, Jun 20, 2021, 22:35:
>> > > >>
>> > > >> > Hi everyone,
>> > > >> >
>> > > >> > I would like to kick off a discussion on speculative execution for
>> > > >> > batch jobs.
>> > > >> > I have created FLIP-168 [1], which clarifies our motivation to do
>> > > >> > this and some improvement proposals for the new design.
>> > > >> > It would be great to resolve the problem of long-tail tasks in
>> > > >> > batch jobs.
>> > > >> > Please let me know your thoughts. Thanks.
>> > > >> >
>> > > >> > Regards,
>> > > >> > wangwj
>> > > >> > [1]
>> > > >> > https://cwiki.apache.org/confluence/display/FLINK/FLIP-168%3A+Speculative+execution+for+Batch+Job
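The decoupling between failure recovery and slow-task detection that Zhu describes near the top of this thread can be sketched as a toy model. All names here are invented for illustration; this is not Flink scheduler code.

```python
class VertexModel:
    """Toy model: an ExecutionVertex that may have concurrent executions.

    Fault tolerance and speculation are separate code paths that only
    observe the shared set of running executions (per Zhu's description).
    """

    def __init__(self):
        self.running = set()

    def on_failure(self, execution_id):
        # Fault tolerance path: restart only if no other execution of this
        # vertex is still in progress.
        self.running.discard(execution_id)
        if not self.running:
            self.running.add(f"restart-of-{execution_id}")
            return "restarted"
        return "no-op"  # another attempt still runs; leave recovery alone

    def on_slow_detected(self):
        # Speculation path: the next detection round simply adds a
        # speculative attempt; it never reasons about past failures.
        new_id = f"speculative-{len(self.running)}"
        self.running.add(new_id)
        return new_id


v = VertexModel()
v.running = {"attempt-0", "attempt-1"}
print(v.on_failure("attempt-1"))   # -> no-op (attempt-0 still running)
print(v.on_slow_detected())        # a later detection round adds a speculation
```

The point of the sketch is that neither method calls the other: recovery stays a pure function of "is anything still running", and speculation stays a pure function of "is the survivor slow".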