Re: Re: [DISCUSS] FLIP-168: Speculative execution for Batch Job

2022-05-26 Thread Zhu Zhu
Hi everyone, Thank you for all the feedback on this FLIP! I will open a vote for it since there are no further concerns. Thanks, Zhu Zhu Zhu wrote on Wed, May 11, 2022 at 12:29: > > Hi everyone, > > According to the discussion and updates of the blocklist > mechanism[1] (FLIP-224), I have updated FLIP-168 to make

Re: Re: [DISCUSS] FLIP-168: Speculative execution for Batch Job

2022-05-10 Thread Zhu Zhu
Hi everyone, Following the discussion and updates of the blocklist mechanism[1] (FLIP-224), I have updated FLIP-168 so that it decides by itself to block identified slow nodes. A new configuration option is also added to control how long a slow node should be blocked. [1] https://lists.apache.org/threa

Re: Re: [DISCUSS] FLIP-168: Speculative execution for Batch Job

2022-04-28 Thread Zhu Zhu
Thank you for all the feedback! @Guowei Ma Here are my thoughts on your questions: >> 1. How to judge whether the Execution Vertex belongs to a slow task. If a slow task fails and gets restarted, it may not be a slow task anymore. Especially given that the nodes of the slow task may have been black

Re: Re: [DISCUSS] FLIP-168: Speculative execution for Batch Job

2022-04-28 Thread Guowei Ma
Hi Zhu, Many thanks to Zhu Zhu for initiating the FLIP discussion. Overall I think it's OK; I just have 3 small questions. 1. How to judge whether the Execution Vertex belongs to a slow task. The current calculation method is: the current timestamp minus the timestamp of the execution deployment. I

Re: Re: [DISCUSS] FLIP-168: Speculative execution for Batch Job

2022-04-28 Thread Jiangang Liu
+1 for the feature. Mang Zhang wrote on Thu, Apr 28, 2022 at 11:36: > Hi zhu: > > > This sounds like a great job! Thanks for your great job. > In our company, there are already some jobs using Flink Batch, > but everyone knows that the offline cluster has a lot more load than > the online cluster,

Re: Re: [DISCUSS] FLIP-168: Speculative execution for Batch Job

2022-04-27 Thread Mang Zhang
Hi Zhu: This sounds like great work! Thank you for it. In our company, there are already some jobs using Flink Batch, but everyone knows that the offline cluster has a lot more load than the online cluster, and the failure rate of the machines is also much higher. If th

Re: [DISCUSS] FLIP-168: Speculative execution for Batch Job

2022-04-26 Thread Zhu Zhu
Hi everyone, More and more users are running their batch jobs on Flink nowadays. One major problem they encounter is slow tasks running on hot/bad nodes, resulting in very long and uncontrollable execution time of batch jobs. This problem is a pain or even unacceptable in production. Many users ha

Re: [DISCUSS] FLIP-168: Speculative execution for Batch Job

2021-12-12 Thread 刘建刚
Any progress on the feature? We have the same requirement in our company. Since the software and hardware environments can be complex, it is normal to see a slow task that determines the execution time of the Flink job. wrote on Sun, Jun 20, 2021 at 22:35: > Hi everyone, > > I would like to kick off a discussion on s

[DISCUSS] FLIP-168: Speculative execution for Batch Job

2021-06-20 Thread wangwj03
Hi everyone, I would like to kick off a discussion on speculative execution for batch jobs. I have created FLIP-168 [1], which clarifies our motivation for doing this and proposes some improvements for the new design. It would be great to resolve the problem of long-tail tasks in batch jobs. Please let