[ https://issues.apache.org/jira/browse/FLINK-10644?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17332438#comment-17332438 ]
wangwj edited comment on FLINK-10644 at 5/1/21, 2:45 PM:
---------------------------------------------------------

[~trohrmann] Hi, I have implemented this feature, and it has had a very significant effect in our production cluster. We are colleagues at Alibaba; I will talk with you in detail on DingTalk.


was (Author: wangwj):
[~trohrmann] Hi,I have implemented this feature, and it has a very significant effect in our product cluster. We are Alibaba's colleagues, I will talk with you in detail on DingTalk.

> Batch Job: Speculative execution
> --------------------------------
>
>                 Key: FLINK-10644
>                 URL: https://issues.apache.org/jira/browse/FLINK-10644
>             Project: Flink
>          Issue Type: New Feature
>          Components: Runtime / Coordination
>            Reporter: JIN SUN
>            Assignee: BoWang
>            Priority: Major
>              Labels: stale-assigned
>
> Stragglers/outliers are tasks that run slower than most of the other tasks in a
> batch job. They increase job latency, because a straggler usually sits on the
> critical path of the job and becomes the bottleneck.
> Tasks may be slow for various reasons, including hardware degradation,
> software misconfiguration, or noisy neighbors. It is hard for the JM to predict
> the runtime.
> To reduce the overhead of stragglers, other systems such as Hadoop/Tez and Spark
> have *_speculative execution_*. Speculative execution is a health-check
> procedure that checks for tasks to be speculated, i.e. tasks in an
> ExecutionJobVertex (EJV) that are running slower than the median of all
> successfully completed tasks in that EJV. Such slow tasks are re-submitted to
> another TM. The slow task is not stopped; a new copy runs in parallel, and the
> remaining copies are killed as soon as one of them completes.
> This JIRA is an umbrella for applying this idea in Flink. Details will be
> appended later.
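
The straggler check described in the issue (compare each running task against the median of the tasks that already completed in the same EJV) could be sketched roughly as below. This is only an illustration, not Flink code: the class `SpeculationSketch`, the method `findStragglers`, and the `slowdownFactor` parameter are all hypothetical names introduced here. A factor of 1.0 corresponds to the plain "slower than the median" rule from the description; a larger factor is an assumed tuning knob.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper for illustration only; not part of Flink's API.
final class SpeculationSketch {

    // Median duration of the tasks that already finished successfully.
    static double median(long[] completedMillis) {
        long[] sorted = completedMillis.clone();
        Arrays.sort(sorted);
        int mid = sorted.length / 2;
        return (sorted.length % 2 == 1)
                ? sorted[mid]
                : (sorted[mid - 1] + sorted[mid]) / 2.0;
    }

    // Returns the ids of running tasks whose elapsed time exceeds
    // median(completed) * slowdownFactor. These are the candidates for
    // which a second copy would be submitted to another TM.
    static List<String> findStragglers(Map<String, Long> runningElapsedMillis,
                                       long[] completedMillis,
                                       double slowdownFactor) {
        List<String> stragglers = new ArrayList<>();
        if (completedMillis.length == 0) {
            return stragglers; // no baseline yet, so nothing to speculate
        }
        double threshold = median(completedMillis) * slowdownFactor;
        for (Map.Entry<String, Long> entry : runningElapsedMillis.entrySet()) {
            if (entry.getValue() > threshold) {
                stragglers.add(entry.getKey());
            }
        }
        return stragglers;
    }

    public static void main(String[] args) {
        long[] completed = {100, 120, 110, 130};  // median is 115 ms
        Map<String, Long> running = new TreeMap<>();
        running.put("task-4", 115L);
        running.put("task-5", 400L);
        // The threshold is 115 * 1.5 = 172.5 ms, so only task-5 qualifies.
        System.out.println(findStragglers(running, completed, 1.5)); // prints [task-5]
    }
}
```

The sketch covers only the detection step. Launching the copy on another TM and cancelling the losing attempts once one copy completes would live in the scheduler and is not shown.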