zhuzhurk commented on code in PR #20507:
URL: https://github.com/apache/flink/pull/20507#discussion_r941037269


##########
docs/content/docs/deployment/speculative_execution.md:
##########
@@ -0,0 +1,100 @@
+---
+title: Speculative Execution
+weight: 5
+type: docs
+
+---
+<!--
+Licensed to the Apache Software Foundation (ASF) under one
+or more contributor license agreements.  See the NOTICE file
+distributed with this work for additional information
+regarding copyright ownership.  The ASF licenses this file
+to you under the Apache License, Version 2.0 (the
+"License"); you may not use this file except in compliance
+with the License.  You may obtain a copy of the License at
+
+  http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing,
+software distributed under the License is distributed on an
+"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+KIND, either express or implied.  See the License for the
+specific language governing permissions and limitations
+under the License.
+-->
+
+# Speculative Execution
+Apache Flink supports speculative execution of batch jobs.
+
+This page describes the background of speculative execution and how to use it.
+
+## Background
+Speculative execution is a mechanism to mitigate job slowness which is caused by problematic nodes.
+A problematic node may have hardware problems, accident I/O busy, or high CPU load. These problems may
+make the hosted tasks run much slower than tasks on other nodes, and affects the overall execution time
+of a batch job.
+
+In such cases, speculative execution will start a new attempt of the slow task on nodes that are not
+detected as problematic/slow. The new attempt processes the same input data and produces the same data
+as the old one. The old attempt will not be affected and will keep running. The first finished attempt
+will be admitted, its output will be seen and consumed by the downstream tasks, and the remaining attempts
+will be canceled.
+
+To achieve this, Flink uses the slow task detector to detect slow tasks. The nodes that the slow tasks
+locate in will be identified as problematic nodes and get blocked via the blocklist handler. The scheduler
+will create new attempts for the slow tasks and deploy them on nodes that are not blocked.
+
+## Usage
+This section describes how to use speculative execution, including how to enable it, how to tuning it, and
+how to develop/improve custom sources to work with speculative execution.
+
+{{< hint warning >}}
+Note: Flink does not support speculative execution of sinks yet and will support it in follow-up releases.
+{{< /hint >}}
+
+{{< hint warning >}}
+Note: Flink does not support speculative execution of DataSet jobs because DataSet will be deprecated
+in near future. DataStream API is now the recommended low level API to develop Flink batch jobs.
+{{< /hint >}}
+
+### Enable Speculative Execution
+To enable speculative execution, you need to set the following configuration options:
+- `jobmanager.scheduler: AdaptiveBatch`
+    - Because only [Adaptive Batch Scheduler]({{< ref "docs/deployment/elastic_scaling" >}}#adaptive-batch-scheduler) supports speculative execution.

Review Comment:
   As I understand it, "speculative execution works along with adaptive batch
scheduling" says much the same thing as "only Adaptive Batch Scheduler supports
speculative execution". It is simply a fact, so we do not need to explain the
reason in this user doc (we can do that in a blog post that shares the design
of speculative execution).
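
   For reference, a minimal sketch of the two options that the quoted "Enable
Speculative Execution" section lists, assuming they are set in the cluster's
`flink-conf.yaml` (the file name is an assumption; the keys and values are the
ones from the doc under review):

   ```yaml
   # Speculative execution is only supported by the Adaptive Batch Scheduler.
   jobmanager.scheduler: AdaptiveBatch
   # Turn speculative execution on for batch jobs.
   jobmanager.adaptive-batch-scheduler.speculative.enabled: true
   ```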



##########
docs/content/docs/deployment/speculative_execution.md:
##########
@@ -0,0 +1,100 @@
+---
+title: Speculative Execution
+weight: 5
+type: docs
+
+---
+<!--
+Licensed to the Apache Software Foundation (ASF) under one
+or more contributor license agreements.  See the NOTICE file
+distributed with this work for additional information
+regarding copyright ownership.  The ASF licenses this file
+to you under the Apache License, Version 2.0 (the
+"License"); you may not use this file except in compliance
+with the License.  You may obtain a copy of the License at
+
+  http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing,
+software distributed under the License is distributed on an
+"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+KIND, either express or implied.  See the License for the
+specific language governing permissions and limitations
+under the License.
+-->
+
+# Speculative Execution
+Apache Flink supports speculative execution of batch jobs.
+
+This page describes the background of speculative execution and how to use it.
+
+## Background
+Speculative execution is a mechanism to mitigate job slowness which is caused by problematic nodes.
+A problematic node may have hardware problems, accident I/O busy, or high CPU load. These problems may
+make the hosted tasks run much slower than tasks on other nodes, and affects the overall execution time
+of a batch job.
+
+In such cases, speculative execution will start a new attempt of the slow task on nodes that are not
+detected as problematic/slow. The new attempt processes the same input data and produces the same data
+as the old one. The old attempt will not be affected and will keep running. The first finished attempt
+will be admitted, its output will be seen and consumed by the downstream tasks, and the remaining attempts
+will be canceled.
+
+To achieve this, Flink uses the slow task detector to detect slow tasks. The nodes that the slow tasks
+locate in will be identified as problematic nodes and get blocked via the blocklist handler. The scheduler

Review Comment:
   I think the correct phrasing is either "that the slow tasks locate in" or
"where the slow tasks locate".



##########
docs/content/docs/deployment/speculative_execution.md:
##########
@@ -0,0 +1,100 @@
+---
+title: Speculative Execution
+weight: 5
+type: docs
+
+---
+<!--
+Licensed to the Apache Software Foundation (ASF) under one
+or more contributor license agreements.  See the NOTICE file
+distributed with this work for additional information
+regarding copyright ownership.  The ASF licenses this file
+to you under the Apache License, Version 2.0 (the
+"License"); you may not use this file except in compliance
+with the License.  You may obtain a copy of the License at
+
+  http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing,
+software distributed under the License is distributed on an
+"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+KIND, either express or implied.  See the License for the
+specific language governing permissions and limitations
+under the License.
+-->
+
+# Speculative Execution
+Apache Flink supports speculative execution of batch jobs.
+
+This page describes the background of speculative execution and how to use it.
+
+## Background
+Speculative execution is a mechanism to mitigate job slowness which is caused by problematic nodes.
+A problematic node may have hardware problems, accident I/O busy, or high CPU load. These problems may
+make the hosted tasks run much slower than tasks on other nodes, and affects the overall execution time
+of a batch job.
+
+In such cases, speculative execution will start a new attempt of the slow task on nodes that are not
+detected as problematic/slow. The new attempt processes the same input data and produces the same data
+as the old one. The old attempt will not be affected and will keep running. The first finished attempt
+will be admitted, its output will be seen and consumed by the downstream tasks, and the remaining attempts
+will be canceled.
+
+To achieve this, Flink uses the slow task detector to detect slow tasks. The nodes that the slow tasks
+locate in will be identified as problematic nodes and get blocked via the blocklist handler. The scheduler
+will create new attempts for the slow tasks and deploy them on nodes that are not blocked.
+
+## Usage
+This section describes how to use speculative execution, including how to enable it, how to tuning it, and
+how to develop/improve custom sources to work with speculative execution.
+
+{{< hint warning >}}
+Note: Flink does not support speculative execution of sinks yet and will support it in follow-up releases.
+{{< /hint >}}
+
+{{< hint warning >}}
+Note: Flink does not support speculative execution of DataSet jobs because DataSet will be deprecated
+in near future. DataStream API is now the recommended low level API to develop Flink batch jobs.
+{{< /hint >}}
+
+### Enable Speculative Execution
+To enable speculative execution, you need to set the following configuration options:
+- `jobmanager.scheduler: AdaptiveBatch`
+    - Because only [Adaptive Batch Scheduler]({{< ref "docs/deployment/elastic_scaling" >}}#adaptive-batch-scheduler) supports speculative execution.
+- `jobmanager.adaptive-batch-scheduler.speculative.enabled: true`
+
+### Tuning Configuration
+To make speculative execution work better for different jobs, you can tune below configuration options of the scheduler:
+- [`jobmanager.adaptive-batch-scheduler.speculative.max-concurrent-executions`]({{< ref "docs/deployment/config" >}}#jobmanager-adaptive-batch-scheduler-speculative-max-concurrent-e):
+Controls the maximum number of execution attempts of each operator that can execute concurrently, including the original one and speculative ones.
+- [`jobmanager.adaptive-batch-scheduler.speculative.block-slow-node-duration`]({{< ref "docs/deployment/config" >}}#jobmanager-adaptive-batch-scheduler-speculative-block-slow-node):
+Controls how long an detected slow node should be blocked for.
+
+You can also need to tune below configuration options to control the slow task detector:

Review Comment:
   Changed to `You can also tune`.
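
   As a rough illustration of the two scheduler options quoted in the "Tuning
Configuration" section above, a `flink-conf.yaml` fragment might look like the
sketch below. The option keys come from the doc under review; the values are
illustrative assumptions only, not recommendations.

   ```yaml
   # Upper bound on concurrently running attempts per task,
   # counting the original attempt and the speculative ones (illustrative value).
   jobmanager.adaptive-batch-scheduler.speculative.max-concurrent-executions: 2
   # How long a node detected as slow stays blocked (illustrative value).
   jobmanager.adaptive-batch-scheduler.speculative.block-slow-node-duration: 1 min
   ```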



##########
docs/content/docs/deployment/speculative_execution.md:
##########
@@ -0,0 +1,100 @@
+---
+title: Speculative Execution
+weight: 5
+type: docs
+
+---
+<!--
+Licensed to the Apache Software Foundation (ASF) under one
+or more contributor license agreements.  See the NOTICE file
+distributed with this work for additional information
+regarding copyright ownership.  The ASF licenses this file
+to you under the Apache License, Version 2.0 (the
+"License"); you may not use this file except in compliance
+with the License.  You may obtain a copy of the License at
+
+  http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing,
+software distributed under the License is distributed on an
+"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+KIND, either express or implied.  See the License for the
+specific language governing permissions and limitations
+under the License.
+-->
+
+# Speculative Execution
+Apache Flink supports speculative execution of batch jobs.
+
+This page describes the background of speculative execution and how to use it.
+
+## Background
+Speculative execution is a mechanism to mitigate job slowness which is caused by problematic nodes.
+A problematic node may have hardware problems, accident I/O busy, or high CPU load. These problems may
+make the hosted tasks run much slower than tasks on other nodes, and affects the overall execution time
+of a batch job.
+
+In such cases, speculative execution will start a new attempt of the slow task on nodes that are not
+detected as problematic/slow. The new attempt processes the same input data and produces the same data
+as the old one. The old attempt will not be affected and will keep running. The first finished attempt
+will be admitted, its output will be seen and consumed by the downstream tasks, and the remaining attempts
+will be canceled.
+
+To achieve this, Flink uses the slow task detector to detect slow tasks. The nodes that the slow tasks
+locate in will be identified as problematic nodes and get blocked via the blocklist handler. The scheduler

Review Comment:
   > blocklist handler -> blocklist mechanism
   
   That is fine with me; I do not have a strong preference. But I feel it is
better not to link a FLIP in this doc, which targets application users, as long
as the term is easy to understand. If it is not easy to understand, it is
better to add a section to this doc that explains what it is.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
