Hi all,

This is a follow-on to my PR: https://github.com/apache/spark/pull/24208,
where I aimed to enable blacklisting for fetch failures by default. From the
comments, there is interest in the community in enabling the overall
blacklisting feature by default. I have listed three things below that we
could do and would like to gather feedback and see if anyone has objections
to any of them. Otherwise, I will go ahead and create a PR for the same.

1. *Enable the blacklisting feature by default*. The blacklisting feature was
added as part of SPARK-8425 and has been available since 2.2.0. It was deemed
experimental and is disabled by default. The feature blacklists an
executor/node from running a particular task, any task in a particular
stage, or all tasks in the application, based on the number of failures.
Various configurations control those thresholds, and an executor/node is
only blacklisted for a configurable time period. The idea is to enable the
blacklisting feature with the existing defaults, which are the following
(a sketch of opting in today follows the list):

   1. spark.blacklist.task.maxTaskAttemptsPerExecutor = 1
   2. spark.blacklist.task.maxTaskAttemptsPerNode = 2
   3. spark.blacklist.stage.maxFailedTasksPerExecutor = 2
   4. spark.blacklist.stage.maxFailedExecutorsPerNode = 2
   5. spark.blacklist.application.maxFailedTasksPerExecutor = 2
   6. spark.blacklist.application.maxFailedExecutorsPerNode = 2
   7. spark.blacklist.timeout = 1 hour
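
For anyone who wants to try this out before the default changes, a minimal
sketch in Scala (assuming a standard SparkSession setup; the threshold
settings simply restate the defaults above, so they could be omitted):

    import org.apache.spark.SparkConf
    import org.apache.spark.sql.SparkSession

    // Opt in to blacklisting explicitly; the remaining settings just restate
    // the existing defaults listed above.
    val conf = new SparkConf()
      .set("spark.blacklist.enabled", "true")
      .set("spark.blacklist.task.maxTaskAttemptsPerExecutor", "1")
      .set("spark.blacklist.task.maxTaskAttemptsPerNode", "2")
      .set("spark.blacklist.stage.maxFailedTasksPerExecutor", "2")
      .set("spark.blacklist.stage.maxFailedExecutorsPerNode", "2")
      .set("spark.blacklist.application.maxFailedTasksPerExecutor", "2")
      .set("spark.blacklist.application.maxFailedExecutorsPerNode", "2")
      .set("spark.blacklist.timeout", "1h")

    val spark = SparkSession.builder().config(conf).getOrCreate()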

2. *Kill blacklisted executors/nodes by default*. This feature was added as
part of SPARK-16554 and has been available since 2.2.0. It is a follow-on to
blacklisting: once an executor/node is blacklisted for the application, the
blacklisted executor(s) are killed, which also terminates any tasks still
running on them, for faster failure recovery.
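
To opt in to the kill behaviour today, one line on top of the conf sketch
above (reusing that conf value; the key is the one added by SPARK-16554):

    // Also kill executors/nodes once they are blacklisted for the application.
    conf.set("spark.blacklist.killBlacklistedExecutors", "true")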

3. *Remove the legacy blacklisting timeout config*:
spark.scheduler.executorTaskBlacklistTime

Thanks,
Ankur
