Hi all,

This is a follow-on to my PR https://github.com/apache/spark/pull/24208, where I aimed to enable blacklisting for fetch failures by default. From the comments, there is interest in the community in enabling the overall blacklisting feature by default. I have listed three things we could do below and would like to gather feedback and see whether anyone has objections. Otherwise, I will just create a PR for the same.
1. *Enable the blacklisting feature by default*. The blacklisting feature was added as part of SPARK-8425 and has been available since 2.2.0. It was deemed experimental and is disabled by default. Based on the number of failures, the feature blacklists an executor/node from running a particular task, any task in a particular stage, or all tasks in the application. Various configurations control those thresholds, and an executor/node is only blacklisted for a configurable time period. The idea is to enable the feature with its existing defaults, which are the following:
   1. spark.blacklist.task.maxTaskAttemptsPerExecutor = 1
   2. spark.blacklist.task.maxTaskAttemptsPerNode = 2
   3. spark.blacklist.stage.maxFailedTasksPerExecutor = 2
   4. spark.blacklist.stage.maxFailedExecutorsPerNode = 2
   5. spark.blacklist.application.maxFailedTasksPerExecutor = 2
   6. spark.blacklist.application.maxFailedExecutorsPerNode = 2
   7. spark.blacklist.timeout = 1 hour
2. *Kill blacklisted executors/nodes by default*. This feature was added as part of SPARK-16554 and has been available since 2.2.0. It is a follow-on to blacklisting: if an executor/node is blacklisted for the application, all running tasks on that executor are also terminated for faster failure recovery. (A snippet at the end of this mail shows how a user opts into (1) and (2) today.)
3. *Remove the legacy blacklisting timeout config*: spark.scheduler.executorTaskBlacklistTime.

Thanks,
Ankur
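P.S. For concreteness, here is a minimal sketch of what a user currently has to set to opt in, using the existing config keys; the proposal would simply flip (1) and (2) on by default while keeping the threshold/timeout defaults listed above. The object name is just for illustration.

    import org.apache.spark.SparkConf

    object BlacklistOptInExample {
      def main(args: Array[String]): Unit = {
        // Both of these are currently false; the proposal is to make them true by default.
        val conf = new SparkConf()
          .set("spark.blacklist.enabled", "true")                  // (1) enable blacklisting
          .set("spark.blacklist.killBlacklistedExecutors", "true") // (2) kill blacklisted executors/nodes
        // Print the resulting configuration for inspection.
        println(conf.toDebugString)
      }
    }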