Hi Steve,

Thanks for your feedback. From your email, I could gather the following two
important points:
1. Report failures to something (cluster manager) which can opt to destroy
   the node and request a new one
2. Pluggable failure detection algorithms

Regarding #1, the current blacklisting implementation does report blacklist
status to YARN here
<https://github.com/apache/spark/blob/master/resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/YarnAllocatorBlacklistTracker.scala#L126>,
and YARN can choose to take appropriate action based on failures across
different applications (though it seems it doesn't currently). This doesn't
work with static allocation, though, or with other cluster managers. Those
issues are still open:

- https://issues.apache.org/jira/browse/SPARK-24016
- https://issues.apache.org/jira/browse/SPARK-19755
- https://issues.apache.org/jira/browse/SPARK-23485

Regarding #2, that is a good point, but I think it is optional and need not
be tied to enabling the blacklisting feature in its current form. (To make
the idea concrete, I have put a rough sketch of what a pluggable failure
detector could look like at the bottom of this mail, below Steve's message.)

Coming back to the concerns raised by Reynold, Chris and Steve, it seems
there are a few tasks we need to complete before we decide to enable
blacklisting by default in its current form:

1. Avoid resource starvation because of blacklisting
2. Use exponential backoff for blacklisting instead of a configurable
   threshold (rough sketch at the bottom of this mail)
3. Report blacklisting status to all cluster managers (I am not sure this
   is necessary to move forward, though)

Thanks for all the feedback. Please let me know if there are other concerns
we would like to resolve before enabling blacklisting.

Thanks,
Ankur

On Tue, Apr 2, 2019 at 2:45 AM Steve Loughran <ste...@cloudera.com.invalid>
wrote:

>
> On Fri, Mar 29, 2019 at 6:18 PM Reynold Xin <r...@databricks.com> wrote:
>
>> We tried enabling blacklisting for some customers and in the cloud, very
>> quickly they end up having 0 executors due to various transient errors. So
>> unfortunately I think the current implementation is terrible for cloud
>> deployments, and shouldn't be on by default. The heart of the issue is that
>> the current implementation is not great at dealing with transient errors vs
>> catastrophic errors.
>
> +1.
>
> It contains the assumption that "blacklisting is the solution", when
> really "reporting to something which can opt to destroy the node and
> request a new one" is better.
>
> Having some way to report those failures to a monitor like that can
> combine app-level failure detection with cloud-infra reaction.
>
> It's also interesting to look at more complex failure evaluators, where
> the Φ Accrual Failure Detector is an interesting option:
>
> http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.80.7427&rep=rep1&type=pdf
>
> Apparently you can use this with Akka:
> https://manuel.bernhardt.io/2017/07/26/a-new-adaptive-accrual-failure-detector-for-akka/
>
> Again, making this something where people can experiment with algorithms
> is a nice way to let interested parties explore the options in different
> environments.
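For reference, the fixed thresholds we are talking about replacing are the
spark.blacklist.* settings. From memory (please double-check against the
current docs), the main knobs look roughly like this:

    spark.blacklist.enabled                                true
    spark.blacklist.timeout                                1h
    spark.blacklist.task.maxTaskAttemptsPerExecutor        1
    spark.blacklist.task.maxTaskAttemptsPerNode            2
    spark.blacklist.stage.maxFailedTasksPerExecutor        2
    spark.blacklist.application.maxFailedTasksPerExecutor  2
    spark.blacklist.killBlacklistedExecutors               false

Once an executor or node crosses one of those counts it stays blacklisted
for the full spark.blacklist.timeout, which is exactly the behaviour that
hurts in the transient-error case Reynold described.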
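To illustrate what I mean by exponential backoff (task #2 in my list above):
nothing below corresponds to existing Spark classes or configs, it is just a
rough sketch of the idea written against made-up names:

    import scala.collection.mutable

    // Hypothetical sketch: each consecutive failure doubles the blacklist
    // period (capped), and a success clears the history, so transient
    // errors stop penalising a node once it recovers.
    class BackoffBlacklist(baseTimeoutMs: Long = 60 * 1000L,
                           maxTimeoutMs: Long = 60 * 60 * 1000L) {

      private case class Entry(failures: Int, expiresAtMs: Long)
      private val nodes = mutable.Map.empty[String, Entry]

      def recordFailure(node: String, nowMs: Long): Unit = {
        val failures = nodes.get(node).map(_.failures).getOrElse(0) + 1
        // 1st failure -> baseTimeoutMs, 2nd -> 2x, 3rd -> 4x, ... up to the cap
        val timeout = math.min(baseTimeoutMs << math.min(failures - 1, 30), maxTimeoutMs)
        nodes(node) = Entry(failures, nowMs + timeout)
      }

      def recordSuccess(node: String): Unit = nodes.remove(node)

      def isBlacklisted(node: String, nowMs: Long): Boolean =
        nodes.get(node).exists(_.expiresAtMs > nowMs)
    }

The point is that a single flaky task attempt only takes a node out for a
minute or so, while a node that keeps failing converges towards something
like the current one-hour timeout, without anyone having to tune a
per-workload threshold.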
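And on Steve's point about pluggable failure detection, here is a minimal
sketch of what such an interface, plus a (very simplified) phi-accrual style
implementation, could look like. Again, none of these names exist in Spark,
and the real detector from the paper / the Akka post models inter-arrival
times with a normal distribution rather than the exponential approximation
used here:

    import scala.collection.mutable

    // Hypothetical pluggable interface: the scheduler feeds in heartbeats /
    // task results and asks the detector whether an executor looks unhealthy.
    trait FailureDetector {
      def heartbeat(executorId: String, timestampMs: Long): Unit
      def isSuspect(executorId: String, nowMs: Long): Boolean
    }

    // Phi-accrual style detector (simplified): instead of a boolean
    // "too many failures", it computes a suspicion level phi from how late
    // the current heartbeat is relative to the history of past intervals.
    class PhiAccrualFailureDetector(threshold: Double = 8.0, window: Int = 100)
        extends FailureDetector {

      private val intervals = mutable.Map.empty[String, mutable.Queue[Long]]
      private val lastSeen  = mutable.Map.empty[String, Long]

      override def heartbeat(executorId: String, timestampMs: Long): Unit = {
        lastSeen.get(executorId).foreach { prev =>
          val q = intervals.getOrElseUpdate(executorId, mutable.Queue.empty[Long])
          q.enqueue(timestampMs - prev)
          if (q.size > window) q.dequeue()
        }
        lastSeen(executorId) = timestampMs
      }

      // phi = -log10(P(the heartbeat is merely late)), approximated here
      // with an exponential distribution over inter-arrival times for brevity.
      override def isSuspect(executorId: String, nowMs: Long): Boolean = {
        val phi = for {
          last <- lastSeen.get(executorId)
          q    <- intervals.get(executorId) if q.nonEmpty
        } yield {
          val meanInterval = q.sum.toDouble / q.size
          val elapsed = (nowMs - last).toDouble
          -math.log10(math.exp(-elapsed / meanInterval))
        }
        phi.exists(_ > threshold)
      }
    }

If the detector sat behind an interface like this, the default could stay
conservative while interested parties experiment with accrual-style
detectors in the environments Steve mentions.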