On Fri, Mar 29, 2019 at 6:18 PM Reynold Xin <r...@databricks.com> wrote:
> We tried enabling blacklisting for some customers and in the cloud, very
> quickly they end up having 0 executors due to various transient errors. So
> unfortunately I think the current implementation is terrible for cloud
> deployments, and shouldn't be on by default. The heart of the issue is that
> the current implementation is not great at dealing with transient errors vs
> catastrophic errors.
>

+1. It contains the assumption that "Blacklisting is the solution", when really "reporting to something which can opt to destroy the node and request a new one" is better.

Having some way to report those failures to a monitor like that can combine app-level failure detection with cloud-infra reaction.

It's also interesting to look at more complex failure evaluators, where the Φ Accrual Failure Detector is an interesting option:
http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.80.7427&rep=rep1&type=pdf

Apparently you can use this with Akka:
https://manuel.bernhardt.io/2017/07/26/a-new-adaptive-accrual-failure-detector-for-akka/

Again, making this something where people can experiment with algorithms is a nice way to let interested parties explore the options in different environments.
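
To make the accrual idea concrete, here is a minimal sketch in Python of a phi-style detector plus a reporting hook. It is not Spark or Akka code. The class name, the `report_if_suspect` helper, the `on_suspect` callback, the window size and the `min_std` floor are all hypothetical, and the sketch assumes heartbeat inter-arrival times are roughly normally distributed. The point is that the detector emits a continuous suspicion level, and whoever receives the report decides what to do about it (blacklist, replace the node, or ignore it).

```python
import math
from collections import deque


class PhiAccrualDetector:
    """Illustrative accrual detector; all names and defaults are assumptions."""

    def __init__(self, window_size=100, min_std=0.1):
        # Sliding window of recent heartbeat inter-arrival times (seconds).
        self.intervals = deque(maxlen=window_size)
        self.last_heartbeat = None
        # Floor on the standard deviation so a perfectly regular
        # heartbeat history does not produce a division by zero.
        self.min_std = min_std

    def heartbeat(self, now):
        """Record a heartbeat observed at time `now`."""
        if self.last_heartbeat is not None:
            self.intervals.append(now - self.last_heartbeat)
        self.last_heartbeat = now

    def phi(self, now):
        """Suspicion level at time `now`; higher means more likely failed."""
        if not self.intervals:
            return 0.0  # not enough history to suspect anything
        n = len(self.intervals)
        mean = sum(self.intervals) / n
        var = sum((x - mean) ** 2 for x in self.intervals) / n
        std = max(math.sqrt(var), self.min_std)
        elapsed = now - self.last_heartbeat
        # Probability that a heartbeat would still arrive later than
        # `elapsed`, under the assumed normal model of inter-arrival times.
        p_later = 0.5 * math.erfc((elapsed - mean) / (std * math.sqrt(2)))
        p_later = max(p_later, 1e-300)  # clamp to avoid log10(0)
        return -math.log10(p_later)


def report_if_suspect(detector, now, threshold, on_suspect):
    """Call the (hypothetical) `on_suspect` hook when phi crosses `threshold`.

    Returns True if a report was made. The hook, not the detector,
    decides how to react.
    """
    level = detector.phi(now)
    if level >= threshold:
        on_suspect(level)
        return True
    return False
```

With heartbeats arriving once per second, `phi` stays near zero shortly after a heartbeat and climbs quickly as the silence grows, so a transient hiccup and a dead node produce very different suspicion levels. The threshold and the reaction are then policy choices that can differ between an on-premises cluster and a cloud deployment.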