Re: [DISCUSS] Enable blacklisting feature by default in 3.0

2019-04-03 Thread Steve Loughran
On Tue, Apr 2, 2019 at 9:39 PM Ankur Gupta wrote: > Hi Steve, > > Thanks for your feedback. From your email, I could gather the following > two important points: > >1. Report failures to something (cluster manager) which can opt to >destroy the node and request a new one >2. Pluggable failure detection algorithms …

Re: [DISCUSS] Enable blacklisting feature by default in 3.0

2019-04-02 Thread Ankur Gupta
Hi Steve, Thanks for your feedback. From your email, I could gather the following two important points: 1. Report failures to something (cluster manager) which can opt to destroy the node and request a new one 2. Pluggable failure detection algorithms Regarding #1, current blacklisting …

Re: [DISCUSS] Enable blacklisting feature by default in 3.0

2019-04-02 Thread Steve Loughran
On Fri, Mar 29, 2019 at 6:18 PM Reynold Xin wrote: > We tried enabling blacklisting for some customers and in the cloud, very > quickly they end up having 0 executors due to various transient errors. So > unfortunately I think the current implementation is terrible for cloud > deployments, and shouldn't be on by default. …

Re: [DISCUSS] Enable blacklisting feature by default in 3.0

2019-04-01 Thread Ankur Gupta
Thanks for your thoughts Chris! Please find my response below: - Rather than a fixed timeout, could we do some sort of exponential backoff? Start with a 10 or 20 second blacklist and increase from there? The nodes with catastrophic errors should quickly hit long blacklist intervals. - +1 I like th…
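The exponential-backoff idea discussed above can be sketched roughly as follows. This is a hypothetical helper, not Spark's actual implementation; the function name and the `base_ms`/`cap_ms` parameters are illustrative only:

```python
def backoff_timeout_ms(failure_count, base_ms=20_000, cap_ms=3_600_000):
    """Hypothetical exponential-backoff blacklist timeout (not Spark's code).

    Starts at base_ms (e.g. a 20 s blacklist) and doubles with each further
    failure, capped at cap_ms (e.g. the 1 h timeout mentioned in the thread).
    Nodes with transient errors recover quickly, while nodes failing
    repeatedly soon hit long blacklist intervals.
    """
    exponent = max(failure_count - 1, 0)
    return min(base_ms * (2 ** exponent), cap_ms)
```

Under this sketch a node's second failure yields a 40 s blacklist, while a node that keeps failing is pinned at the cap.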

Re: [DISCUSS] Enable blacklisting feature by default in 3.0

2019-04-01 Thread Chris Stevens
Hey Ankur, I think the significant decrease in "spark.blacklist.timeout" (1 hr down to 5 minutes) in your updated suggestion is the key here. Looking at a few *successful* runs of the application I was debugging, here are the error rates when I did *not* have blacklisting enabled: Run A: 8 executors …

Re: [DISCUSS] Enable blacklisting feature by default in 3.0

2019-04-01 Thread Ankur Gupta
Hi Chris, Thanks for sending over the example. As far as I can understand, it seems that this would not have been a problem if "spark.blacklist.application.maxFailedTasksPerExecutor" was set to a higher threshold, as mentioned in my previous email. Though, with 8/7 executors and 2 failedTasksPerExecutor …

Re: [DISCUSS] Enable blacklisting feature by default in 3.0

2019-03-29 Thread Ankur Gupta
Thanks Reynold! That is certainly useful to know. @Chris Will it be possible for you to send out those details if you still have them, or better, create a JIRA so someone can work on those improvements? If there is already a JIRA, can you please provide a link to the same. Additionally, if the con…

Re: [DISCUSS] Enable blacklisting feature by default in 3.0

2019-03-29 Thread Reynold Xin
We tried enabling blacklisting for some customers and in the cloud, very quickly they end up having 0 executors due to various transient errors. So unfortunately I think the current implementation is terrible for cloud deployments, and shouldn't be on by default. The heart of the issue is that t…
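The failure mode described above (a cluster ending up with 0 executors) can be illustrated with a toy model. This is not Spark code; the helper name and data are hypothetical, modeling only the rule that an executor is application-blacklisted once its failed-task count reaches the per-executor threshold:

```python
def surviving_executors(failures_per_executor, max_failed_tasks_per_executor):
    """Toy model (not Spark code): return the executors still usable, i.e.
    those whose failed-task count is below the application-level threshold."""
    return {
        executor
        for executor, failures in failures_per_executor.items()
        if failures < max_failed_tasks_per_executor
    }

# With a low threshold, just two transient failures per node are enough
# to blacklist every executor in a hypothetical 8-executor cluster:
transient = {f"exec-{i}": 2 for i in range(8)}
```

With a threshold of 2 the toy cluster is left with no executors at all, while a higher threshold of 5 would keep all 8 alive, which is the trade-off the thread is debating.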

[DISCUSS] Enable blacklisting feature by default in 3.0

2019-03-28 Thread Ankur Gupta
Hi all, This is a follow-on to my PR: https://github.com/apache/spark/pull/24208, where I aimed to enable blacklisting for fetch failure by default. From the comments, there is interest in the community to enable the overall blacklisting feature by default. I have listed 3 different things that w…
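For reference, the Spark 2.x blacklist settings named in this thread look roughly like this when set explicitly via PySpark. The values below are purely illustrative, not the proposal's defaults; the thread is precisely about what the defaults should become:

```python
from pyspark import SparkConf

# Illustrative values only; these are the blacklist-related configs
# mentioned in the thread, with the timeout and per-executor threshold
# set to the looser values under discussion.
conf = (
    SparkConf()
    .set("spark.blacklist.enabled", "true")  # off by default in Spark 2.x
    .set("spark.blacklist.timeout", "5m")    # vs. the 1 hr value discussed
    .set("spark.blacklist.application.maxFailedTasksPerExecutor", "5")
)
```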