On Tue, Apr 2, 2019 at 9:39 PM Ankur Gupta wrote:
> Hi Steve,
>
> Thanks for your feedback. From your email, I could gather the following
> two important points:
>
> 1. Report failures to something (cluster manager) which can opt to
> destroy the node and request a new one
> 2. Pluggable failure detection algorithms
Hi Steve,
Thanks for your feedback. From your email, I could gather the following two
important points:
1. Report failures to something (cluster manager) which can opt to
destroy the node and request a new one
2. Pluggable failure detection algorithms
Regarding #1, current blacklisting
On Fri, Mar 29, 2019 at 6:18 PM Reynold Xin wrote:
> We tried enabling blacklisting for some customers and in the cloud, very
> quickly they end up having 0 executors due to various transient errors. So
> unfortunately I think the current implementation is terrible for cloud
> deployments, and shouldn't be on by default.
Thanks for your thoughts, Chris! Please find my responses below:
- Rather than a fixed timeout, could we do some sort of exponential
backoff? Start with a 10 or 20 second blacklist and increase from there?
The nodes with catastrophic errors should quickly hit long blacklist
intervals.
- +1 I like th
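The exponential-backoff idea above could be sketched roughly as follows. This is a toy illustration only: the function and constant names are made up and are not Spark APIs; the 10-second base and 1-hour cap are just the numbers mentioned in the thread.

```python
# Toy sketch of exponential blacklist backoff (NOT Spark code).
# Each consecutive failure doubles a node's blacklist interval, capped at a
# maximum, so nodes with catastrophic errors quickly reach long intervals
# while a one-off transient failure only costs ~10 seconds.

BASE_SECONDS = 10    # initial blacklist interval (per the suggestion above)
MAX_SECONDS = 3600   # cap, e.g. the old 1-hour default timeout

def blacklist_interval(consecutive_failures: int) -> int:
    """Return the blacklist duration (seconds) after N consecutive failures."""
    return min(BASE_SECONDS * 2 ** (consecutive_failures - 1), MAX_SECONDS)
```

Under this scheme a node needs roughly nine consecutive failures before it hits the 1-hour cap, while healthy nodes shed their blacklist status quickly.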
Hey Ankur,
I think the significant decrease in "spark.blacklist.timeout" (1 hr down to
5 minutes) in your updated suggestion is the key here.
Looking at a few *successful* runs of the application I was debugging, here
are the error rates when I did *not* have blacklisting enabled:
Run A: 8 execu
Hi Chris,
Thanks for sending over the example. As far as I can understand, it seems
that this would not have been a problem if
"spark.blacklist.application.maxFailedTasksPerExecutor" was set to a higher
threshold, as mentioned in my previous email.
Though, with 8/7 executors and 2 failedTasksPerE
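To make the arithmetic being discussed concrete, here is a toy model of per-executor failure counting. It is illustrative only: the real logic lives in Spark's scheduler, and these function names are invented. It shows how, with a threshold of 2 failed tasks per executor, two transient failures on each of 8 executors are enough to blacklist every executor for the application.

```python
# Illustrative only: a toy model of application-level executor blacklisting.
# The names here are made up; only the threshold mirrors the config discussed
# above ("spark.blacklist.application.maxFailedTasksPerExecutor").

from collections import Counter

MAX_FAILED_TASKS_PER_EXECUTOR = 2

def blacklisted_executors(failed_task_events):
    """failed_task_events: iterable of executor ids, one entry per failed task.

    Returns the set of executors whose failure count reached the threshold.
    """
    counts = Counter(failed_task_events)
    return {e for e, n in counts.items()
            if n >= MAX_FAILED_TASKS_PER_EXECUTOR}

# 8 executors, each hitting just 2 transient task failures:
events = [executor for executor in range(8) for _ in range(2)]
assert len(blacklisted_executors(events)) == 8  # all 8 blacklisted
```

With a low threshold, scattered transient errors can leave the application with zero usable executors, which is the failure mode described elsewhere in this thread for cloud deployments; raising the threshold makes this much harder to trigger.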
Thanks Reynold! That is certainly useful to know.
@Chris Would it be possible for you to send out those details if you still
have them, or better yet, create a JIRA so that someone can work on those
improvements? If there is already a JIRA, could you please provide a link to
it?
Additionally, if the con
We tried enabling blacklisting for some customers and in the cloud, very
quickly they end up having 0 executors due to various transient errors. So
unfortunately I think the current implementation is terrible for cloud
deployments, and shouldn't be on by default. The heart of the issue is that t
Hi all,
This is a follow-on to my PR: https://github.com/apache/spark/pull/24208,
where I aimed to enable blacklisting for fetch failure by default. From the
comments, there is interest in the community in enabling the overall
blacklisting feature by default. I have listed three different things that w