Hi Steve,

Thanks for your feedback. From your email, I could gather the following two
important points:

   1. Report failures to something (cluster manager) which can opt to
   destroy the node and request a new one
   2. Pluggable failure detection algorithms

Regarding #1, the current blacklisting implementation does report blacklist
status to YARN here
<https://github.com/apache/spark/blob/master/resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/YarnAllocatorBlacklistTracker.scala#L126>,
which can choose to take appropriate action based on failures across
different applications (though it seems it doesn't currently). However, this
does not work with static allocation or with other cluster managers. Those
issues are still open:

   - https://issues.apache.org/jira/browse/SPARK-24016
   - https://issues.apache.org/jira/browse/SPARK-19755
   - https://issues.apache.org/jira/browse/SPARK-23485

Regarding #2, that is a good point, but I think it is optional and need not
be tied to enabling the blacklisting feature in its current form.

Coming back to the concerns raised by Reynold, Chris and Steve, it seems
that there are at least two tasks, and possibly a third, that we need to
complete before we decide to enable blacklisting by default in its current
form:

   1. Avoid resource starvation because of blacklisting
   2. Use exponential backoff for blacklisting instead of a configurable
   threshold
   3. Report blacklisting status to all cluster managers (I am not sure if
   this is necessary to move forward though)
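
To make item 2 more concrete, here is a minimal sketch of what exponential
backoff could look like. It is not based on the existing Spark code: the class
name, method names and default timeouts are all hypothetical, and it ignores
the starvation concern in item 1. The idea is only that each repeated failure
doubles the exclusion window up to a cap, and a success resets it, so transient
errors cost little while persistent ones keep the executor out for longer.

```python
from dataclasses import dataclass, field


@dataclass
class BackoffBlacklist:
    """Hypothetical tracker; names and defaults are illustrative, not Spark's."""

    base_timeout_s: float = 60.0    # exclusion window after the first failure
    max_timeout_s: float = 3600.0   # upper bound on the exclusion window
    failures: dict = field(default_factory=dict)        # executor id -> consecutive failures
    excluded_until: dict = field(default_factory=dict)  # executor id -> expiry timestamp

    def record_failure(self, executor_id: str, now: float) -> float:
        """Register a failure and return the exclusion window that was applied."""
        count = self.failures.get(executor_id, 0) + 1
        self.failures[executor_id] = count
        # Double the window on every consecutive failure, capped at max_timeout_s.
        timeout = min(self.base_timeout_s * 2 ** (count - 1), self.max_timeout_s)
        self.excluded_until[executor_id] = now + timeout
        return timeout

    def record_success(self, executor_id: str) -> None:
        """A successful task clears the failure history for that executor."""
        self.failures.pop(executor_id, None)
        self.excluded_until.pop(executor_id, None)

    def is_blacklisted(self, executor_id: str, now: float) -> bool:
        return now < self.excluded_until.get(executor_id, 0.0)
```

With a base of 10 seconds and a cap of 40 seconds, consecutive failures on the
same executor would exclude it for 10, 20, 40 and then 40 seconds.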

Thanks for all the feedback. Please let me know if there are other concerns
that we would like to resolve before enabling blacklisting.

Thanks,
Ankur

On Tue, Apr 2, 2019 at 2:45 AM Steve Loughran <ste...@cloudera.com.invalid>
wrote:

>
>
> On Fri, Mar 29, 2019 at 6:18 PM Reynold Xin <r...@databricks.com> wrote:
>
>> We tried enabling blacklisting for some customers and in the cloud, very
>> quickly they end up having 0 executors due to various transient errors. So
>> unfortunately I think the current implementation is terrible for cloud
>> deployments, and shouldn't be on by default. The heart of the issue is that
>> the current implementation is not great at dealing with transient errors vs
>> catastrophic errors.
>>
>
> +1.
>
> It contains the assumption that "Blacklisting is the solution", when
> really "reporting to something which can opt to destroy the node and
> request a new one" is better
>
> Having some way to report those failures to a monitor like that can
> combine app-level failure detection with cloud-infra reaction.
>
> It's also interesting to look at more complex failure evaluators, where
> the Φ Accrual Failure Detector is an interesting option
>
>
> http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.80.7427&rep=rep1&type=pdf
>
> Apparently you can use this with Akka:
> https://manuel.bernhardt.io/2017/07/26/a-new-adaptive-accrual-failure-detector-for-akka/
>
> again, making this something where people can experiment with algorithms
> is a nice way to let interested parties explore the options in different
> environments
>
>
