Hi Rajesh,

We use it in production at scale. We run Spark on Kubernetes on AWS, and here are the key things we do (a rough sketch of the corresponding settings follows the list):

1) We run the driver on an on-demand node.
2) We have decommissioning configured along with the fallback option to S3; try the newer single-zone S3 offering (S3 Express One Zone) for this.
3) We use PVC-aware scheduling, i.e. Spark ensures executors try to reuse available storage volumes created by the driver before requesting new ones.
4) We have enabled the Kubernetes shuffle IO wrapper plugin, which allows a new executor to re-register the shuffle blocks it finds on a reused PVC. This ensures shuffle data from lost executors is served by the new executor that reuses the disk.
5) We also configure Spark to retain decommissioned executor details so that it can ignore intermittent shuffle fetch failures.
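To make that concrete, here is a rough sketch of the kind of settings involved, assuming Spark 3.4+ with the master, image, etc. supplied by spark-submit. The bucket path, volume name, size and the capacityType node label (which is specific to EKS managed node groups) are placeholders, so please verify property names and defaults against your Spark version and cluster setup:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("decommission-sketch")

  // 1) Pin the driver to an on-demand node (executors can stay on spot).
  .config("spark.kubernetes.driver.node.selector.eks.amazonaws.com/capacityType", "ON_DEMAND")

  // 2) Graceful decommissioning with shuffle/RDD block migration and an
  //    S3 fallback for blocks that cannot be migrated to peers in time.
  .config("spark.decommission.enabled", "true")
  .config("spark.storage.decommission.enabled", "true")
  .config("spark.storage.decommission.shuffleBlocks.enabled", "true")
  .config("spark.storage.decommission.rddBlocks.enabled", "true")
  .config("spark.storage.decommission.fallbackStorage.path", "s3a://my-bucket/spark-fallback/")

  // 3) Executors get on-demand PVCs used as local storage, and the driver
  //    owns and reuses PVCs left behind by lost executors.
  .config("spark.kubernetes.executor.volumes.persistentVolumeClaim.spark-local-dir-1.options.claimName", "OnDemand")
  .config("spark.kubernetes.executor.volumes.persistentVolumeClaim.spark-local-dir-1.options.sizeLimit", "100Gi")
  .config("spark.kubernetes.executor.volumes.persistentVolumeClaim.spark-local-dir-1.mount.path", "/data")
  .config("spark.kubernetes.executor.volumes.persistentVolumeClaim.spark-local-dir-1.mount.readOnly", "false")
  .config("spark.kubernetes.driver.ownPersistentVolumeClaim", "true")
  .config("spark.kubernetes.driver.reusePersistentVolumeClaim", "true")
  .config("spark.kubernetes.driver.waitToReusePersistentVolumeClaim", "true")

  // 4) Shuffle IO plugin that re-registers shuffle blocks found on a reused PVC.
  .config("spark.shuffle.sort.io.plugin.class", "org.apache.spark.shuffle.KubernetesLocalDiskShuffleDataIO")

  // 5) Remember removed decommissioned executors and tolerate fetch failures from them.
  .config("spark.stage.ignoreDecommissionFetchFailure", "true")
  .config("spark.scheduler.maxRetainedRemovedDecommissionExecutors", "20")

  .getOrCreate()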
Some of these are best effort. You could also tune the number of threads used for decommissioning, etc., based on your workload and runtime environment (a couple of those knobs are sketched at the end of this mail).

On Thu, 27 Jun 2024, 09:03 Rajesh Mahindra, <rjshmh...@gmail.com> wrote:

> Hi folks,
>
> I am planning to leverage the "Spark Decommission" feature in production
> since our company uses SPOT instances on Kubernetes. I wanted to get a
> sense of how stable the feature is for production usage and if anyone has
> thoughts around trying it out in production, especially in a Kubernetes
> environment.
>
> Thanks,
> Rajesh
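PS: the tuning knobs I had in mind above, as additions to the same builder sketch; the values are placeholders rather than recommendations, so check the defaults for your Spark version:

  // Threads used to migrate shuffle blocks off a decommissioning executor,
  // and how long Spark waits before forcing the decommissioning executor to exit.
  .config("spark.storage.decommission.shuffleBlocks.maxThreads", "16")
  .config("spark.executor.decommission.forceKillTimeout", "120s")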