On Fri, 5 Jul 2024 at 01:44, Arun Ravi <arunrav...@gmail.com> wrote:

> Hi Rajesh,
>
> We use it in production at scale. We run Spark on Kubernetes on AWS, and
> here are the key things that we do:
> 1) we run the driver on an on-demand node
> 2) we have configured decommissioning along with the fallback option to
> S3; try the latest single-zone S3 for this.
>
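For anyone else trying this, the decommission-plus-fallback setup in (2) maps
roughly to the configs below on Spark 3.1+ (the bucket path is only a
placeholder, so adjust for your environment):

spark.decommission.enabled=true
spark.storage.decommission.enabled=true
spark.storage.decommission.rddBlocks.enabled=true
spark.storage.decommission.shuffleBlocks.enabled=true
spark.storage.decommission.fallbackStorage.path=s3a://your-bucket/spark-fallback/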

You mean S3 Express? Through the s3a connector?

If so, I'd love to know your performance statistics, which you can get
printed when a process exits by setting:

fs.iostatistics.logging.level = info
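(If you are setting that from Spark rather than core-site.xml, pass it
through as a hadoop option, i.e. spark.hadoop.fs.iostatistics.logging.level=info.)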

If you are actually reading data from it, upgrade to Parquet 1.14.1 and turn
on vectored IO; it really takes advantage of the low latency and improved
bandwidth.
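As far as I recall the switch for that in parquet-mr 1.14 is a hadoop conf
key, so from Spark something like:

spark.hadoop.parquet.hadoop.vectored.io.enabled=true

Treat that key as a pointer rather than gospel and check the parquet release
notes for your build.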

> 3) We use PVC-aware scheduling, i.e. Spark ensures executors try to reuse
> available storage volumes created by the driver before requesting a new
> one.
> 4) we have enabled the Kubernetes shuffle IO wrapper plugin; this allows
> new executors to re-register shuffle blocks that they find in the reused
> PVC. This feature ensures shuffle data from lost executors is served by
> the new executor that reuses the disk.
> 5) we also configure Spark to retain decommissioned executor details so
> that it can ignore intermittent shuffle fetch failures.
>
> Some of these are best effort; you could also tune the number of threads
> used for decommissioning etc. based on your workload and runtime
> environment.
>
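For anyone else following along, points (3), (4) and (5) map roughly onto the
configs below (names are from the Spark 3.4/3.5 docs, so double-check against
your release; the storage class, size, mount path and retention count are just
placeholders, and "OnDemand" is the literal value that tells Spark to create a
PVC per executor):

# (3) driver-owned, reusable PVCs mounted as executor local storage
spark.kubernetes.driver.ownPersistentVolumeClaim=true
spark.kubernetes.driver.reusePersistentVolumeClaim=true
spark.kubernetes.driver.waitToReusePersistentVolumeClaim=true
spark.kubernetes.executor.volumes.persistentVolumeClaim.spark-local-dir-1.options.claimName=OnDemand
spark.kubernetes.executor.volumes.persistentVolumeClaim.spark-local-dir-1.options.storageClass=gp3
spark.kubernetes.executor.volumes.persistentVolumeClaim.spark-local-dir-1.options.sizeLimit=100Gi
spark.kubernetes.executor.volumes.persistentVolumeClaim.spark-local-dir-1.mount.path=/data
spark.kubernetes.executor.volumes.persistentVolumeClaim.spark-local-dir-1.mount.readOnly=false

# (4) let a new executor re-register shuffle blocks it finds on a reused PVC
spark.shuffle.sort.io.plugin.class=org.apache.spark.shuffle.KubernetesLocalDiskShuffleDataIO

# (5) remember decommissioned executors and tolerate their fetch failures
spark.stage.ignoreDecommissionFetchFailure=true
spark.scheduler.maxRetainedRemovedDecommissionExecutors=20

# thread count used when migrating shuffle blocks off a decommissioning executor
spark.storage.decommission.shuffleBlocks.maxThreads=8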
> On Thu, 27 Jun 2024, 09:03 Rajesh Mahindra, <rjshmh...@gmail.com> wrote:
>
>> Hi folks,
>>
>> I am planning to leverage the "Spark Decommission" feature in production
>> since our company uses spot instances on Kubernetes. I wanted to get a
>> sense of how stable the feature is for production usage, and whether
>> anyone has thoughts on trying it out in production, especially in a
>> Kubernetes environment.
>>
>> Thanks,
>> Rajesh
>>
>>
