On Fri, 5 Jul 2024 at 01:44, Arun Ravi <arunrav...@gmail.com> wrote:
> Hi Rajesh,
>
> We use it in production at scale. We run Spark on Kubernetes on the AWS
> cloud, and here are the key things that we do:
> 1) we run the driver on an on-demand node
> 2) we have configured decommissioning along with the fallback option to S3;
> try the latest single-zone S3 for this.
>
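For reference, the decommissioning-with-fallback setup described in 2) is
typically enabled with settings along these lines; this is only a minimal
sketch using the standard Spark 3.x decommission properties, and the bucket
path is illustrative:

  # let executors migrate state away before the spot node disappears
  spark.decommission.enabled=true
  spark.storage.decommission.enabled=true
  spark.storage.decommission.rddBlocks.enabled=true
  spark.storage.decommission.shuffleBlocks.enabled=true
  # blocks that cannot be migrated to a peer executor are copied here instead
  spark.storage.decommission.fallbackStorage.path=s3a://some-bucket/spark-fallback/
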
You mean S3 Express? Through the s3a connector? If so, I'd love to know your
performance statistics, which you can get printed when a process exits by
setting

  fs.iostatistics.logging.level = info

if you are actually reading data from it.

Upgrade to Parquet 1.14.1 and turn on vector IO; it really takes advantage of
the low latency and improved bandwidth.

> 3) We use PVC-aware scheduling, i.e. Spark ensures executors try to reuse
> available storage volumes created by the driver before requesting a new
> one.
> 4) we have enabled the Kubernetes shuffle IO wrapper plugin; this allows new
> executors to re-register shuffle blocks that they identify in the reused
> PVC. This feature ensures shuffles from lost executors are served by the new
> executor that reuses the disk.
> 5) we also configure Spark to retain decommissioned executor details so that
> it can ignore intermittent shuffle fetch failures.
>
> Some of these are best effort; you could also tune the number of threads
> needed for decommissioning etc. based on your workload and run environment.
>
> On Thu, 27 Jun 2024, 09:03 Rajesh Mahindra, <rjshmh...@gmail.com> wrote:
>
>> Hi folks,
>>
>> I am planning to leverage the "Spark Decommission" feature in production
>> since our company uses SPOT instances on Kubernetes. I wanted to get a
>> sense of how stable the feature is for production usage and if anyone has
>> thoughts around trying it out in production, especially in a Kubernetes
>> environment.
>>
>> Thanks,
>> Rajesh
>>
>>
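For anyone wiring up points 3) to 5) above, a minimal sketch of the
corresponding Spark-on-Kubernetes properties looks roughly like this.
Property names are as of Spark 3.4+; the claim name, storage class, size,
mount path and retention count are illustrative, not recommendations:

  # 3) PVC-aware scheduling: driver-owned, reusable on-demand volumes
  spark.kubernetes.driver.ownPersistentVolumeClaim=true
  spark.kubernetes.driver.reusePersistentVolumeClaim=true
  spark.kubernetes.driver.waitToReusePersistentVolumeClaim=true
  spark.kubernetes.executor.volumes.persistentVolumeClaim.spark-local-dir-1.options.claimName=OnDemand
  spark.kubernetes.executor.volumes.persistentVolumeClaim.spark-local-dir-1.options.storageClass=gp3
  spark.kubernetes.executor.volumes.persistentVolumeClaim.spark-local-dir-1.options.sizeLimit=200Gi
  spark.kubernetes.executor.volumes.persistentVolumeClaim.spark-local-dir-1.mount.path=/data
  spark.kubernetes.executor.volumes.persistentVolumeClaim.spark-local-dir-1.mount.readOnly=false

  # 4) shuffle IO wrapper so a new executor re-registers shuffle blocks found on a reused PVC
  spark.shuffle.sort.io.plugin.class=org.apache.spark.shuffle.KubernetesLocalDiskShuffleDataIO

  # 5) retain decommissioned executor details and tolerate their fetch failures
  spark.stage.ignoreDecommissionFetchFailure=true
  spark.scheduler.maxRetainedRemovedDecommissionExecutors=100

  # optional tuning: threads used to migrate shuffle blocks during decommission
  spark.storage.decommission.shuffleBlocks.maxThreads=8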