Hi Rajesh,

We use it in production at scale. We run Spark on Kubernetes on AWS, and here are the key things we do (a rough sketch of the corresponding settings follows the list):

1) We run the driver on an on-demand node.
2) We have decommissioning configured along with the fallback option to S3; try the newer single-zone S3 offering (S3 Express One Zone) for this.
3) We use PVC-aware scheduling, i.e. Spark ensures executors try to reuse available storage volumes created by the driver before requesting new ones.
4) We have enabled the Kubernetes shuffle IO wrapper plugin, which allows a new executor to re-register the shuffle blocks it finds on a reused PVC. This ensures shuffle data from lost executors is served by the new executor that reuses the disk.
5) We also configure Spark to retain decommissioned executor details so that it can ignore intermittent shuffle fetch failures.
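To make that concrete, here is a rough sketch of the kind of settings involved, assuming Spark 3.4+ with the master, image, etc. supplied by spark-submit. The bucket path, volume name, size and the capacityType node label (which is specific to EKS managed node groups) are placeholders, so please verify property names and defaults against your Spark version and cluster setup:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("decommission-sketch")

  // 1) Pin the driver to an on-demand node (executors can stay on spot).
  .config("spark.kubernetes.driver.node.selector.eks.amazonaws.com/capacityType", "ON_DEMAND")

  // 2) Graceful decommissioning with shuffle/RDD block migration and an
  //    S3 fallback for blocks that cannot be migrated to peers in time.
  .config("spark.decommission.enabled", "true")
  .config("spark.storage.decommission.enabled", "true")
  .config("spark.storage.decommission.shuffleBlocks.enabled", "true")
  .config("spark.storage.decommission.rddBlocks.enabled", "true")
  .config("spark.storage.decommission.fallbackStorage.path", "s3a://my-bucket/spark-fallback/")

  // 3) Executors get on-demand PVCs used as local storage, and the driver
  //    owns and reuses PVCs left behind by lost executors.
  .config("spark.kubernetes.executor.volumes.persistentVolumeClaim.spark-local-dir-1.options.claimName", "OnDemand")
  .config("spark.kubernetes.executor.volumes.persistentVolumeClaim.spark-local-dir-1.options.sizeLimit", "100Gi")
  .config("spark.kubernetes.executor.volumes.persistentVolumeClaim.spark-local-dir-1.mount.path", "/data")
  .config("spark.kubernetes.executor.volumes.persistentVolumeClaim.spark-local-dir-1.mount.readOnly", "false")
  .config("spark.kubernetes.driver.ownPersistentVolumeClaim", "true")
  .config("spark.kubernetes.driver.reusePersistentVolumeClaim", "true")
  .config("spark.kubernetes.driver.waitToReusePersistentVolumeClaim", "true")

  // 4) Shuffle IO plugin that re-registers shuffle blocks found on a reused PVC.
  .config("spark.shuffle.sort.io.plugin.class", "org.apache.spark.shuffle.KubernetesLocalDiskShuffleDataIO")

  // 5) Remember removed decommissioned executors and tolerate fetch failures from them.
  .config("spark.stage.ignoreDecommissionFetchFailure", "true")
  .config("spark.scheduler.maxRetainedRemovedDecommissionExecutors", "20")

  .getOrCreate()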
Some of these are best effort. You could also tune the number of threads used for decommissioning, etc., based on your workload and runtime environment (a couple of those knobs are sketched at the end of this mail).

On Thu, 27 Jun 2024, 09:03 Rajesh Mahindra, <rjshmh...@gmail.com> wrote:

> Hi folks,
>
> I am planning to leverage the "Spark Decommission" feature in production
> since our company uses SPOT instances on Kubernetes. I wanted to get a
> sense of how stable the feature is for production usage and if anyone has
> thoughts around trying it out in production, especially in a Kubernetes
> environment.
>
> Thanks,
> Rajesh
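PS: the tuning knobs I had in mind above, as additions to the same builder sketch; the values are placeholders rather than recommendations, so check the defaults for your Spark version:

  // Threads used to migrate shuffle blocks off a decommissioning executor,
  // and how long Spark waits before forcing the decommissioning executor to exit.
  .config("spark.storage.decommission.shuffleBlocks.maxThreads", "16")
  .config("spark.executor.decommission.forceKillTimeout", "120s")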