Hey Rajesh,

From my experience it's a stable feature, but keep in mind that it does not guarantee you won't lose the data on the pods of the nodes receiving a spot kill. Once you get a spot kill, you have 120s before the node must be given back to the cloud provider. That is when the decommission script starts, and sometimes 120s is enough to migrate the shuffle/RDD blocks and sometimes it isn't. In the end it really depends on your workload and data.
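For reference, decommissioning and block migration are controlled by a handful of configs (a minimal sketch; these settings were added around Spark 3.1/3.2, and the s3a:// fallback path below is just a placeholder, not something from your setup):

    # Gracefully decommission executors instead of killing them outright
    --conf spark.decommission.enabled=true \
    # Migrate cached RDD and shuffle blocks off the node before it goes away
    --conf spark.storage.decommission.enabled=true \
    --conf spark.storage.decommission.rddBlocks.enabled=true \
    --conf spark.storage.decommission.shuffleBlocks.enabled=true \
    # Optional (Spark 3.2+): if no peer executor can take the blocks in time,
    # copy them to remote fallback storage instead of losing them
    --conf spark.storage.decommission.fallbackStorage.path=s3a://my-bucket/spark-fallback/

The fallback storage option can help when the 120s window is too short for peer-to-peer migration, at the cost of writing to object storage.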
Best regards,

Ahmed Khaldi
Solutions Architect
NetApp Limited
+33617424566 Mobile Phone
kah...@netapp.com

From: Rajesh Mahindra <rjshmh...@gmail.com>
Date: Tuesday, 18 June 2024 at 23:54
To: user@spark.apache.org <user@spark.apache.org>
Subject: Spark Decommission

Hi folks,

I am planning to leverage the "Spark Decommission" feature in production, since our company uses SPOT instances on Kubernetes. I wanted to get a sense of how stable the feature is for production usage, and whether anyone has thoughts on trying it out in production, especially in a Kubernetes environment.

Thanks,
Rajesh