Stage-level scheduling does not currently let you change configs per stage. We thought about this as a follow-on but never implemented it.

How many tasks are you running in the DL stage? The typical case is to run the ETL with lots of tasks, then do a mapPartitions and run your DL work. Before that mapPartitions you can do a repartition, if necessary, to get exactly the number of tasks you want (20). That way, even if maxExecutors=500, you will only ever need 20 (or whatever you repartition to), and Spark isn't going to ask for more than that.

Tom
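To make the arithmetic behind the repartition suggestion concrete, here is a rough sketch. It assumes a simplified model in which dynamic allocation targets about ceil(pending tasks / task slots per executor) executors, capped at maxExecutors. The helper `executors_requested` and the 5000-task ETL stage are hypothetical illustrations, not part of the Spark API or of this thread.

```python
import math


def executors_requested(pending_tasks: int, slots_per_executor: int, max_executors: int) -> int:
    """Simplified model (an assumption, not Spark's actual code):
    executors needed = ceil(pending tasks / task slots per executor),
    capped at the configured maximum."""
    if slots_per_executor < 1:
        raise ValueError("slots_per_executor must be at least 1")
    needed = math.ceil(pending_tasks / slots_per_executor)
    return min(needed, max_executors)


MAX_EXECUTORS = 500  # maxExecutors from the thread

# ETL stage: hypothetical 5000 tasks, 2 cores per executor, 1 CPU per task -> 2 slots.
etl = executors_requested(5000, 2, MAX_EXECUTORS)

# DL stage without repartition: still 5000 tasks, 1 GPU per executor and
# 1 GPU per task -> 1 slot, so the request climbs to the cap.
dl_unbounded = executors_requested(5000, 1, MAX_EXECUTORS)

# DL stage after repartitioning to 20 partitions: only 20 tasks exist.
dl_repartitioned = executors_requested(20, 1, MAX_EXECUTORS)

print(etl, dl_unbounded, dl_repartitioned)  # 500 500 20
```

Under this model the number of partitions in the GPU stage, rather than maxExecutors, is what bounds the number of GPU executors requested.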
On Thursday, November 3, 2022 at 11:10:31 AM CDT, Shay Elbaz <shay.el...@gm.com> wrote:

Thanks again Artemis, I really appreciate it. I have watched the video but did not find an answer.
Please bear with me for just one more iteration 🙂

Maybe I'll be more specific: Suppose I start the application with maxExecutors=500, executors.cores=2, because that's the amount of resources needed for the ETL part. But for the DL part I only need 20 GPUs. The SLS API only allows setting the resources per executor/task, so Spark would (try to) allocate up to 500 GPUs, assuming I configure the profile with 1 GPU per executor. So, the question is: how do I limit the stage resources to 20 GPUs total?

Thanks again,
Shay

From: Artemis User <arte...@dtechspace.com>
Sent: Thursday, November 3, 2022 5:23 PM
To: user@spark.apache.org <user@spark.apache.org>
Subject: [EXTERNAL] Re: Re: Stage level scheduling - lower the number of executors when using GPUs

Shay,

You may find this video helpful (with some API code samples that you are looking for): https://www.youtube.com/watch?v=JNQu-226wUc&t=171s

The issue here isn't how to limit the number of executors but how to request the right GPU-enabled executors dynamically. The executors used in pre-GPU stages should be returned to the resource manager when dynamic resource allocation is enabled (and with the right DRA policies). Hope this helps.

Unfortunately there isn't a lot of detailed documentation on this topic, since GPU acceleration is fairly new in Spark (and not as straightforward as in TF). I wish the Spark doc team could provide more details in the next release...

On 11/3/22 2:37 AM, Shay Elbaz wrote:

Thanks Artemis. We are not using Rapids, but rather using GPUs through the Stage Level Scheduling feature with ResourceProfile.
In Kubernetes you have to turn on shuffle tracking for dynamic allocation anyhow. The question is how we can limit the number of executors when building a new ResourceProfile, directly (API) or indirectly (some advanced workaround).

Thanks,
Shay

From: Artemis User <arte...@dtechspace.com>
Sent: Thursday, November 3, 2022 1:16 AM
To: user@spark.apache.org <user@spark.apache.org>
Subject: [EXTERNAL] Re: Stage level scheduling - lower the number of executors when using GPUs

Are you using Rapids for GPU support in Spark? A couple of options you may want to try:

- In addition to turning on dynamic allocation, you may also need to turn on the external shuffle service.
- It sounds like you are using Kubernetes. In that case, you may also need to turn on shuffle tracking.
- The "stages" are controlled by the APIs. The APIs for dynamic resource requests (change of stage) do exist, but only for RDDs (e.g. TaskResourceRequest and ExecutorResourceRequest).

On 11/2/22 11:30 AM, Shay Elbaz wrote:

Hi,

Our typical applications need fewer executors for a GPU stage than for a CPU stage. We are using dynamic allocation with stage level scheduling, and Spark tries to maximize the number of executors during the GPU stage as well, causing a bit of resource chaos in the cluster. This forces us either to use a lower value for 'maxExecutors' in the first place, at the cost of CPU stage performance, or to try to solve this at the Kubernetes scheduler level, which is not straightforward and doesn't feel like the right way to go.

Is there a way to effectively use fewer executors in Stage Level Scheduling? The API does not seem to include such an option, but maybe there is some more advanced workaround?

Thanks,
Shay