This sounds like a great question for the Google Dataproc folks (I know there was some interesting work being done around it, but I left before it was finished, so I don't want to provide a possibly incorrect answer).
If you're a GCP customer, try reaching out to their support for details.

On Mon, Nov 21, 2022 at 1:47 PM Mich Talebzadeh <mich.talebza...@gmail.com> wrote:

> I have not used standalone for a good while. The standard Dataproc uses
> YARN as the resource manager. The vanilla Dataproc is Google's answer to
> Hadoop on the cloud: move your analytics workload from on-premise to the
> cloud with little effort and the same look and feel. Google then
> introduced dynamic allocation of resources to cater for those apps that
> could not easily be migrated to Kubernetes (GKE). The doc states that
> without dynamic allocation, a job only asks for containers at the
> beginning; with dynamic allocation, it will remove containers, or ask
> for new ones, as necessary. This is still using YARN. See here
> <https://cloud.google.com/dataproc/docs/concepts/configuring-clusters/autoscaling#background_autoscaling_with_apache_hadoop_and_apache_spark>
>
> This approach was not necessarily very successful, as adding executors
> dynamically for larger workloads could freeze the Spark application
> itself. Reading the doc, it says the startup time for serverless is 60
> seconds, compared to 90 seconds for Dataproc on Compute Engine (the one
> where you set up your own Spark cluster on Dataproc tin boxes).
>
> Dataproc Serverless for Spark autoscaling
> <https://cloud.google.com/dataproc-serverless/docs/concepts/autoscaling>
> states that "Dataproc Serverless autoscaling is the default behavior,
> and uses Spark dynamic resource allocation
> <https://spark.apache.org/docs/latest/job-scheduling.html#dynamic-resource-allocation>
> to determine whether, how, and when to scale your workload". So the key
> point is not standalone mode, but the general reference to "Spark
> provides a mechanism to dynamically adjust the resources your
> application occupies based on the workload. This means that your
> application may give resources back to the cluster if they are no longer
> used and request them again later when there is demand. This feature is
> particularly useful if multiple applications share resources in your
> Spark cluster."
>
> Isn't this the standard Spark resource allocation? So why has this
> suddenly been elevated in Spark 3.2?
>
> Someone may give a more qualified answer here :)
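For concreteness, the dynamic resource allocation both docs point at is controlled by a handful of Spark properties. A minimal PySpark sketch, assuming Spark 3.x; the property names come from the Spark configuration docs, but the executor bounds are illustrative only:

    from pyspark.sql import SparkSession

    # Dynamic allocation: Spark requests executors while tasks are
    # backlogged and releases executors that sit idle.
    spark = (
        SparkSession.builder
        .appName("dynamic-allocation-sketch")
        .config("spark.dynamicAllocation.enabled", "true")
        # Illustrative bounds; tune per workload.
        .config("spark.dynamicAllocation.minExecutors", "1")
        .config("spark.dynamicAllocation.maxExecutors", "20")
        # Shuffle tracking (Spark 3.x) lets dynamic allocation work
        # without an external shuffle service, e.g. on Kubernetes.
        .config("spark.dynamicAllocation.shuffleTracking.enabled", "true")
        .getOrCreate()
    )

On a classic YARN-based Dataproc cluster, the same feature typically relies on the external shuffle service (spark.shuffle.service.enabled) rather than shuffle tracking.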
> On Mon, 21 Nov 2022 at 17:32, Stephen Boesch <java...@gmail.com> wrote:
>
>> Out of curiosity: are there functional limitations in Spark Standalone
>> that are of concern? YARN is more configurable for running non-Spark
>> workloads and for running multiple Spark jobs in parallel, but for a
>> single Spark job, standalone seems to launch more quickly and does not
>> miss any features. Are there specific limitations you are aware of or
>> have run into?
>>
>> stephen b
>>
>> On Mon, 21 Nov 2022 at 09:01, Mich Talebzadeh
>> <mich.talebza...@gmail.com> wrote:
>>
>>> Hi,
>>>
>>> I have not tested this myself, but Google have brought out Dataproc
>>> Serverless for Spark. In a nutshell, Dataproc Serverless lets you run
>>> Spark batch workloads without requiring you to provision and manage
>>> your own cluster. Specify workload parameters, and then submit the
>>> workload to the Dataproc Serverless service. The service will run the
>>> workload on a managed compute infrastructure, autoscaling resources as
>>> needed. Dataproc Serverless charges apply only to the time when the
>>> workload is executing. Google Dataproc is similar to Amazon EMR.
>>>
>>> So in short, you don't need to provision your own Dataproc cluster.
>>> One thing I noticed from the release doc
>>> <https://cloud.google.com/dataproc-serverless/docs/overview> is that
>>> the resource management is Spark based, as opposed to standard
>>> Dataproc, which is YARN based. It is available for Spark 3.2. My
>>> assumption is that by Spark based it means that Spark is running in
>>> standalone mode. Has there been much improvement in release 3.2 for
>>> standalone mode?
>>>
>>> Thanks
>>>
>>> view my Linkedin profile
>>> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>>>
>>> https://en.everybodywiki.com/Mich_Talebzadeh

--
Twitter: https://twitter.com/holdenkarau
Books (Learning Spark, High Performance Spark, etc.): https://amzn.to/2MaRAG9
YouTube Live Streams: https://www.youtube.com/user/holdenkarau
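Picking up Mich's description of the submission flow: a minimal sketch of submitting a serverless batch workload with the google-cloud-dataproc Python client, where the project, region, GCS path, and batch id are all placeholders:

    from google.cloud import dataproc_v1

    project, region = "my-project", "us-central1"  # placeholders

    # Dataproc endpoints are regional.
    client = dataproc_v1.BatchControllerClient(
        client_options={"api_endpoint": f"{region}-dataproc.googleapis.com:443"}
    )

    # No cluster to create: the service provisions and autoscales the
    # compute for the lifetime of this one batch workload.
    batch = dataproc_v1.Batch(
        pyspark_batch=dataproc_v1.PySparkBatch(
            main_python_file_uri="gs://my-bucket/job.py"  # placeholder
        )
    )

    operation = client.create_batch(
        parent=f"projects/{project}/locations/{region}",
        batch=batch,
        batch_id="example-batch-001",  # placeholder
    )
    print(operation.result().state)  # blocks until the workload finishes

The equivalent submission from the CLI is along the lines of gcloud dataproc batches submit pyspark gs://my-bucket/job.py --region=us-central1.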