This sounds like a great question for the Google Dataproc folks (I know
there was some interesting work being done around it, but I left before it
was finished, so I don't want to provide a possibly incorrect answer).

If you're a GCP customer, try reaching out to their support for details.

On Mon, Nov 21, 2022 at 1:47 PM Mich Talebzadeh <mich.talebza...@gmail.com>
wrote:

> I have not used standalone mode for a good while. Standard Dataproc uses
> YARN as the resource manager. Vanilla Dataproc is Google's answer to
> Hadoop on the cloud: move your analytics workload from on-premise to the
> cloud with little effort and with the same look and feel. Google then
> introduced dynamic allocation of resources to cater for those apps that
> could not easily be migrated to Kubernetes (GKE). The doc states that
> without dynamic allocation, Spark only asks for containers at the
> beginning of the job; with dynamic allocation, it will remove containers,
> or ask for new ones, as necessary. This is still using YARN. See here
> <https://cloud.google.com/dataproc/docs/concepts/configuring-clusters/autoscaling#background_autoscaling_with_apache_hadoop_and_apache_spark>
>
> This approach was not always successful, as adding executors dynamically
> for larger workloads could freeze the Spark application itself.
> Reading the doc, it says the startup time for Serverless is 60 seconds,
> compared with 90 seconds for Dataproc on Compute Engine (the one where
> you set up your own Spark cluster on Dataproc tin boxes).
>
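> For illustration, this is roughly how dynamic allocation is switched on
> for a YARN job; the jar name and executor counts below are made up, but
> the property names are standard Spark ones:
>
>   spark-submit \
>     --master yarn \
>     --conf spark.dynamicAllocation.enabled=true \
>     --conf spark.shuffle.service.enabled=true \
>     --conf spark.dynamicAllocation.minExecutors=2 \
>     --conf spark.dynamicAllocation.maxExecutors=20 \
>     my_app.jar
>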
> Dataproc Serverless for Spark autoscaling
> <https://cloud.google.com/dataproc-serverless/docs/concepts/autoscaling>
> states that "Dataproc Serverless autoscaling is the default behavior, and
> uses Spark dynamic resource allocation
> <https://spark.apache.org/docs/latest/job-scheduling.html#dynamic-resource-allocation>
> to determine whether, how, and when to scale your workload". So the key
> point is not standalone mode, but a general reference to the Spark docs:
> "Spark provides a mechanism to dynamically adjust the resources your
> application occupies based on the workload. This means that your
> application may give resources back to the cluster if they are no longer
> used and request them again later when there is demand. This feature is
> particularly useful if multiple applications share resources in your
> Spark cluster."
>
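> As an aside, classic dynamic allocation relies on the external shuffle
> service; since Spark 3.0 there is also shuffle tracking, which removes
> that requirement and may be what a serverless deployment leans on. A
> minimal sketch with illustrative values (property names are from the
> Spark docs):
>
>   spark.dynamicAllocation.enabled=true
>   spark.dynamicAllocation.shuffleTracking.enabled=true
>   spark.dynamicAllocation.minExecutors=2
>   spark.dynamicAllocation.maxExecutors=50
>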
> Isn't this the standard Spark resource allocation mechanism? So why has
> it suddenly been given prominence as of Spark 3.2?
>
> Someone may give a more qualified answer here :)
>
>
>    view my LinkedIn profile
> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
>
>
> On Mon, 21 Nov 2022 at 17:32, Stephen Boesch <java...@gmail.com> wrote:
>
>> Out of curiosity: are there functional limitations in Spark Standalone
>> that are of concern? YARN is more configurable for running non-Spark
>> workloads and for running multiple Spark jobs in parallel. But for a
>> single Spark job, standalone seems to launch more quickly and does not
>> miss any features. Are there specific limitations you are aware of /
>> have run into?
>>
>> stephen b
>>
>> On Mon, 21 Nov 2022 at 09:01, Mich Talebzadeh <mich.talebza...@gmail.com>
>> wrote:
>>
>>> Hi,
>>>
>>> I have not tested this myself, but Google has brought up Dataproc
>>> Serverless for Spark. In a nutshell, Dataproc Serverless lets you run
>>> Spark batch workloads without requiring you to provision and manage your
>>> own cluster. Specify workload parameters, and then submit the workload
>>> to the Dataproc Serverless service. The service will run the workload on
>>> a managed compute infrastructure, autoscaling resources as needed.
>>> Dataproc Serverless charges apply only to the time when the workload is
>>> executing. Google Dataproc is similar to Amazon EMR.
>>>
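>>> For example, submitting a batch workload is a single gcloud call with
>>> no cluster to create first (the file name and region here are
>>> illustrative):
>>>
>>>   gcloud dataproc batches submit pyspark my_job.py \
>>>       --region=us-central1
>>>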
>>> So in short you don't need to provision your own Dataproc cluster etc.
>>> One thing I noticed from the release doc
>>> <https://cloud.google.com/dataproc-serverless/docs/overview> is that the
>>> resource management is Spark-based, as opposed to standard Dataproc,
>>> which is YARN-based. It is available for Spark 3.2. My assumption is
>>> that by Spark-based it means that Spark is running in standalone mode.
>>> Has there been much improvement in release 3.2 for standalone mode?
>>>
>>> Thanks
>>>
>>>
>>>
>>>
>>>    view my LinkedIn profile
>>> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>>>
>>>
>>>  https://en.everybodywiki.com/Mich_Talebzadeh
>>>
>>>
>>>
>>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>>> any loss, damage or destruction of data or any other property which may
>>> arise from relying on this email's technical content is explicitly
>>> disclaimed. The author will in no case be liable for any monetary damages
>>> arising from such loss, damage or destruction.
>>>
>>>
>>>
>>

-- 
Twitter: https://twitter.com/holdenkarau
Books (Learning Spark, High Performance Spark, etc.):
https://amzn.to/2MaRAG9
YouTube Live Streams: https://www.youtube.com/user/holdenkarau
