Re: A simple example that demonstrates that a Spark distributed cluster is faster than Spark Local Standalone

Sean Owen Thu, 24 Sep 2020 12:59:32 -0700

If you have the same amount of resources (cores, memory, etc.) on one
machine, that is pretty much always going to be faster than using
those same resources split across several machines.
Even if you have somewhat more resources available on a cluster, the
distributed version could be slower if you are, for example,
bottlenecked on network I/O and leaving some resources underutilized.


Distributing isn't really going to make the workload consume fewer
resources; on the contrary, it makes it take more. However, it might make
the total wall-clock completion time go way down through parallelism.
How much you benefit from parallelism really depends on the problem,
the cluster, the input, etc. You may not see a speedup on this problem
until you hit a larger scale or modify the job so that it distributes
a little better.
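
To make the trade-off above concrete, here is a minimal sketch of a toy cost
model in Python. It is not how Spark schedules or measures anything: the `Job`
fields, the `estimate` function, and every constant (network bandwidth,
coordination overhead, job sizes) are assumptions chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class Job:
    cpu_seconds: float  # total CPU work if run on a single core (assumed)
    shuffle_gb: float   # data exchanged between tasks (assumed)


def estimate(job, machines, cores_per_machine,
             network_gb_per_s=0.1, coordination_s=5.0):
    """Return (wall_clock_seconds, total_core_seconds) under a toy model."""
    total_cores = machines * cores_per_machine
    # Perfectly parallel compute: an optimistic assumption.
    compute_s = job.cpu_seconds / total_cores
    # Share of shuffled data that must cross the network; zero on one machine.
    remote_fraction = (machines - 1) / machines
    # Assume aggregate network bandwidth grows with the number of machines.
    network_s = job.shuffle_gb * remote_fraction / (network_gb_per_s * machines)
    # Fixed, hypothetical cost for coordinating more than one machine.
    overhead_s = coordination_s if machines > 1 else 0.0
    wall_s = compute_s + network_s + overhead_s
    # Cores are held for the whole run, so total resource use is wall * cores.
    return wall_s, wall_s * total_cores


small = Job(cpu_seconds=60, shuffle_gb=1)
large = Job(cpu_seconds=6000, shuffle_gb=10)

for name, job in [("small", small), ("large", large)]:
    for machines, cores in [(1, 8), (4, 2), (4, 8)]:
        wall_s, core_s = estimate(job, machines, cores)
        print(f"{name}: {machines} x {cores} cores -> "
              f"wall {wall_s:.1f}s, core-seconds {core_s:.0f}")
```

Under these made-up numbers, the small job finishes soonest on the single
machine, even against the cluster with four times as many cores, while the
large job finishes sooner on the cluster but holds more core-seconds in total.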

On Thu, Sep 24, 2020 at 1:43 PM javaguy Java <javagu...@gmail.com> wrote:
>
> Hi,
>
> I made a post on Stack Overflow that I can't seem to make any headway on:
> https://stackoverflow.com/questions/63834379/spark-performance-local-faster-than-cluster
>
> Before someone starts making suggestions on changing the code: note that the
> code and example in the above post are from a Udemy course and are not my
> code. I am looking to take this dataset and code and execute them on a
> cluster. I want to see the value of Spark by seeing results, such that the
> job submitted to the Spark cluster runs faster than it does in Standalone.
>
> I am currently evaluating Spark, and I've thus far spent about a month of
> weekends of my free time trying to get a Spark cluster to show me improved
> performance in comparison to Spark Standalone, but I am not having success.
> After spending so much time on this, I am now looking for help, as
> I'm time constrained (in general I'm time constrained, not for a project or
> deadline re: Spark).
>
> If anyone can comment on what I need to make my example work faster on a 
> spark cluster vs standalone I'd appreciate it.
>
> Alternatively if someone can point me to a simple code example + dataset that 
> works better and illustrates the power of distributed spark I'd be happy to 
> use that instead - I'm not wedded to this example that I got from the course 
> - I'm just looking for the simple 5 min to 30 min example quick start that 
> shows the power of Spark distributed clusters.
>
> There's a higher-level question here, and one whose answer is not easy to
> find. There are many examples on Spark out there, but there is not a
> simple large dataset + code example that illustrates the performance gain of 
> Spark's cluster and distributed computing benefits vs just a single local 
> standalone; which is what someone in my position is looking for (someone who 
> makes architectural and platform decisions and is bandwidth / time 
> constrained and wants to see the power and advantages of Spark cluster and 
> distributed computing without spending weeks on the problem).
>
> I'm also willing to open this up to a consulting engagement if anyone is
> interested, as I'd expect it to be quick (either you have a simple example
> that just needs to be set up, etc., or it's easy for you to demonstrate
> cluster performance > standalone for this dataset).
>
> Thx

