I mean that local[*] = all cores on the machine, whereas in your example you seem to be choosing 8 cores per executor in the distributed case. You'd have 12 cores in your local case - which is still less than 2x8, but it's just the kind of thing to consider when comparing these setups.
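To make the comparison concrete, a minimal sketch of the two configurations under discussion; the master URL and app name are hypothetical, and spark.executor.cores mirrors the 8-cores-per-executor setting above:

    import org.apache.spark.sql.SparkSession

    // Local mode: local[*] claims every core on the single machine
    // (12 in the setup described above).
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("compare-setups")
      .getOrCreate()

    // The distributed alternative (not in the same JVM; getOrCreate()
    // would simply return the session created above):
    //   .master("spark://master-host:7077")   // hypothetical master URL
    //   .config("spark.executor.cores", "8")  // 8 cores per executor, so
    //                                         // 2 x 8 = 16 cores vs 12 locally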
Indeed, how well something parallelizes depends wholly on the code, input and cluster. There are trivial examples that parallelize perfectly but are equally useless to you, like SparkPi. You can also construct jobs that will never be faster on a cluster (a very small computation). What matters is understanding how your real problem executes.

On Fri, Sep 25, 2020 at 10:26 AM javaguy Java <javagu...@gmail.com> wrote:
>
> Thanks - that's great, I'll check out both spark-bench and SparkPi.
>
> I do have more than 8 cores in the local setup: 24 cores in total (12 per machine).
>
> However on AWS with the same cluster setup, that is not the case; I chose medium-size instances hoping that a much smaller instance size would still show me the benefits of the Spark cluster.
>
> Perhaps I'm not making it clear, but I'm not too interested in understanding and optimising someone else's code that has no material value to me; I'm interested in seeing a simple example of something working that I can then carry across to my own datasets, with a view to adopting the platform.
>
> Thx
>
> On Fri, Sep 25, 2020 at 2:29 PM Sean Owen <sro...@gmail.com> wrote:
>>
>> Maybe the better approach is to understand why your job isn't scaling - what does the UI show? Are the resources actually the same? For example, do you have more than 8 cores in the local setup?
>> Is there enough parallelism? For example, it doesn't look like the small input is repartitioned to at least the cluster's default parallelism.
>>
>> Something that should trivially parallelize? The SparkPi example.
>>
>> You can try tools like https://codait.github.io/spark-bench/ to generate large workloads.
>>
>> On Fri, Sep 25, 2020 at 1:03 AM javaguy Java <javagu...@gmail.com> wrote:
>> >
>> > Hi Sean,
>> >
>> > Thanks for your reply.
>> >
>> > I understand distribution and parallelism very well and have used them with other products like GridGain and various master-worker patterns; I just don't have a simple example working with Apache Spark, which is what I am looking for. I know Spark doesn't follow the others' parallelism paradigm, so I'm looking for a distributed example that illustrates Spark's distribution capabilities well - and correct, I want the total wall-clock completion time to go down.
>> >
>> > I think you misunderstood one thing re: the "several machines" blurb in my post. My Spark cluster has 2 identical machines, so it's not a split of cores and memory - it's a doubling of cores and memory.
>> >
>> > To recap: the Spark cluster on my home network is running 2x Macs with 32GB RAM each (so 64GB RAM in total) and the same processor on each; however, when I run this code example on just one Mac + Spark standalone + local[*], it is faster.
>> >
>> > I have subsequently moved my example to AWS, where I'm running two identical EC2 instances (so again double the RAM and cores) co-located in the same AZ, and the Spark cluster is still slower compared to Spark standalone on one of these EC2 instances :(
>> >
>> > Hence my posts to the Spark user group.
>> >
>> > I'm not wedded to this Udemy course example; I wish someone could just point me at an example with some quick code and a large public dataset and say "this runs faster on a cluster than standalone". I'd be happy to make a post myself for any new people interested in Spark.
>> >
>> > Thanks
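As a concrete illustration of the repartitioning Sean mentions above, a minimal self-contained sketch; the range input is a hypothetical stand-in for the course example's real data:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .master("local[*]") // or the cluster master URL
      .appName("repartition-check")
      .getOrCreate()

    // Stand-in for the course example's input; a real job would read a file,
    // and a small file often arrives in just one or two partitions.
    val df = spark.range(0L, 10000000L).toDF("id")

    // Spread the input across the default parallelism so every core gets
    // work, instead of leaving most of the cluster idle.
    val repartitioned = df.repartition(spark.sparkContext.defaultParallelism)
    println(repartitioned.rdd.getNumPartitions)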
>> >
>> > On Thu, Sep 24, 2020 at 9:58 PM Sean Owen <sro...@gmail.com> wrote:
>> >>
>> >> If you have the same amount of resources (cores, memory, etc.) on one machine, that is pretty much always going to be faster than using those same resources split across several machines.
>> >> Even if you have somewhat more resources available on a cluster, the distributed version could be slower if you are, for example, bottlenecking on network I/O and leaving some resources underutilized.
>> >>
>> >> Distributing isn't really going to make the workload consume less resource; on the contrary, it makes it take more. However, it might make the total wall-clock completion time go way down through parallelism. How much you benefit from parallelism really depends on the problem, the cluster, the input, etc. You may not see a speedup in this problem until you hit more scale or modify the job to distribute a little better.
>> >>
>> >> On Thu, Sep 24, 2020 at 1:43 PM javaguy Java <javagu...@gmail.com> wrote:
>> >> >
>> >> > Hi,
>> >> >
>> >> > I made a post on Stack Overflow that I can't seem to make any headway on: https://stackoverflow.com/questions/63834379/spark-performance-local-faster-than-cluster
>> >> >
>> >> > Before someone starts making suggestions on changing the code, note that the code and example in the above post are from a Udemy course and are not mine. I am looking to take this dataset and code and execute the same on a cluster; I want to see the value of Spark by having the job submitted to the Spark cluster run faster than it does standalone.
>> >> >
>> >> > I am currently evaluating Spark, and I've thus far spent about a month of weekends of my free time trying to get a Spark cluster to show improved performance compared to Spark standalone, without success. Having spent so much time on this, I am now looking for help here, as I'm time-constrained (in general, not for a particular project or deadline re: Spark).
>> >> >
>> >> > If anyone can comment on what I need to make my example run faster on a Spark cluster vs standalone, I'd appreciate it.
>> >> >
>> >> > Alternatively, if someone can point me to a simple code example + dataset that works better and illustrates the power of distributed Spark, I'd be happy to use that instead - I'm not wedded to this example from the course. I'm just looking for the simple 5- to 30-minute quick start that shows the power of Spark's distributed clusters.
>> >> >
>> >> > There's a higher-level question here, and one whose answer is not easy to find. There are many examples of Spark out there, but there is no simple large dataset + code example that illustrates the performance gain of Spark's cluster and distributed computing vs a single local standalone instance - which is what someone in my position is looking for (someone who makes architectural and platform decisions, is bandwidth/time constrained, and wants to see the advantages of Spark's cluster and distributed computing without spending weeks on the problem).
>> >> >
>> >> > I'm also willing to open this up to a consulting engagement if anyone is interested, as I'd expect it to be quick (either you have a simple example that just needs to be set up, or it's easy for you to demonstrate cluster performance > standalone for this dataset).
>> >> >
>> >> > Thx
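For anyone wanting the quick demonstration asked for above, a minimal SparkPi-style sketch - a stand-in written for this thread, not the course example or the bundled SparkPi source - that can be submitted unchanged against local[*] and against a cluster master to compare wall-clock times:

    import org.apache.spark.sql.SparkSession

    object PiSketch {
      def main(args: Array[String]): Unit = {
        // The master comes from spark-submit, so the same jar runs
        // against local[*] or spark://... without code changes.
        val spark = SparkSession.builder().appName("pi-sketch").getOrCreate()
        val sc = spark.sparkContext

        // Fixed total work so local and cluster timings are comparable;
        // one partition per available core.
        val n = 100000000L
        val slices = sc.defaultParallelism

        // Monte Carlo estimate of pi: count random points inside the
        // unit circle. Embarrassingly parallel, no shuffle until reduce.
        val inside = sc.parallelize(1L until n, slices).map { _ =>
          val x = Math.random() * 2 - 1
          val y = Math.random() * 2 - 1
          if (x * x + y * y <= 1) 1L else 0L
        }.reduce(_ + _)

        println(s"Pi is roughly ${4.0 * inside / n}")
        spark.stop()
      }
    }

Submitting it twice - once with --master local[*] and once with --master spark://<host>:7077 - makes any speedup (or the lack of one) directly visible, and n can be raised until the extra cluster cores start to pay off.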