Maybe the better approach is to understand why your job isn't scaling.
What does the UI show? Are the resources actually the same? For example,
do you have more than 8 cores in the local setup? Is there enough
parallelism? For example, it doesn't look like the small input is
repartitioned to at least the cluster parallelism / default parallelism.
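Concretely, here's the kind of check I mean. This is a minimal sketch,
not your job: the input path, header option, and app name are
placeholders for whatever the course example reads.

import org.apache.spark.sql.SparkSession

object ParallelismCheck {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("ParallelismCheck").getOrCreate()
    val sc = spark.sparkContext

    // Total task slots the scheduler can fill at once across the cluster.
    println(s"defaultParallelism = ${sc.defaultParallelism}")

    // A small file often loads as just 1-2 partitions, so only 1-2 cores
    // ever do work, no matter how many the cluster has.
    val df = spark.read.option("header", "true").csv("data/input.csv") // placeholder path
    println(s"input partitions = ${df.rdd.getNumPartitions}")

    // Spread the downstream work across at least every available core.
    val repartitioned = df.repartition(sc.defaultParallelism)
    println(s"after repartition = ${repartitioned.rdd.getNumPartitions}")

    spark.stop()
  }
}

If that first partition count comes back as 1 or 2, adding a second
machine can't help; the extra cores just sit idle.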
Something that should trivially parallelize? The SparkPi example. You
can try tools like https://codait.github.io/spark-bench/ to generate
large workloads. (A quick SparkPi-style sketch follows at the end of
this mail, below the quoted thread.)

On Fri, Sep 25, 2020 at 1:03 AM javaguy Java <javagu...@gmail.com> wrote:
>
> Hi Sean,
>
> Thanks for your reply.
>
> I understand distribution and parallelism very well and have used them
> with other products like GridGain and various master/worker patterns;
> I just don't have a simple example working with Apache Spark, which is
> what I am looking for. I know Spark doesn't follow the others'
> parallelism paradigm, so I'm looking for a distributed example that
> illustrates Spark's distribution capabilities well. And correct: I
> want the total wall-clock completion time to go down.
>
> I think you misunderstood one thing re: the "several machines" blurb
> in my post. My Spark cluster has 2 identical machines, so it's not a
> split of cores and memory; it's a doubling of cores and memory.
>
> To recap: the Spark cluster on my home network is running 2x Macs with
> 32GB RAM EACH (so 64GB RAM in total) and the same processor on each;
> however, when I run this code example on just one Mac + Spark
> standalone + local[*], it is faster.
>
> I have subsequently moved my example to AWS, where I'm running two
> identical EC2 instances (so again double the RAM and cores) co-located
> in the same AZ, and the Spark cluster is still slower than Spark
> standalone on one of these EC2 instances :(
>
> Hence my posts to the Spark user group.
>
> I'm not wedded to this Udemy course example; I wish someone could just
> point me at an example with some quick code and a large public dataset
> and say "this runs faster on a cluster than standalone". I'd be happy
> to make a post myself for any new people interested in Spark.
>
> Thanks
>
> On Thu, Sep 24, 2020 at 9:58 PM Sean Owen <sro...@gmail.com> wrote:
>>
>> If you have the same amount of resources (cores, memory, etc.) on one
>> machine, that is pretty much always going to be faster than using
>> those same resources split across several machines.
>> Even if you have somewhat more resources available on a cluster, the
>> distributed version could be slower if you are, for example,
>> bottlenecking on network I/O and leaving some resources
>> underutilized.
>>
>> Distributing isn't really going to make the workload consume fewer
>> resources; on the contrary, it makes it take more. However, it might
>> make the total wall-clock completion time go way down through
>> parallelism. How much you benefit from parallelism really depends on
>> the problem, the cluster, the input, etc. You may not see a speedup
>> in this problem until you hit more scale or modify the job to
>> distribute a little better.
>>
>> On Thu, Sep 24, 2020 at 1:43 PM javaguy Java <javagu...@gmail.com> wrote:
>> >
>> > Hi,
>> >
>> > I made a post on Stack Overflow that I can't seem to make any
>> > headway on:
>> > https://stackoverflow.com/questions/63834379/spark-performance-local-faster-than-cluster
>> >
>> > Before someone starts making suggestions on changing the code: note
>> > that the code and example in the above post are from a Udemy course
>> > and are not my code. I am looking to take this dataset and code and
>> > execute the same on a cluster. I want to see the value of Spark in
>> > the results: the job submitted to the Spark cluster should run
>> > faster than standalone.
>> >
>> > I am currently evaluating Spark and have thus far spent about a
>> > month of weekends of my free time trying to get a Spark cluster to
>> > show me improved performance compared to Spark standalone, but I am
>> > not having success. After spending so much time on this, I am now
>> > looking for help, as I'm time-constrained (in general I'm
>> > time-constrained, not for a project or deadline re: Spark).
>> >
>> > If anyone can comment on what I need to make my example run faster
>> > on a Spark cluster vs. standalone, I'd appreciate it.
>> >
>> > Alternatively, if someone can point me to a simple code example +
>> > dataset that works better and illustrates the power of distributed
>> > Spark, I'd be happy to use that instead. I'm not wedded to the
>> > example I got from the course; I'm just looking for the simple 5-
>> > to 30-minute quick start that shows the power of Spark's
>> > distributed clusters.
>> >
>> > There's a higher-level question here, and one whose answer is not
>> > obvious to find. There are many Spark examples out there, but there
>> > is no simple large dataset + code example that illustrates the
>> > performance gain of Spark's cluster and distributed computing
>> > benefits vs. a single local standalone instance, which is what
>> > someone in my position is looking for (someone who makes
>> > architectural and platform decisions, is bandwidth/time-
>> > constrained, and wants to see the power and advantages of Spark's
>> > cluster and distributed computing without spending weeks on the
>> > problem).
>> >
>> > I'm also willing to open this up to a consulting engagement if
>> > anyone is interested, as I'd expect it to be quick (either you have
>> > a simple example that just needs to be set up, or it's easy for you
>> > to demonstrate cluster performance > standalone for this dataset).
>> >
>> > Thx
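Here is the SparkPi-style sketch I mentioned at the top. It is
essentially the SparkPi example bundled with Spark's examples, lightly
trimmed; the object name is mine. It is embarrassingly parallel with no
shuffle, so wall-clock time should drop as cores are added:

import scala.math.random
import org.apache.spark.sql.SparkSession

// Monte Carlo estimate of pi, adapted from the bundled SparkPi example.
// Each task independently counts random points inside the unit circle,
// so the work splits cleanly across however many cores you have.
object PiEstimate {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("PiEstimate").getOrCreate()
    val slices = if (args.nonEmpty) args(0).toInt else 2
    val n = math.min(100000L * slices, Int.MaxValue).toInt
    val count = spark.sparkContext.parallelize(1 until n, slices).map { _ =>
      val x = random * 2 - 1
      val y = random * 2 - 1
      if (x * x + y * y <= 1) 1 else 0
    }.reduce(_ + _)
    println(s"Pi is roughly ${4.0 * count / (n - 1)}")
    spark.stop()
  }
}

Submit it with a large slice count (spark-submit --class PiEstimate
your.jar 500, say) so there are enough tasks to keep both machines
busy, then compare wall-clock time against --master local[*] on one
box.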