Maybe the better approach is to understand why your job isn't scaling.
What does the UI show? Are the resources actually the same? For
example, do you have more than 8 cores in the local setup?
Is there enough parallelism? For example, it doesn't look like the
small input is repartitioned to at least the cluster parallelism /
default parallelism; see the sketch just below.
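A minimal, untested sketch of what I mean; the input path and options
are placeholders for whatever your job actually reads. Reading a small
file often yields only one or two partitions, which caps parallelism no
matter how many cores the cluster has:

    import org.apache.spark.sql.SparkSession

    object RepartitionSketch {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder.appName("RepartitionSketch").getOrCreate()
        // A small CSV typically comes in as only 1-2 partitions.
        val df = spark.read.option("header", "true").csv(args(0))
        println(s"partitions before: ${df.rdd.getNumPartitions}")
        // Spread the rows across at least defaultParallelism partitions
        // so every executor core gets a task.
        val spread = df.repartition(spark.sparkContext.defaultParallelism)
        println(s"partitions after: ${spread.rdd.getNumPartitions}")
        spark.stop()
      }
    }

Compare the before/after partition counts against the total cores shown
in the UI; if partitions < cores, some cores are sitting idle.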

Something that should trivially parallelize: the SparkPi example,
roughly sketched below.
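From memory, the bundled SparkPi boils down to roughly this (treat it
as a paraphrase, not the exact shipped source; I've swapped in
scala.util.Random for portability). There is no shuffle and the work is
pure CPU, so wall-clock time should drop roughly in proportion to total
cores when you add the second machine:

    import scala.util.Random
    import org.apache.spark.sql.SparkSession

    object PiSketch {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder.appName("PiSketch").getOrCreate()
        val slices = if (args.nonEmpty) args(0).toInt else 100
        val n = math.min(100000L * slices, Int.MaxValue).toInt
        // Each partition independently throws darts at the unit square;
        // the fraction landing inside the unit circle estimates pi/4.
        val count = spark.sparkContext.parallelize(1 until n, slices).map { _ =>
          val x = Random.nextDouble() * 2 - 1
          val y = Random.nextDouble() * 2 - 1
          if (x * x + y * y <= 1) 1 else 0
        }.reduce(_ + _)
        println(s"Pi is roughly ${4.0 * count / (n - 1)}")
        spark.stop()
      }
    }

You can also just run the copy that ships with Spark, e.g.
./bin/spark-submit --class org.apache.spark.examples.SparkPi
--master <your master URL> examples/jars/spark-examples_*.jar 1000
(the jar path may vary by distribution). Crank up the argument until a
single machine is clearly saturated before comparing against the cluster.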

You can try tools like https://codait.github.io/spark-bench/ to
generate large workloads.

On Fri, Sep 25, 2020 at 1:03 AM javaguy Java <javagu...@gmail.com> wrote:
>
> Hi Sean,
>
> Thanks for your reply.
>
> I understand distribution and parallelism very well and have used them with
> other products like GridGain and various master/worker patterns, etc.; I just
> don't have a simple example working with Apache Spark, which is what I am
> looking for.  I know Spark doesn't follow those products' parallelism paradigm,
> so I'm looking for a distributed example that illustrates Spark's distribution
> capabilities well, and correct: I want the total wall-clock completion
> time to go down.
>
> I think you misunderstood one thing re: the "several machines" blurb in
> my post.  My Spark cluster has two identical machines, so it's not a
> split of cores and memory; it's a doubling of cores and memory.
>
> To recap, the Spark cluster on my home network is running two Macs with 32GB
> of RAM EACH (so 64GB of RAM in total) and the same processor on each; however,
> when I run this code example on just one Mac + Spark Standalone + local[*], it
> is faster.
>
> I have subsequently moved my example to AWS, where I'm running two
> identical EC2 instances (so again double the RAM and cores) co-located in the
> same AZ, and the Spark cluster is still slower compared to Spark Standalone on
> one of those EC2 instances :(
>
> Hence my posts to Spark user group.
>
> I'm not wedded to this Udemy course example; I wish someone could just point
> me at an example with some quick code and a large public dataset and say
> "this runs faster on a cluster than standalone."  I'd be happy to write a post
> myself for any new people interested in Spark.
>
> Thanks
>
> On Thu, Sep 24, 2020 at 9:58 PM Sean Owen <sro...@gmail.com> wrote:
>>
>> If you have the same amount of resources (cores, memory, etc.) on one
>> machine, that is pretty much always going to be faster than using
>> those same resources split across several machines.
>> Even if you have somewhat more resources available on a cluster, the
>> distributed version could be slower if you are, for example,
>> bottlenecked on network I/O and leaving some resources underutilized.
>>
>> Distributing isn't really going to make the workload consume fewer
>> resources; on the contrary, it makes it consume more. However, it might
>> make the total wall-clock completion time go way down through parallelism.
>> How much you benefit from parallelism really depends on the problem,
>> the cluster, the input, etc. You may not see a speedup on this problem
>> until you hit more scale or modify the job to distribute a little
>> better.
>>
>> On Thu, Sep 24, 2020 at 1:43 PM javaguy Java <javagu...@gmail.com> wrote:
>> >
>> > Hi,
>> >
>> > I made a post on stackoverflow that I can't seem to make any headway on
>> > https://stackoverflow.com/questions/63834379/spark-performance-local-faster-than-cluster
>> >
>> > Before anyone suggests changing the code: the code and example in the
>> > above post are from a Udemy course and are not mine. I am looking to take
>> > this dataset and code, execute the same on a cluster, and see the value of
>> > Spark in the results, i.e. the job submitted to the Spark cluster running
>> > faster than it does standalone.
>> >
>> > I am currently evaluating Spark, and I've thus far spent about a month of
>> > weekends of my free time trying to get a Spark cluster to show me improved
>> > performance in comparison to Spark Standalone, without success. After
>> > spending so much time on this, I am now looking for help, as I'm
>> > time-constrained (in general, not for a particular project or deadline
>> > re: Spark).
>> >
>> > If anyone can comment on what I need to do to make my example run faster
>> > on a Spark cluster vs. standalone, I'd appreciate it.
>> >
>> > Alternatively, if someone can point me to a simple code example + dataset
>> > that works better and illustrates the power of distributed Spark, I'd be
>> > happy to use that instead. I'm not wedded to this example that I got from
>> > the course; I'm just looking for the simple 5-to-30-minute quick
>> > start that shows the power of Spark's distributed clusters.
>> >
>> > There's a higher-level question here, and one whose answer is not obvious
>> > to find. There are many Spark examples out there, but there is no simple
>> > large-dataset + code example that illustrates the performance gain of
>> > Spark's cluster and distributed computing benefits over a single local
>> > standalone instance, which is what someone in my position is looking for
>> > (someone who makes architectural and platform decisions, is
>> > bandwidth/time constrained, and wants to see the power and advantages of
>> > Spark's cluster and distributed computing without spending weeks on the
>> > problem).
>> >
>> > I'm also willing to open this up to a consulting engagement if anyone is
>> > interested, as I'd expect it to be quick (either you have a simple example
>> > that just needs to be set up, or it's easy for you to demonstrate cluster
>> > performance > standalone on this dataset).
>> >
>> > Thx
>> >
>> >
>> >

---------------------------------------------------------------------
To unsubscribe e-mail: user-unsubscr...@spark.apache.org
