How many iterations are you running? Can you give exact details on the size of the dataset (how many data points, how many features)? Is it sparse or dense, and if sparse, how many non-zeroes? How many partitions does your data RDD have?
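If it helps, here's a minimal sketch for collecting those numbers, assuming your data is an RDD[LabeledPoint] named `data` (the name and helper are placeholders, adjust to your code):

    import org.apache.spark.mllib.linalg.SparseVector
    import org.apache.spark.mllib.regression.LabeledPoint
    import org.apache.spark.rdd.RDD

    // Print the size/shape numbers asked about above.
    def describe(data: RDD[LabeledPoint]): Unit = {
      println("data points: " + data.count())
      println("features:    " + data.first().features.size)
      println("partitions:  " + data.partitions.length)
      // For sparse vectors, also count the stored non-zeroes.
      val nnz = data.map { p =>
        p.features match {
          case sv: SparseVector => sv.indices.length.toLong
          case v                => v.size.toLong
        }
      }.reduce(_ + _)
      println("non-zeroes:  " + nnz)
    }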
For very small datasets, the scheduling overhead of shipping tasks across the cluster and delays due to stragglers can dominate the time spent actually doing the parallel computation. If you have too few partitions, you won't take full advantage of cluster parallelism; if you have too many, you introduce even more of the aforementioned overheads. (There's a short illustration at the bottom of this message.)

On Tue, Sep 2, 2014 at 11:24 AM, SK <skrishna...@gmail.com> wrote:
> Hi,
>
> I evaluated the runtime performance of some of the MLlib classification
> algorithms on a local machine and a cluster with 10 nodes. I used
> standalone mode and Spark 1.0.1 in both cases. Here are the results for
> the total runtime:
>
>                            Local      Cluster
>   Logistic regression    138 sec      336 sec
>   SVM                    138 sec      336 sec
>   Decision tree           50 sec      132 sec
>
> My dataset is quite small and my programs are very similar to the MLlib
> examples included in the Spark distribution. Why is the runtime on the
> cluster significantly higher (almost 3 times) than on the local machine,
> even though the former uses more memory and more nodes? Is it because of
> the communication overhead on the cluster? I would like to know if there
> is something I need to do to optimize performance on the cluster, or if
> others have been getting similar results.
>
> thanks
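To make the partitioning point above concrete, a minimal sketch (the partition count here is a placeholder, not a recommendation for your workload):

    // Rule of thumb: 2-4 partitions per core across the cluster. For
    // 10 nodes with, say, 4 cores each, that's roughly 80-160.
    val targetPartitions = 80              // hypothetical: tune for your cluster
    val balanced = data.repartition(targetPartitions)
    balanced.cache()  // iterative algorithms re-read the data every iteration

Caching matters here because logistic regression and SVM in MLlib make one full pass over the data per iteration, so an uncached or badly partitioned RDD pays the overhead repeatedly.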