How many iterations are you running? Can you give exact details on the size of the dataset (how many data points, how many features)? Is it sparse or dense, and if sparse, how many non-zeroes? How many partitions does your data RDD have?
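If it helps, here's a minimal sketch for collecting those numbers, assuming your data is an RDD[LabeledPoint] named `data` (the name and helper are placeholders, adjust to your code):

    import org.apache.spark.mllib.linalg.SparseVector
    import org.apache.spark.mllib.regression.LabeledPoint
    import org.apache.spark.rdd.RDD

    // Print the size/shape numbers asked about above.
    def describe(data: RDD[LabeledPoint]): Unit = {
      println("data points: " + data.count())
      println("features:    " + data.first().features.size)
      println("partitions:  " + data.partitions.length)
      // For sparse vectors, also count the stored non-zeroes.
      val nnz = data.map { p =>
        p.features match {
          case sv: SparseVector => sv.indices.length.toLong
          case v                => v.size.toLong
        }
      }.reduce(_ + _)
      println("non-zeroes:  " + nnz)
    }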
For very small datasets, the scheduling overhead of shipping tasks across the cluster and delays due to stragglers can dominate the time spent actually doing the parallel computation. If you have too few partitions, you won't take full advantage of cluster parallelism; if you have too many, you introduce even more of the aforementioned overheads. (There's a short illustration at the bottom of this message.)

On Tue, Sep 2, 2014 at 11:24 AM, SK <skrishna...@gmail.com> wrote:
> Hi,
>
> I evaluated the runtime performance of some of the MLlib classification
> algorithms on a local machine and a cluster with 10 nodes. I used
> standalone mode and Spark 1.0.1 in both cases. Here are the results for
> the total runtime:
>
>                            Local      Cluster
>   Logistic regression    138 sec      336 sec
>   SVM                    138 sec      336 sec
>   Decision tree           50 sec      132 sec
>
> My dataset is quite small and my programs are very similar to the MLlib
> examples included in the Spark distribution. Why is the runtime on the
> cluster significantly higher (almost 3 times) than on the local machine,
> even though the former uses more memory and more nodes? Is it because of
> the communication overhead on the cluster? I would like to know if there
> is something I need to do to optimize performance on the cluster, or if
> others have been getting similar results.
>
> thanks
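To make the partitioning point above concrete, a minimal sketch (the partition count here is a placeholder, not a recommendation for your workload):

    // Rule of thumb: 2-4 partitions per core across the cluster. For
    // 10 nodes with, say, 4 cores each, that's roughly 80-160.
    val targetPartitions = 80              // hypothetical: tune for your cluster
    val balanced = data.repartition(targetPartitions)
    balanced.cache()  // iterative algorithms re-read the data every iteration

Caching matters here because logistic regression and SVM in MLlib make one full pass over the data per iteration, so an uncached or badly partitioned RDD pays the overhead repeatedly.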