Also, what hardware are you running the cluster on? And what is the local
machine's hardware?


On Tue, Sep 2, 2014 at 11:57 AM, Evan R. Sparks <evan.spa...@gmail.com>
wrote:

> How many iterations are you running? Can you provide the exact details
> about the size of the dataset (how many data points, how many features)? Is
> it sparse or dense, and in the sparse case, how many non-zeroes? How many
> partitions does your data RDD have?
>
> For very small datasets the scheduling overheads of shipping tasks across
> the cluster and delays due to stragglers can dominate the time actually
> doing your parallel computation. If you have too few partitions, you won't
> be taking advantage of cluster parallelism, and if you have too many you're
> introducing even more of the aforementioned overheads.
>
>
>
> On Tue, Sep 2, 2014 at 11:24 AM, SK <skrishna...@gmail.com> wrote:
>
>> Hi,
>>
>> I evaluated the runtime performance of some of the MLlib classification
>> algorithms on a local machine and a cluster with 10 nodes. I used
>> standalone
>> mode and Spark 1.0.1 in both cases. Here are the results for the total
>> runtime:
>>                              Local      Cluster
>> Logistic regression          138 sec    336 sec
>> SVM                          138 sec    336 sec
>> Decision tree                 50 sec    132 sec
>>
>> My dataset is quite small, and my programs are very similar to the MLlib
>> examples included in the Spark distribution. Why is the runtime on the
>> cluster significantly higher (almost 3x) than on the local machine, even
>> though the former uses more memory and more nodes? Is it because of
>> communication overhead on the cluster? I would like to know if there is
>> something I need to do to optimize performance on the cluster, or if
>> others have been getting similar results.
>>
>> thanks
>>
>>
>>
>>
>> --
>> View this message in context:
>> http://apache-spark-user-list.1001560.n3.nabble.com/mllib-performance-on-cluster-tp13290.html
>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>
>>
>
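The scheduling-overhead point above can be illustrated with a back-of-envelope model. This is a toy sketch in plain Python, not Spark code, and all the numbers in it are hypothetical, not measurements from this thread: each task pays a fixed launch/shipping cost, tasks run in waves limited by the number of cores, and for a tiny dataset the per-task cost can swamp the actual compute.

```python
# Toy model (not Spark code): per-iteration wall time when every task
# carries a fixed scheduling/shipping overhead. All numbers below are
# hypothetical illustrations.

def iteration_time(work_sec, partitions, cores, overhead_per_task_sec):
    """Total compute is split evenly across `partitions` tasks; tasks run
    in waves of at most `cores` at a time, and every task pays a fixed
    launch overhead."""
    task_sec = work_sec / partitions + overhead_per_task_sec
    waves = -(-partitions // cores)  # ceiling division
    return waves * task_sec

# Tiny dataset: only 0.5 s of real work per iteration.
local = iteration_time(0.5, partitions=2, cores=2,
                       overhead_per_task_sec=0.005)
cluster = iteration_time(0.5, partitions=100, cores=40,
                         overhead_per_task_sec=0.1)
print(f"local={local:.3f}s cluster={cluster:.3f}s")
# With these made-up numbers the cluster iteration comes out slower
# (0.315 s vs 0.255 s) despite having 20x the cores, because overhead
# and wave count dominate the shrinking per-task compute.
```

Over 100+ iterations of an iterative algorithm like logistic regression or SVM, this per-iteration gap compounds, which is consistent with the roughly 3x slowdown reported above; fewer partitions (a common rule of thumb is 2-4 per available core) shrink the overhead term.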
