Re: Broadcast vs simple variable

2014-08-21 Thread Julien Naour
Thanks for the useful links.
Cheers,
Julien
2014-08-21 11:47 GMT+02:00 Yanbo Liang:
> In Spark/MLlib, task serialization such as cluster centers of k-means was
> replaced by broadcast variables due to performance.
> You can refer this PR https://github.com/apache/spark/pull/1427
> And curren

Re: Broadcast vs simple variable

2014-08-21 Thread Yanbo Liang
In Spark/MLlib, shipping data such as the k-means cluster centers through task serialization was replaced by broadcast variables for performance reasons. You can refer to this PR: https://github.com/apache/spark/pull/1427 The current k-means implementation in MLlib also benefits from sparse vector computation. http://spark-summit

Re: Broadcast vs simple variable

2014-08-21 Thread Julien Naour
My arrays are in fact Array[Array[Long]], roughly 17x15 (17 centers with 150 000 modalities; I'm working on qualitative variables), so they are pretty large. I'm working on making them smaller, since it's mostly a sparse matrix. Good to know nevertheless. Thanks, Julien Naour 2014-08-20
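
To illustrate the "mostly a sparse matrix" remark above: a table of counts that is mostly zeros can be stored as one index-to-value map per center. This is only a minimal Python sketch, not the poster's code. The helper names `to_sparse` and `from_sparse` and the tiny example data are hypothetical, chosen just to show the idea.

```python
def to_sparse(row):
    """Keep only the non-zero entries of a dense row as {index: value}."""
    return {i: v for i, v in enumerate(row) if v != 0}


def from_sparse(sparse_row, length):
    """Rebuild the dense row from its {index: value} form."""
    dense = [0] * length
    for i, v in sparse_row.items():
        dense[i] = v
    return dense


# Hypothetical data: 3 centers over 10 modalities, mostly zeros.
centers = [
    [0, 0, 4, 0, 0, 0, 0, 1, 0, 0],
    [0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
    [7, 0, 0, 0, 0, 0, 0, 0, 0, 2],
]

sparse_centers = [to_sparse(row) for row in centers]
stored = sum(len(row) for row in sparse_centers)

print(sparse_centers)
print(stored, "non-zero entries stored instead of", 3 * 10)
```

The saving grows with the share of zeros. For a dense table the map form would be larger than the plain array, so this only pays off when most entries are empty.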

Re: Broadcast vs simple variable

2014-08-20 Thread Patrick Wendell
For large objects, it will be more efficient to broadcast them. If your array is small, it won't really matter. How many centers do you have? Unless you are finding that you have very large tasks (and Spark will print a warning about this), it could be okay to just reference it directly. On Wed, Aug
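
The trade-off described above can be sketched with a rough back-of-the-envelope model. It is plain Python, not Spark code. It assumes that a variable referenced directly in a closure is serialized with every task, while a broadcast variable is shipped once per executor and reused there. The function names, the task and executor counts, and the table size are all hypothetical.

```python
import pickle


def bytes_shipped_per_task(centers, num_tasks):
    """Direct reference: the centers travel with every serialized task."""
    return len(pickle.dumps(centers)) * num_tasks


def bytes_shipped_broadcast(centers, num_executors):
    """Broadcast: the centers are sent once per executor, then reused."""
    return len(pickle.dumps(centers)) * num_executors


# Hypothetical table: 17 centers x 1000 modalities.
centers = [[0] * 1000 for _ in range(17)]
payload = len(pickle.dumps(centers))

# Hypothetical job shape: 200 tasks spread over 4 executors.
per_task = bytes_shipped_per_task(centers, num_tasks=200)
broadcast = bytes_shipped_broadcast(centers, num_executors=4)

print("payload bytes:", payload)
print("direct reference:", per_task)
print("broadcast:", broadcast)
```

In this simplified model the ratio is just tasks per executor, so the gain only matters when the payload is large. In Spark itself the broadcast route means wrapping the array with `SparkContext.broadcast` and reading it through `.value` inside the map.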

Broadcast vs simple variable

2014-08-20 Thread Julien Naour
Hi, I have a question about broadcast. I'm working on a clustering algorithm close to KMeans. It seems that KMeans broadcasts the cluster centers at each step. For the moment I just keep my centers in an Array that I reference directly in my map at each step. Could it be more efficient to use broadcast instead