Thanks for the useful links.
Cheers,
Julien
2014-08-21 11:47 GMT+02:00 Yanbo Liang :
In Spark/MLlib, task serialization of data such as the cluster centers of k-means was replaced by broadcast variables for performance reasons.
You can refer to this PR: https://github.com/apache/spark/pull/1427
Also, the current k-means implementation in MLlib benefits from sparse vector computing.
http://spark-summit
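The sparse-vector point can be illustrated without Spark: if a center has few non-zero entries, storing it as an index-to-value map and using the expansion ||x - c||^2 = ||x||^2 + ||c||^2 - 2<x, c> makes the distance computation proportional to the number of non-zeros rather than the full dimension. A minimal plain-Python sketch (function names are mine, not MLlib's API, which uses Breeze vectors internally):

```python
def sq_dist_dense(x, c):
    """O(d): touches every coordinate, including all the zeros."""
    return sum((xi - ci) ** 2 for xi, ci in zip(x, c))

def sq_dist_sparse(x, c, norm_x_sq, norm_c_sq):
    """O(nnz): x and c are {index: value} dicts of non-zero entries.
    With precomputed squared norms, only indices present in BOTH
    vectors contribute to the dot product."""
    smaller, larger = (x, c) if len(x) <= len(c) else (c, x)
    dot = sum(v * larger.get(i, 0.0) for i, v in smaller.items())
    return norm_x_sq + norm_c_sq - 2.0 * dot

# Same pair of mostly-zero vectors, dense and sparse:
x_dense = [1.0, 0.0, 0.0, 2.0, 0.0]
c_dense = [0.0, 0.0, 3.0, 2.0, 0.0]
x = {0: 1.0, 3: 2.0}
c = {2: 3.0, 3: 2.0}
nx = sum(v * v for v in x.values())  # ||x||^2 = 5
nc = sum(v * v for v in c.values())  # ||c||^2 = 13

assert sq_dist_dense(x_dense, c_dense) == sq_dist_sparse(x, c, nx, nc)
```

Squared norms of the centers only need to be computed once per iteration, which is part of why the sparse path pays off.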
My arrays are in fact Array[Array[Long]], roughly 17 x 150 000 (17 centers
with 150 000 modalities; I'm working on qualitative variables), so they are
pretty large. I'm working on making them smaller; it's mostly a sparse
matrix.
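For scale, a dense 17 x 150 000 array of Longs is around 20 MB of raw data per serialized closure, which is exactly the payload size at which broadcast (or a sparse encoding) starts to matter. A back-of-the-envelope check (plain Python; 8 bytes per Long, ignoring JVM object overhead, which only makes the dense case worse; the 1% non-zero ratio is an assumption for illustration):

```python
# Rough payload size of the dense centers matrix.
CENTERS = 17
MODALITIES = 150_000
BYTES_PER_LONG = 8

dense_bytes = CENTERS * MODALITIES * BYTES_PER_LONG
print(f"dense: {dense_bytes / 1e6:.1f} MB")  # 20.4 MB

# If only ~1% of entries are non-zero, an (index, value) encoding needs
# roughly 16 bytes per stored entry (8-byte index + 8-byte value).
nnz_ratio = 0.01  # assumed sparsity, for illustration only
sparse_bytes = int(CENTERS * MODALITIES * nnz_ratio) * 16
print(f"sparse (~1% nnz): {sparse_bytes / 1e6:.2f} MB")  # 0.41 MB
```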
Good things to know nevertheless.
Thanks,
Julien Naour
2014-08-20
For large objects, it will be more efficient to broadcast them. If your array
is small it won't really matter. How many centers do you have? Unless you
are finding that you have very large tasks (and Spark will print a warning
about this), it could be okay to just reference it directly.
On Wed, Aug
Hi,
I have a question about broadcast. I'm working on a clustering algorithm
close to k-means. It seems that KMeans broadcasts the cluster centers at each
step. For the moment I just use my centers as an Array that I reference
directly in my map at each step. Could it be more efficient to use broadcast
instead?