Re: KMeans with large clusters Java Heap Space

2015-03-19 Thread mvsundaresan
Thanks Derrick. When I count the unique terms, the number is quite small, so I added this:

  val tfidf_features = lines.flatMap(x => x._2.split(" ").filter(_.length > 2)).distinct().count().toInt
  val hashingTF = new HashingTF(tfidf_features)
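For context, a minimal sketch of how that resized HashingTF could feed the rest of a TF-IDF + KMeans pipeline. It assumes `lines` is an RDD of (id, text) pairs and `hashingTF` is the instance defined above; the cluster count and iteration count are placeholder values, not ones from the thread.

  import org.apache.spark.mllib.feature.{HashingTF, IDF}
  import org.apache.spark.mllib.clustering.KMeans

  // Tokenize the same way the feature count was computed, one term sequence per document.
  val terms = lines.map(x => x._2.split(" ").filter(_.length > 2).toSeq)
  // Term-frequency vectors are now only tfidf_features wide instead of 2^20.
  val tf = hashingTF.transform(terms).cache()
  // Re-weight by inverse document frequency.
  val tfidf = new IDF().fit(tf).transform(tf)
  // Placeholder settings: k = 100 clusters, 20 iterations.
  val model = KMeans.train(tfidf, 100, 20)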

Re: KMeans with large clusters Java Heap Space

2015-01-30 Thread derrickburns
By default, HashingTF turns each document into a sparse vector in R^(2^20), i.e. a space of roughly one million dimensions. The current Spark clusterer turns each sparse vector into a dense vector with a million entries when it is added to a cluster. Hence, the memory needed grows as the number of clusters times roughly 8 MB (2^20 entries at 8 bytes per double).
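As a back-of-the-envelope sketch of that arithmetic (the cluster count k below is a hypothetical example, not a value from the thread):

  // Rough memory estimate for dense cluster centers in the default feature space.
  val dims = 1 << 20                    // default HashingTF dimensionality (2^20)
  val bytesPerCenter = dims * 8L        // one 8-byte Double per dimension, about 8 MB
  val k = 1000                          // hypothetical cluster count
  val totalBytes = k * bytesPerCenter   // about 8 GB of dense centers for k = 1000
  println(f"${totalBytes / math.pow(1024, 3)}%.1f GB for $k dense centers")

This is why shrinking the HashingTF dimensionality to the actual vocabulary size, as in the reply above, brings the heap usage back under control.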