Thanks Derrick. When I count the unique terms, the number is very small, so I added
this...
import org.apache.spark.mllib.feature.HashingTF

// count the distinct terms longer than 2 characters and use that as the feature dimension
val tfidf_features = lines.flatMap(x => x._2.split(" ").filter(_.length > 2)).distinct().count().toInt
val hashingTF = new HashingTF(tfidf_features)
--
By default, HashingTF turns each document into a sparse vector in R^(2^20),
i.e. a roughly million-dimensional space. The current Spark clusterer turns each
sparse vector into a dense vector with a million entries when it is added to a
cluster. Hence, the memory needed grows as the number of clusters times roughly
8 MB (2^20 doubles at 8 bytes each).
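A minimal sketch of the workaround this implies, assuming `lines` is an RDD of
(id, text) pairs as in the snippet above; the 1 << 16 feature count, k = 10, and
20 iterations are illustrative values, not recommendations:

import org.apache.spark.mllib.feature.{HashingTF, IDF}
import org.apache.spark.mllib.clustering.KMeans

// tokenize; `lines` is assumed to be an RDD[(String, String)] of (id, text)
val docs = lines.map(_._2.split(" ").toSeq)

// cap the hashed feature space well below the 2^20 default so each dense
// cluster center costs ~0.5 MB instead of ~8 MB
val hashingTF = new HashingTF(1 << 16)
val tf = hashingTF.transform(docs).cache()
val tfidf = new IDF().fit(tf).transform(tf)

// each of the k centers is densified, so memory grows as k * numFeatures * 8 bytes
val model = KMeans.train(tfidf, 10, 20)

Shrinking numFeatures (or sizing it from the distinct-term count, as in the
earlier snippet) keeps the densified centroids small without changing the rest
of the pipeline.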