I'm trying to cluster short text messages using KMeans. After training the
KMeans model, I want to get the top terms (5 to 10) for each cluster. How do
I get that using clusterCenters?
The full code is here:
http://apache-spark-user-list.1001560.n3.nabble.com/KMeans-with-large-clusters-Java-Heap-Space-td21432.html
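One possible approach, as a minimal sketch rather than a definitive answer: each entry of clusterCenters is a vector of per-feature weights, so the top terms of a cluster are the features with the largest weights in its center. The sketch below is plain Scala with no Spark dependency. `centers` stands in for `model.clusterCenters.map(_.toArray)`, and `indexToTerm` is a hypothetical reverse lookup that you would have to build yourself, for example by calling `hashingTF.indexOf(term)` on every vocabulary term, because HashingTF does not keep the original terms. Hash collisions mean one index can stand for several terms.

```scala
object TopTerms {
  // centers: one weight array per cluster (assumed to come from model.clusterCenters)
  // indexToTerm: hypothetical feature-index -> term lookup built by the caller
  // n: how many terms to report per cluster
  def topTerms(centers: Array[Array[Double]],
               indexToTerm: Map[Int, String],
               n: Int): Array[Seq[String]] =
    centers.map { center =>
      center.zipWithIndex
        .sortWith(_._1 > _._1) // heaviest dimensions first
        .take(n)
        .map { case (_, idx) => indexToTerm.getOrElse(idx, s"<index $idx>") }
        .toSeq
    }
}
```

Sorting the whole center is fine for a sketch; with a very large feature space a partial selection of the n largest weights would be cheaper.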
--
Thanks Derrick. When I count the unique terms, the number is very small, so I
added this:
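import org.apache.spark.mllib.feature.HashingTF

// Count the distinct terms longer than 2 characters and use that as the hash size
val tfidf_features = lines
  .flatMap(x => x._2.split(" ").filter(_.length > 2))
  .distinct().count().toInt
val hashingTF = new HashingTF(tfidf_features)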
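A point worth checking with this sizing, offered as a rough estimate rather than a statement about this dataset: when the number of hash buckets equals the number of distinct terms, some terms will share a bucket. The sketch below assumes the hash spreads terms uniformly, which is an idealization, and uses the standard occupancy formula to estimate how many distinct buckets are actually hit. The object and method names are made up for illustration.

```scala
object HashCollisions {
  // Expected number of distinct buckets hit when n terms are hashed
  // uniformly at random into m buckets: m * (1 - (1 - 1/m)^n)
  def expectedOccupied(n: Long, m: Long): Double =
    m * (1.0 - math.pow(1.0 - 1.0 / m, n.toDouble))

  // Share of the vocabulary size that is lost because terms share buckets
  def lostFraction(n: Long, m: Long): Double =
    1.0 - expectedOccupied(n, m) / n
}
```

Under that assumption, setting the bucket count equal to the term count loses roughly a third of the distinct dimensions, while a bucket count several times larger loses far fewer.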
--
View this message in context:
http://apache-spark-user-list
I'm trying to cluster short text messages using HashingTF and IDF with L2
normalization. The data looks like this:
id, msg
1, some text1
2, some more text2
3, sample text 3
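For rows shaped like the sample above, a minimal parsing sketch could look like this. It is plain Scala and not taken from the linked code. The names `ParseRows`, `parse` and `tokens` are hypothetical, and the token rule simply mirrors the split-on-space, length-greater-than-2 filter used elsewhere in this thread.

```scala
object ParseRows {
  // Split "id, msg" at the first comma only, so commas inside the message survive
  def parse(line: String): Option[(String, String)] =
    line.split(",", 2) match {
      case Array(id, msg) => Some((id.trim, msg.trim))
      case _              => None // no comma: treat the line as malformed
    }

  // Split on spaces and keep only terms longer than 2 characters
  def tokens(msg: String): Seq[String] =
    msg.split(" ").filter(_.length > 2).toSeq
}
```

Note that the header row `id, msg` also parses as a pair, so it would need to be filtered out before building features.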
The input data file is 1.7 MB with 10K rows. It runs (very slowly; it took 3
hrs) for up to 20 clusters, but when I ask for 200 clusters gett