Spark MLLib KMeans Top Terms

2015-03-19 Thread mvsundaresan
I'm trying to cluster short text messages using KMeans. After training the KMeans model, I want to get the top terms (5-10) for each cluster. How do I get them using clusterCenters? The full code is here: http://apache-spark-user-list.1001560.n3.nabble.com/KMeans-with-large-clusters-Java-Heap-Space-td21432.html
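
One possible approach, offered as a sketch rather than a confirmed answer from the thread: HashingTF is one-way, so a cluster center only gives weights per hashed index, not per term. If the vocabulary is still available, each term can be re-hashed to build an index-to-term lookup, and the heaviest indices of each center can then be read back as terms. The names TopTerms, topTerms, vocabulary and indexOf below are hypothetical. The code is plain Scala so it runs without Spark; with MLlib the assumption is that one would pass model.clusterCenters.map(_.toArray) as the centers and hashingTF.indexOf as the index function.

```scala
// Sketch only: names and wiring are assumptions, not taken from the thread.
object TopTerms {
  /** For each center, return up to n vocabulary terms whose feature index
    * carries the largest positive weight in that center. */
  def topTerms(
      centers: Seq[Array[Double]],   // assumed: model.clusterCenters.map(_.toArray)
      vocabulary: Seq[String],       // assumed: the distinct terms seen in training
      indexOf: String => Int,        // assumed: term => hashingTF.indexOf(term)
      n: Int): Seq[Seq[String]] = {
    // Invert the hash: feature index -> terms mapped to it.
    // Colliding terms share one index and cannot be told apart.
    val termsByIndex: Map[Int, Seq[String]] = vocabulary.distinct.groupBy(indexOf)

    centers.map { center =>
      center.zipWithIndex
        .filter { case (weight, index) => weight > 0.0 && termsByIndex.contains(index) }
        .sortBy { case (weight, _) => -weight }          // heaviest features first
        .flatMap { case (_, index) => termsByIndex(index) }
        .take(n)
        .toSeq
    }
  }
}
```

For a toy check, take the vocabulary Seq("spark", "kmeans", "heap", "cluster") with each term's position used as its index, and the centers Array(0.1, 0.9, 0.0, 0.5) and Array(0.7, 0.0, 0.2, 0.0). Asking for the top 2 terms gives "kmeans", "cluster" for the first center and "spark", "heap" for the second. With a small hash space, collisions make the recovered terms ambiguous, so the result is only as reliable as the hashing is collision-free.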

Re: KMeans with large clusters Java Heap Space

2015-03-19 Thread mvsundaresan
Thanks Derrick. When I count the unique terms, the number is very small, so I added this:

    val tfidf_features = lines.flatMap(x => x._2.split(" ").filter(_.length > 2)).distinct().count().toInt
    val hashingTF = new HashingTF(tfidf_features)

KMeans with large clusters Java Heap Space

2015-01-29 Thread mvsundaresan
Trying to cluster small text messages using HashingTF and IDF with L2 normalization. The data looks like this:

    id, msg
    1, some text1
    2, some more text2
    3, sample text 3

The input data file is 1.7 MB with 10K rows. It runs (very slowly; it took 3 hours) for up to 20 clusters, but when I ask for 200 clusters gett