You need to change `== 1` to `== i`, so that each pass of the loop selects the tweets assigned to cluster `i` instead of always testing against cluster 1. Also, `println(t)` inside `foreach` runs on the workers, so the output goes to the executor logs rather than your driver console, which may not be what you want. Try the following:
noSets.filter(t => model.predict(Utils.featurize(t)) == i).collect().foreach(println)

-Xiangrui

On Sat, Mar 7, 2015 at 3:20 PM, Pierce Lamb <richard.pierce.l...@gmail.com> wrote:
> Hi all,
>
> I'm very new to machine learning algorithms and Spark. I'm following the
> Twitter Streaming Language Classifier found here:
>
> http://databricks.gitbooks.io/databricks-spark-reference-applications/content/twitter_classifier/README.html
>
> Specifically this code:
>
> http://databricks.gitbooks.io/databricks-spark-reference-applications/content/twitter_classifier/scala/src/main/scala/com/databricks/apps/twitter_classifier/ExamineAndTrain.scala
>
> Except I'm trying to run it in batch mode on some tweets it pulls out
> of Cassandra, in this case 200 total tweets.
>
> As the example shows, I am using this object for "vectorizing" a set of
> tweets:
>
> object Utils {
>   val numFeatures = 1000
>   val tf = new HashingTF(numFeatures)
>
>   /**
>    * Create feature vectors by turning each tweet into bigrams of
>    * characters (an n-gram model) and then hashing those to a
>    * length-1000 feature vector that we can pass to MLlib.
>    * This is a common way to decrease the number of features in a
>    * model while still getting excellent accuracy (otherwise every
>    * pair of Unicode characters would potentially be a feature).
>    */
>   def featurize(s: String): Vector = {
>     tf.transform(s.sliding(2).toSeq)
>   }
> }
>
> Here is my code, which is modified from ExamineAndTrain.scala:
>
> val noSets = rawTweets.map(set => set.mkString("\n"))
>
> val vectors = noSets.map(Utils.featurize).cache()
> vectors.count()
>
> val numClusters = 5
> val numIterations = 30
>
> val model = KMeans.train(vectors, numClusters, numIterations)
>
> for (i <- 0 until numClusters) {
>   println(s"\nCLUSTER $i")
>   noSets.foreach { t =>
>     if (model.predict(Utils.featurize(t)) == 1) {
>       println(t)
>     }
>   }
> }
>
> This code runs, and each cluster prints "Cluster 0", "Cluster 1", etc.
> with nothing printing beneath. If I flip
>
> model.predict(Utils.featurize(t)) == 1 to
> model.predict(Utils.featurize(t)) == 0
>
> the same thing happens, except every tweet is printed beneath every cluster.
>
> Here is what I intuitively think is happening (please correct my
> thinking if it's wrong): This code turns each tweet into a vector,
> randomly picks some clusters, then runs k-means to group the tweets (at
> a really high level, the clusters, I assume, would be common
> "topics"). As such, when it checks each tweet to see if model.predict
> == 1, different sets of tweets should appear under each cluster (and
> because it's checking the training set against itself, every tweet
> should be in a cluster). Why isn't it doing this? Either my
> understanding of what k-means does is wrong, my training set is too
> small, or I'm missing a step.
>
> Any help is greatly appreciated.
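
For completeness, a minimal sketch of the full corrected loop, assuming the same `model`, `noSets`, `numClusters`, and `Utils.featurize` as in the quoted code (imports and Spark setup not repeated). The filter runs on the workers, and collect() brings the matching tweets back so println runs on the driver and the output shows up on your console:

for (i <- 0 until numClusters) {
  println(s"\nCLUSTER $i")
  // Keep only the tweets whose predicted cluster is i,
  // then pull them back to the driver and print them.
  noSets.filter(t => model.predict(Utils.featurize(t)) == i)
        .collect()
        .foreach(println)
}

Since noSets is scanned once per cluster here, caching it (noSets.cache()) before the loop would avoid recomputing it on every iteration.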