Hi all,
I want to try the TF-IDF functionality in MLlib.
I can feed it words and generate the tf and tf-idf RDD[Vector]s using the code
below.
But how do I map these vectors back to the words, with their counts and tf-idf
values, for presentation?
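import org.apache.spark.mllib.feature.{HashingTF, IDF}
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.rdd.RDD

val sentsTmp = sqlContext.sql("SELECT text FROM sentenceTable")
// getString(0) reads the text column; Row.toString would wrap it in brackets
val documents: RDD[Seq[String]] = sentsTmp.map(_.getString(0).split(" ").toSeq)
val hashingTF = new HashingTF()
val tf: RDD[Vector] = hashingTF.transform(documents)
tf.cache()  // tf is used twice: by IDF.fit and by idf.transform
val idf = new IDF().fit(tf)
val tfidf: RDD[Vector] = idf.transform(tf)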
It looks like I can get the indices of the terms using something like
J = wordListRDD.map(w => hashingTF.indexOf(w))
where wordListRDD is an RDD holding the distinct words from the word sequences
used to compute tf.
But how do I do the equivalent of
Counts = J.map(j => tf.counts(j)) ?
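To make the question concrete, the lookup I have in mind could be sketched roughly as below. This is only a minimal sketch under some assumptions: that an MLlib Vector can be read by position with apply(i), that zipping the RDDs is valid because tf and tfidf are derived element by element from the documents RDD, and that hash collisions can be ignored (two words hashing to the same index would share one count). The names wordScores and wordScoresRDD are hypothetical helpers, not part of MLlib.

```scala
import org.apache.spark.mllib.feature.HashingTF
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.rdd.RDD

// Hypothetical helper: for one document, pair each distinct word with the
// values stored at its hashed index in that document's tf and tf-idf vectors.
def wordScores(hashingTF: HashingTF,
               words: Seq[String],
               tfVec: Vector,
               tfidfVec: Vector): Seq[(String, Double, Double)] =
  words.distinct.map { w =>
    val j = hashingTF.indexOf(w)   // same index the transform wrote to
    (w, tfVec(j), tfidfVec(j))     // (word, count in this document, tf-idf)
  }

// Hypothetical helper: apply the lookup to every document.
// Assumes tf and tfidf were computed from `documents` without reordering,
// so the three RDDs line up partition by partition.
def wordScoresRDD(hashingTF: HashingTF,
                  documents: RDD[Seq[String]],
                  tf: RDD[Vector],
                  tfidf: RDD[Vector]): RDD[Seq[(String, Double, Double)]] =
  documents.zip(tf).zip(tfidf).map { case ((words, tfVec), tfidfVec) =>
    wordScores(hashingTF, words, tfVec, tfidfVec)
  }
```

With the values from my code, wordScoresRDD(hashingTF, documents, tf, tfidf).take(5).foreach(println) would then print (word, count, tf-idf) triples per document. The counts are per document rather than corpus-wide, since each tf vector describes a single document.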
Thanks,
Ron