/Someone will correct me if I'm wrong./ Actually, TF-IDF scores terms for a given document, and specifically TF does. Internally, these things hold a Vector (hopefully sparse) representing all the possible words (up to 2²⁰) per document. So each document, after applying TF, is transformed into a Vector. `indexOf` gives a term's index in that Vector.
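To make the idea concrete, here is a minimal, Spark-free sketch of what that hashed-term-frequency vector amounts to. The names are mine and the hashing is only illustrative (Spark's actual `HashingTF` uses its own non-negative-mod hashing internally), but the shape is the same: each term hashes to an index in a 2²⁰-wide feature space, and the document becomes a sparse index → count mapping.

```scala
object HashingSketch {
  // Feature-space size; Spark's HashingTF defaults to 2^20 features
  val numFeatures: Int = 1 << 20

  // Force a hash into [0, numFeatures), even for negative hashCodes
  private def nonNegativeMod(x: Int, mod: Int): Int = {
    val r = x % mod
    if (r < 0) r + mod else r
  }

  // Illustrative analogue of HashingTF.indexOf: hash the term to a vector index
  def indexOf(term: String): Int = nonNegativeMod(term.hashCode, numFeatures)

  // Illustrative analogue of HashingTF.transform on one document:
  // a sparse vector represented as a map of index -> term frequency
  def transform(doc: Seq[String]): Map[Int, Double] =
    doc.groupBy(indexOf).map { case (i, ts) => i -> ts.size.toDouble }

  def main(args: Array[String]): Unit = {
    val doc = Seq("spark", "mllib", "spark")
    val tf  = transform(doc)
    // Reading the frequency of one term back out via its index
    println(tf(indexOf("spark"))) // 2.0
  }
}
```

So the vector itself never stores the word, only the hashed index — which is exactly why getting "back to words" takes extra bookkeeping, as discussed below.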
So you can ask for the frequency of all the terms in *a doc* by looping over the doc's terms and reading the value held in the vector at the index returned by `indexOf`. The problem you'll face in this case is that with the current implementation it's hard to retrieve the document back, because the result you get is only an RDD[Vector]... so which item in your RDD is actually the document you want?

I faced the same problem (for a demo I did at Devoxx on the Wikipedia data), hence I've put in a repo an updated version of the TF-IDF code that lets it hold a reference to the original document: https://github.com/andypetrella/TF-IDF

If you use this impl (which I need to find some time to integrate into Spark :-/ ) you can build a pair RDD of (Path, Vector), for instance. Then this pair RDD can be searched (filter + take) for the doc you need, and finally you ask for the freq (or even, afterwards, the tf-idf score).

HTH

andy

On Thu Nov 20 2014 at 1:14:24 AM Daniel, Ronald (ELS-SDG) <r.dan...@elsevier.com> wrote:

> Hi all,
>
> I want to try the TF-IDF functionality in MLlib.
> I can feed it words and generate the tf and idf RDD[Vector]s, using the code below.
> But how do I get this back to words and their counts and tf-idf values for presentation?
>
> val sentsTmp = sqlContext.sql("SELECT text FROM sentenceTable")
> val documents: RDD[Seq[String]] = sentsTmp.map(_.toString.split(" ").toSeq)
> val hashingTF = new HashingTF()
> val tf: RDD[Vector] = hashingTF.transform(documents)
> tf.cache()
> val idf = new IDF().fit(tf)
> val tfidf: RDD[Vector] = idf.transform(tf)
>
> It looks like I can get the indices of the terms using something like
>
> J = wordListRDD.map(w => hashingTF.indexOf(w))
>
> where wordList is an RDD holding the distinct words from the sequence of words used to come up with tf.
> But how do I do the equivalent of
>
> Counts = J.map(j => tf.counts(j)) ?
>
> Thanks,
> Ron
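For what it's worth, the pair idea from the reply above can be sketched without Spark at all. This is a hypothetical, simplified version (plain Scala collections standing in for an RDD[(String, Vector)], and terms used directly as keys instead of hashed indices), just to show the filter-then-lookup pattern of keeping the document's path next to its frequency vector:

```scala
object PairLookupSketch {
  // Stand-in for a sparse TF vector: term -> frequency for one document
  def termFrequencies(doc: Seq[String]): Map[String, Double] =
    doc.groupBy(identity).map { case (t, ts) => t -> ts.size.toDouble }

  // "filter + take" analogue on a (path, document) corpus:
  // find the doc by its path, then read one term's frequency out of its vector
  def freqOf(corpus: Seq[(String, Seq[String])],
             path: String,
             term: String): Option[Double] =
    corpus.collectFirst { case (p, doc) if p == path =>
      termFrequencies(doc).getOrElse(term, 0.0)
    }

  def main(args: Array[String]): Unit = {
    val corpus = Seq(
      "wiki/Spark.txt" -> Seq("spark", "cluster", "spark"),
      "wiki/TFIDF.txt" -> Seq("weight", "term")
    )
    // Keeping the path alongside the vector is what makes this lookup possible
    println(freqOf(corpus, "wiki/Spark.txt", "spark")) // Some(2.0)
    println(freqOf(corpus, "wiki/Missing.txt", "spark")) // None
  }
}
```

In Spark terms, `corpus` would be the pair RDD of (Path, Vector), `collectFirst` would be the `filter` + `take(1)`, and the map lookup would be a read at `hashingTF.indexOf(term)` in the document's vector.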