/Someone will correct me if I'm wrong./

Actually, TF-IDF scores terms for a given document, and so does TF
specifically. Internally, these transformers hold one Vector per document
(hopefully a sparse one) spanning all possible hashed terms (up to 2²⁰ by
default). So after applying TF, each document is transformed into a Vector.
`indexOf` gives a term's index into that Vector.

So you can ask the frequency of all the terms in *a doc* by looping over the
doc's terms and reading the value held in the vector at the index returned
by `indexOf`.
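
For example, here's a minimal local sketch of that lookup (no RDD needed;
the sample terms are only illustrative):

    import org.apache.spark.mllib.feature.HashingTF

    // look up each term's frequency at the index returned by indexOf
    val hashingTF = new HashingTF()   // 2^20 features by default
    val docTerms = Seq("spark", "tfidf", "spark")
    val docVector = hashingTF.transform(docTerms)
    val freqs = docTerms.distinct
      .map(t => t -> docVector(hashingTF.indexOf(t)))
      .toMap
    // freqs: Map(spark -> 2.0, tfidf -> 1.0)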

The problem you'll face in this case is that, with the current
implementation, it's hard to get back to the original document: all you
have is an RDD[Vector]... so which item in your RDD is actually the
document you want?
I faced the same problem (for a demo I did at Devoxx on the Wikipedia
data), hence I've published in a repo an updated version of the TF-IDF code
that lets it hold a reference to the original document:
https://github.com/andypetrella/TF-IDF

If you use this impl (which I still need to find some time to integrate
into Spark :-/ ) you can build a pair RDD of (Path, Vector), for instance.
This pair RDD can then be searched (filter + take) for the doc you need,
and finally you can ask for the frequency (or even, after IDF, the tf-idf
score).
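
If you'd rather stay on stock MLlib in the meantime, a workaround is to zip
the paths back onto the vectors, since transform produces one vector per
input doc, in order. This is only a sketch: the names and sample data are
mine, and sc is the usual spark-shell context.

    import org.apache.spark.mllib.feature.HashingTF
    import org.apache.spark.mllib.linalg.Vector
    import org.apache.spark.rdd.RDD

    val docs: RDD[(String, Seq[String])] = sc.parallelize(Seq(
      "doc/a" -> Seq("spark", "cluster", "spark"),
      "doc/b" -> Seq("hadoop", "cluster")
    ))
    val hashingTF = new HashingTF()
    // zip is safe here: both sides come from docs via map, so the
    // partitions and element counts line up one-to-one
    val byPath: RDD[(String, Vector)] =
      docs.map(_._1).zip(hashingTF.transform(docs.map(_._2)))

    // filter + take to find the doc, then ask for a freq via indexOf
    val (path, vec) = byPath.filter(_._1 == "doc/a").take(1).head
    val freq = vec(hashingTF.indexOf("spark"))   // 2.0

The same filter + take works on the tf-idf scores if you zip the paths with
the output of idf.transform(tf) instead.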

HTH

andy

On Thu Nov 20 2014 at 1:14:24 AM Daniel, Ronald (ELS-SDG) <
r.dan...@elsevier.com> wrote:

>  Hi all,
>
> I want to try the TF-IDF functionality in MLlib.
>
> I can feed it words and generate the tf and idf RDD[Vector]s, using the
> code below.
>
> But how do I get this back to words and their counts and tf-idf values for
> presentation?
>
> val sentsTmp = sqlContext.sql("SELECT text FROM sentenceTable")
> val documents: RDD[Seq[String]] = sentsTmp.map(_.toString.split(" ").toSeq)
> val hashingTF = new HashingTF()
> val tf: RDD[Vector] = hashingTF.transform(documents)
> tf.cache()
> val idf = new IDF().fit(tf)
> val tfidf: RDD[Vector] = idf.transform(tf)
>
> It looks like I can get the indices of the terms using something like
>
> J = wordListRDD.map(w => hashingTF.indexOf(w))
>
> where wordList is an RDD holding the distinct words from the sequence of
> words used to come up with tf.
>
> But how do I do the equivalent of
>
> Counts = J.map(j => tf.counts(j)) ?
>
> Thanks,
>
> Ron