Hi Dan,

In
https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/feature/HashingTF.scala,
you can see Spark uses Utils.nonNegativeMod(term.##, numFeatures) to compute
the index of a term in the feature vector.

It's also mentioned in the doc that it "maps a sequence of terms to their
term frequencies using the hashing trick."
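
Each line of your output is just a SparseVector printed as (size, [indices],
[values]); 1048576 = 2^20 is the default numFeatures. Since the same
nonNegativeMod(term.##, numFeatures) hash is exposed as HashingTF.indexOf,
you can hash a word yourself and read its weight out of the vector (the
mapping only goes word -> index; different words can collide on the same
index, so you can't go back from an index to a unique word). A rough,
untested sketch reusing the documents/hashingTF/tfidf values from your
snippet:

    // Pair each document with its tf-idf vector; zip should work here
    // because both transforms are row-wise maps that keep the row order.
    val wordWeights = documents.zip(tfidf).map { case (terms, vec) =>
      // indexOf(term) == Utils.nonNegativeMod(term.##, numFeatures)
      terms.distinct.map(term => term -> vec(hashingTF.indexOf(term)))
    }
    wordWeights.collect().foreach(println)

If zip complains about mismatched partition sizes, zipWithIndex on both RDDs
and a join on the index is the safer route.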

Thanks


On Wed, May 6, 2015 at 12:44 PM, Dan Dong <dongda...@gmail.com> wrote:

> Hi, All,
>   When I try to follow the document about tfidf from:
> http://spark.apache.org/docs/latest/mllib-feature-extraction.html
>
>      val conf = new SparkConf().setAppName("TFIDF")
>      val sc=new SparkContext(conf)
>
>      val documents = sc.textFile("hdfs://cluster-test-1:9000/user/ubuntu/textExample.txt").map(_.split(" ").toSeq)
>      val hashingTF = new HashingTF()
>      val tf= hashingTF.transform(documents)
>      tf.cache()
>      val idf = new IDF().fit(tf)
>      val tfidf = idf.transform(tf)
>      val rdd=tfidf.map { vec => vec}
>      rdd.saveAsTextFile("/user/ubuntu/aaa")
>
> I got the following 3 lines of output, which correspond to the 3 lines of my
> input file (each line can be viewed as a separate document):
>
> (1048576,[3211,72752,119839,413342,504006,714241],[1.3862943611198906,0.6931471805599453,0.0,0.6931471805599453,0.6931471805599453,0.6931471805599453])
>
>
> (1048576,[53232,96852,109270,119839],[0.6931471805599453,0.6931471805599453,0.6931471805599453,0.0])
>
>
> (1048576,[3139,5740,119839,502586,503762],[0.6931471805599453,0.6931471805599453,0.0,0.6931471805599453,0.6931471805599453])
>
>     But how do I interpret this? How do I match words to the tf-idf values? E.g.:
> word1->1.3862943611198906
> word2->0.6931471805599453
> ......
>
> In general, how should people interpret/analyze the "tfidf" produced by the
> following? Thanks!
> val tfidf = idf.transform(tf)
>
>   Cheers,
>   Dan
>
>
>


-- 
Best
Ai
