Hi Dan,

In https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/feature/HashingTF.scala, you can see that Spark uses Utils.nonNegativeMod(term.##, numFeatures) to locate a term, i.e. each term is hashed to an index in [0, numFeatures).
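For example, to match a word to its tf-idf value you can hash it the same way. A rough sketch, reusing the vals (documents, hashingTF, tfidf) from your snippet below; HashingTF.indexOf is just a wrapper around that same nonNegativeMod call, and the default numFeatures is 1 << 20 = 1048576, which is the leading number in your output:

  // Which slot of the 1048576-dimensional vector did "word1" hash to?
  val idx = hashingTF.indexOf("word1")

  // tf-idf weight of "word1" in every document (0.0 if the word is absent).
  val word1Weights = tfidf.map(vec => vec(idx))

  // Or build a word -> tf-idf map per document. zip lines up row-for-row here
  // because both transforms are plain per-partition maps over `documents`.
  val perDoc = documents.zip(tfidf).map { case (terms, vec) =>
    terms.distinct.map(t => t -> vec(hashingTF.indexOf(t))).toMap
  }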
It's also mentioned in the doc that it "Maps a sequence of terms to their term frequencies using the hashing trick."

Thanks

On Wed, May 6, 2015 at 12:44 PM, Dan Dong <dongda...@gmail.com> wrote:

> Hi, All,
>   When I try to follow the document about tfidf from:
> http://spark.apache.org/docs/latest/mllib-feature-extraction.html
>
> val conf = new SparkConf().setAppName("TFIDF")
> val sc = new SparkContext(conf)
>
> val documents = sc.textFile("hdfs://cluster-test-1:9000/user/ubuntu/textExample.txt").map(_.split(" ").toSeq)
> val hashingTF = new HashingTF()
> val tf = hashingTF.transform(documents)
> tf.cache()
> val idf = new IDF().fit(tf)
> val tfidf = idf.transform(tf)
> val rdd = tfidf.map { vec => vec }
> rdd.saveAsTextFile("/user/ubuntu/aaa")
>
> I got the following 3 lines of output, which correspond to the 3 lines of my
> input file (each line can be viewed as a separate document):
>
> (1048576,[3211,72752,119839,413342,504006,714241],[1.3862943611198906,0.6931471805599453,0.0,0.6931471805599453,0.6931471805599453,0.6931471805599453])
>
> (1048576,[53232,96852,109270,119839],[0.6931471805599453,0.6931471805599453,0.6931471805599453,0.0])
>
> (1048576,[3139,5740,119839,502586,503762],[0.6931471805599453,0.6931471805599453,0.0,0.6931471805599453,0.6931471805599453])
>
> But how do I interpret this? How do I match words to the tfidf values? E.g.:
> word1 -> 1.3862943611198906
> word2 -> 0.6931471805599453
> ......
>
> In general, how should people interpret/analyze "tfidf" from the
> following? Thanks!
> val tfidf = idf.transform(tf)
>
> Cheers,
> Dan
>
>

--
Best
Ai