Hi all,

I am trying to follow the TF-IDF documentation at: http://spark.apache.org/docs/latest/mllib-feature-extraction.html
My code:

    val conf = new SparkConf().setAppName("TFIDF")
    val sc = new SparkContext(conf)
    val documents = sc.textFile("hdfs://cluster-test-1:9000/user/ubuntu/textExample.txt")
      .map(_.split(" ").toSeq)
    val hashingTF = new HashingTF()
    val tf = hashingTF.transform(documents)
    tf.cache()
    val idf = new IDF().fit(tf)
    val tfidf = idf.transform(tf)
    val rdd = tfidf.map { vec => vec }
    rdd.saveAsTextFile("/user/ubuntu/aaa")

I got the following three lines of output, corresponding to the three lines of my input file (each line can be viewed as a separate document):

    (1048576,[3211,72752,119839,413342,504006,714241],[1.3862943611198906,0.6931471805599453,0.0,0.6931471805599453,0.6931471805599453,0.6931471805599453])
    (1048576,[53232,96852,109270,119839],[0.6931471805599453,0.6931471805599453,0.6931471805599453,0.0])
    (1048576,[3139,5740,119839,502586,503762],[0.6931471805599453,0.6931471805599453,0.6931471805599453,0.6931471805599453,0.6931471805599453])

But how should I interpret this? How do I match words to their TF-IDF values? E.g.:

    word1 -> 1.3862943611198906
    word2 -> 0.6931471805599453
    ......

In general, how should people interpret/analyze the "tfidf" produced by the following?

    val tfidf = idf.transform(tf)

Thanks!

Cheers,
Dan
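P.S. For context, here is how I currently understand the word-to-index mapping. This is only a sketch under my assumptions: I am on Spark 1.x mllib, where `HashingTF.indexOf(term)` computes `nonNegativeMod(term.##, numFeatures)` with a default `numFeatures` of 2^20 = 1048576 (which matches the leading 1048576 in my output). The object name `TfidfIndexSketch` and the helper `weightOf` are my own illustrative names, not Spark API:

```scala
// Sketch only: reproduce (what I believe is) HashingTF's word -> column-index
// mapping, and use it to look up a word's weight in a printed
// (size, indices, values) sparse-vector triple.
object TfidfIndexSketch {
  val numFeatures = 1048576 // HashingTF default (2^20); matches the output above

  // Same computation as Spark's Utils.nonNegativeMod: reduce a hash to [0, mod)
  def nonNegativeMod(x: Int, mod: Int): Int = {
    val raw = x % mod
    if (raw < 0) raw + mod else raw
  }

  // My understanding of hashingTF.indexOf(word) in Spark 1.x mllib
  def indexOf(term: Any): Int = nonNegativeMod(term.##, numFeatures)

  // Given the indices/values arrays printed for one document, find a word's
  // tf-idf weight; words that hash to an absent index get 0.0
  def weightOf(word: String, indices: Array[Int], values: Array[Double]): Double = {
    val i = java.util.Arrays.binarySearch(indices, indexOf(word))
    if (i >= 0) values(i) else 0.0
  }

  def main(args: Array[String]): Unit = {
    println(s"'hello' hashes to column ${indexOf("hello")}")
  }
}
```

If that is right, each output line is a SparseVector printed as (size, indices, values), and matching a word to its value means hashing the word and searching the indices array. Is that the intended approach, or is there a built-in way to recover the word-to-value mapping?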