Hi,
I read this document,
http://spark.apache.org/docs/1.2.1/mllib-feature-extraction.html, and tried
to build a TF-IDF model of my documents.
I have a list of documents. Each word is represented as an Int, and each
document is listed on one line:
doc_name, int1, int2...
doc_name, int3, int4...
This is how I load my documents:
val documents: RDD[Seq[Int]] =
  sc.objectFile[(String, Seq[Int])](s"$sparkStore/documents").map(_._2).cache()
Then I did:
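import org.apache.spark.mllib.feature.{HashingTF, IDF}
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.rdd.RDD

val hashingTF = new HashingTF()
val tf: RDD[Vector] = hashingTF.transform(documents)
val idf = new IDF().fit(tf)
val tfidf = idf.transform(tf)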
I wrote the tfidf output to a text file to try to understand its structure:
FileUtils.writeLines(new File("tfidf.out"),
tfidf.collect().toList.asJavaCollection)
What I get is something like:
(1048576,[0,4,7,8,10,13,17,21....],[...some float numbers...])
...
I think it's a tuple with 3 elements:
- I have no idea what the 1st element is...
- I think the 2nd element is a list of the words
- I think the 3rd element is a list of the tf-idf values of the words in the
previous list
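One way I could check these guesses is to build a small vector by hand and
print its fields next to its string form. This is only a sketch: it assumes
the elements of tfidf are MLlib SparseVector instances (my guess from the
printed format), and the toy vector and the name "example" are made up for
illustration, not taken from my data.

```scala
import org.apache.spark.mllib.linalg.{SparseVector, Vector, Vectors}

// Hypothetical toy vector: 8 slots, non-zero entries at positions 0, 4 and 7.
val example: Vector = Vectors.sparse(8, Array(0, 4, 7), Array(0.5, 1.2, 2.0))

example match {
  case sv: SparseVector =>
    // Print each field separately so it can be matched against the string form.
    println(s"size    = ${sv.size}")
    println(s"indices = ${sv.indices.mkString(",")}")
    println(s"values  = ${sv.values.mkString(",")}")
  case other =>
    println(s"not a sparse vector, size = ${other.size}")
}

// String form of the same vector, for comparison with the lines in tfidf.out.
println(example)
```

The same match could be run on tfidf.first() to compare against the first
line of tfidf.out.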
Please help me understand this structure.
Thanks,
David