I have a dataset that contains DocID, WordID and frequency (count), as shown below. Note that the first three numbers represent (1) the number of documents, (2) the number of words in the vocabulary and (3) the total number of words in the collection.
    189
    1430
    12300
    1 2 1
    1 39 1
    1 42 3
    1 77 1
    1 95 1
    1 96 1
    2 105 1
    2 108 1
    3 133 3

What I want to do is to read the data (ignoring the first three lines), combine the words per document, and finally represent each document as a vector that contains the frequencies of its WordIDs. Based on the above dataset, the representations of documents 1, 2 and 3 would be (note that vocab_size can be extracted from the second line of the data):

    val data = Array(
      Vectors.sparse(vocab_size, Seq((2, 1.0), (39, 1.0), (42, 3.0), (77, 1.0), (95, 1.0), (96, 1.0))),
      Vectors.sparse(vocab_size, Seq((105, 1.0), (108, 1.0))),
      Vectors.sparse(vocab_size, Seq((133, 3.0))))

The problem is that I am not quite sure how to read the .txt.gz file as an RDD and create an Array of sparse vectors as described above. Please note that I actually want to pass the data array to the PCA transformer.
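For reference, a minimal sketch of one way to do this with the RDD API follows. It assumes an existing SparkContext named sc, a placeholder file path "docword.txt.gz", and that the WordIDs can be used directly as vector indices, as in the sample representation above; sc.textFile reads gzipped files transparently.

    import org.apache.spark.mllib.linalg.{Vector, Vectors}

    // sc.textFile decompresses .gz files on the fly ("docword.txt.gz" is a placeholder path).
    val lines = sc.textFile("docword.txt.gz")

    // The second header line holds the vocabulary size.
    val vocab_size = lines.take(3)(1).trim.toInt

    // Skip the three header lines and parse each remaining line into (DocID, (WordID, count)).
    val triples = lines.zipWithIndex()
      .filter { case (_, idx) => idx >= 3 }
      .map { case (line, _) =>
        val Array(docId, wordId, count) = line.trim.split("\\s+").map(_.toInt)
        (docId, (wordId, count.toDouble))
      }

    // Group the (WordID, count) pairs per document and build one sparse vector per document.
    val docVectors = triples.groupByKey()
      .sortByKey()
      .map { case (_, wordCounts) => Vectors.sparse(vocab_size, wordCounts.toSeq) }

    // Collect into the desired Array[Vector] (only safe when the result fits in driver memory).
    val data: Array[Vector] = docVectors.collect()

Note that org.apache.spark.mllib.feature.PCA can also be fit on the RDD[Vector] directly, e.g. new PCA(k).fit(docVectors), which avoids collecting everything to the driver first.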