If I am understanding your problem correctly, you have the data in text files, but the files are gzipped? If so, Spark can read a gzipped file directly: textFile() decompresses .gz input transparently. Sorry if I didn't understand your question.
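A minimal sketch of the parsing and grouping you describe, using plain Scala collections so it runs standalone; with Spark you would start from sc.textFile("docword.txt.gz") (the path is a placeholder) and apply the same combinators to the RDD, finishing with Vectors.sparse:

```scala
// Sketch: parse the docword format and build per-document sparse entries.
// With Spark, replace `lines` with sc.textFile("docword.txt.gz") — textFile
// reads gzip transparently — and use the same combinators on the RDD.
val lines = Seq(
  "189", "1430", "12300",
  "1 2 1", "1 39 1", "1 42 3", "1 77 1", "1 95 1", "1 96 1",
  "2 105 1", "2 108 1",
  "3 133 3")

val vocabSize = lines(1).trim.toInt  // second header line = vocabulary size

// Drop the three header lines, parse each "docId wordId count" triple,
// group by document, and collect the (wordId, count) pairs per document.
val docs = lines
  .drop(3)
  .map { line =>
    val Array(doc, word, cnt) = line.trim.split("\\s+")
    (doc.toInt, (word.toInt, cnt.toDouble))
  }
  .groupBy(_._1)
  .toSeq
  .sortBy(_._1)
  .map { case (_, pairs) => pairs.map(_._2) }

// Each element of `docs` is the Seq you would pass to
// Vectors.sparse(vocabSize, seq) before feeding the array to PCA.
```

On an RDD you would use groupByKey() in place of groupBy and then map each group to Vectors.sparse(vocabSize, pairs.toSeq).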
Henry

On Mon, Apr 3, 2017 at 5:05 AM, Old-School <giorgos_myrianth...@outlook.com> wrote:
> I have a dataset that contains DocID, WordID and frequency (count) as shown
> below. Note that the first three numbers represent 1. the number of
> documents, 2. the number of words in the vocabulary and 3. the total number
> of words in the collection.
>
> 189
> 1430
> 12300
> 1 2 1
> 1 39 1
> 1 42 3
> 1 77 1
> 1 95 1
> 1 96 1
> 2 105 1
> 2 108 1
> 3 133 3
>
> What I want to do is to read the data (ignoring the first three lines),
> combine the words per document and finally represent each document as a
> vector that contains the frequencies of the wordIDs.
>
> Based on the above dataset, the representations of documents 1, 2 and 3
> will be (note that vocab_size can be extracted from the second line of the
> data):
>
> val data = Array(
>   Vectors.sparse(vocab_size, Seq((2, 1.0), (39, 1.0), (42, 3.0), (77, 1.0),
>     (95, 1.0), (96, 1.0))),
>   Vectors.sparse(vocab_size, Seq((105, 1.0), (108, 1.0))),
>   Vectors.sparse(vocab_size, Seq((133, 3.0))))
>
> The problem is that I am not quite sure how to read the .txt.gz file as an
> RDD and create an Array of sparse vectors as described above. Please note
> that I actually want to pass the data array to the PCA transformer.
>
> --
> View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Read-file-and-represent-rows-as-Vectors-tp28562.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.

--
Paul Henry Tremblay
Robert Half Technology