Hi there,

I have several large files (500GB each) to transform into Parquet format
and write to HDFS. The problems I ran into are as follows:

1) At first, I tried to load all the records in a file, then used
"sc.parallelize(data)" to generate an RDD, and finally used
"saveAsNewAPIHadoopFile(...)" to write it to HDFS. However, because each file
is too large (500GB) to fit in memory, this did not work.
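
Concretely, the first attempt looked roughly like the sketch below. The paths
and names are made up, and TextOutputFormat is only a stand-in for the Parquet
output format, whose setup I have left out:

    import scala.io.Source
    import org.apache.hadoop.io.{NullWritable, Text}
    import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat
    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.SparkContext._   // pair-RDD implicits (needed on older Spark versions)

    object WholeFileToHdfs {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("whole-file-to-hdfs"))

        // This is the step that blows up: every record of the 500GB file is
        // pulled into the driver's memory before anything is parallelized.
        val data = Source.fromFile("/local/path/bigfile").getLines().toList

        sc.parallelize(data)
          .map(line => (NullWritable.get(), new Text(line)))   // key/value pair for the output format
          .saveAsNewAPIHadoopFile[TextOutputFormat[NullWritable, Text]]("hdfs:///data")
      }
    }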

2) Then, I tried to load a certain number of records at a time, but I had to
launch a lot of "saveAsNewAPIHadoopFile(...)" jobs, and the output directory
ended up with two levels:

    data/0/part0 --- part29
    data/1/part0 --- part29
    ......
And when I tried to access the "data" directory to process all the parts, I
did not know the directory hierarchy in advance.
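
For reference, the second attempt looked roughly like this (same placeholder
paths and stand-in output format); each batch is written to its own numbered
subdirectory, which is where the extra level comes from:

    import scala.io.Source
    import org.apache.hadoop.io.{NullWritable, Text}
    import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat
    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.SparkContext._

    object ChunkedToHdfs {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("chunked-to-hdfs"))
        val recordsPerBatch = 10000000   // illustrative batch size

        // Stream the local file and ship it to the cluster one batch at a time;
        // every batch goes to its own subdirectory under "data".
        Source.fromFile("/local/path/bigfile").getLines()
          .grouped(recordsPerBatch)
          .zipWithIndex
          .foreach { case (batch, i) =>
            sc.parallelize(batch)
              .map(line => (NullWritable.get(), new Text(line)))
              .saveAsNewAPIHadoopFile[TextOutputFormat[NullWritable, Text]](s"hdfs:///data/$i")
          }
      }
    }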

I do not know whether HDFS has a way to list the hierarchy of a directory.
If it does, my problem could be solved by using that information.
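
What I have in mind is something along these lines, assuming the Hadoop
FileSystem API can glob over the layout (I have not tried this, and the paths
are illustrative):

    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.fs.{FileSystem, Path}

    object ListPartFiles {
      def main(args: Array[String]): Unit = {
        // Connect to the default filesystem (HDFS) from the Hadoop configuration.
        val fs = FileSystem.get(new Configuration())

        // Match every part file under the two-level layout,
        // e.g. /data/0/part0 ... /data/1/part29.
        val parts = fs.globStatus(new Path("/data/*/part*")).map(_.getPath)
        parts.foreach(println)
      }
    }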

Another way is to generate all the files in one flat directory, like:

data/part0 ---- part10000

And then the API "newAPIHadoopFile" can read all of them.
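
For example, if "newAPIHadoopFile" accepts glob patterns, I would hope a
sketch like this could already read the two-level layout (TextInputFormat
again stands in for the real Parquet input format):

    import org.apache.hadoop.io.{LongWritable, Text}
    import org.apache.hadoop.mapreduce.lib.input.TextInputFormat
    import org.apache.spark.{SparkConf, SparkContext}

    object ReadAllParts {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("read-all-parts"))

        // The glob covers the two-level layout; "/data/part*" would cover a flat one.
        val rdd = sc.newAPIHadoopFile[LongWritable, Text, TextInputFormat]("hdfs:///data/*/part*")

        println(rdd.count())
      }
    }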

Any suggestions? Thanks very much.
 




