Hi there, I have several large files (about 500 GB each) that I need to transform into Parquet format and write to HDFS. The problems I ran into are as follows:
1) At first I tried to load all the records of a file into memory, build an RDD with "sc.parallelize(data)", and finally write it to HDFS with "saveAsNewAPIHadoopFile(...)". This did not work because each file (500 GB) is far too large to be held in memory.

2) Then I tried loading a certain number of records at a time, but that meant launching many separate "saveAsNewAPIHadoopFile(...)" jobs, and the output directory ended up with two levels:

  data/0/part0 --- part29
  data/1/part0 --- part29
  ......

When I later tried to access the "data" directory to process all the parts, I did not know the directory hierarchy. Does HDFS have a way to get the hierarchy of a directory (see the P.S. below for a sketch of what I mean)? If so, my problem could be solved by using that information. Another option would be to generate all the files in a flat directory, like:

  data/part0 ---- part10000

so that "newAPIHadoopFile" can read all of them in one call.

Any suggestions? Thanks very much.
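P.S. To clarify what I mean by "getting the hierarchy": I am hoping something along the lines of the sketch below exists (untested on my side; the namenode URI and the paths are just placeholders) so that I can walk the nested layout and collect every part file:

  import java.net.URI
  import org.apache.hadoop.conf.Configuration
  import org.apache.hadoop.fs.{FileSystem, Path}

  // Recursively list every file under the top-level "data" directory,
  // however deep the part files are nested.
  val fs = FileSystem.get(new URI("hdfs://namenode:8020"), new Configuration())
  val files = fs.listFiles(new Path("/user/me/data"), true)  // true = recursive
  while (files.hasNext) {
    println(files.next().getPath)  // e.g. /user/me/data/0/part0, /user/me/data/1/part29, ...
  }

Alternatively, if the input path string accepted by "newAPIHadoopFile" supports glob patterns such as data/*/part*, that alone would let me read the two-level layout without flattening it.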