I am trying to read a few hundred .parquet files from S3 into an EMR cluster.
The files are organized by date, and each folder contains a _common_metadata
file (as well as a _metadata file). The *sqlContext.parquetFile* operation
takes a very long time, opening each of the .parquet files for reading. I
would have expected the _metadata files to be used for the schema, so that
Spark does not have to go through every file in a folder. I also ran the
experiment on a single folder: all the .parquet files were opened, and the
_metadata file was apparently ignored.

What can I do to speed up the loading process? Can I load the .parquet files
in parallel? What is the purpose of the _metadata files?
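For what it's worth, one workaround I have been considering is kicking off the per-folder reads concurrently on the driver and unioning the results, so the footer-reading for different folders at least overlaps. This is only a sketch against the Spark 1.3 API (*parquetFile* and *unionAll*); the bucket and folder names are hypothetical, and it assumes all folders share the same schema:

```scala
import scala.concurrent.{Await, Future}
import scala.concurrent.duration.Duration
import scala.concurrent.ExecutionContext.Implicits.global

// Hypothetical date-partitioned folders on S3.
val folders = Seq(
  "s3n://my-bucket/events/2015-04-01",
  "s3n://my-bucket/events/2015-04-02"
)

// Start one parquetFile call per folder; the (slow) footer reads
// for different folders now run concurrently on the driver.
val futures = folders.map(path => Future { sqlContext.parquetFile(path) })
val dataFrames = futures.map(f => Await.result(f, Duration.Inf))

// unionAll (the Spark 1.3 name) requires identical schemas.
val all = dataFrames.reduce(_ unionAll _)
```

I don't know whether this helps when the bottleneck is S3 listing rather than footer parsing, so I would still like to understand why the summary files are not being used.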



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Loading-lots-of-parquet-files-in-Spark-1-3-1-Hadoop-2-4-tp22624.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.