Re: Parquet writing gets progressively slower

2015-07-26 Thread Cheng Lian
Actually no. In general, Spark SQL doesn't trust Parquet summary files, because it's not unusual for a job to fail to write them in the first place. For example, Hive never writes summary files for Parquet tables, because it uses NullOutputCommitter, which bypasses Parquet's own output committer (the component responsible for writing them).
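[To make the read side concrete, here is a minimal Spark 1.4-era sketch; the path is a placeholder. A directory written by Hive contains only part-files, with no _metadata or _common_metadata, and Spark SQL reads it anyway by discovering the schema from the part-file footers:]

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.SQLContext

    val sc = new SparkContext(new SparkConf().setAppName("read-no-summary"))
    val sqlContext = new SQLContext(sc)

    // "/warehouse/hive_table" is a hypothetical Hive-written directory:
    // part-files only, no summary files. Spark SQL still reads it,
    // recovering the schema from the Parquet footers of the part-files.
    val df = sqlContext.read.parquet("/warehouse/hive_table")
    df.printSchema()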

Re: Parquet writing gets progressively slower

2015-07-25 Thread Michael Kelly
Thanks for the suggestion Cheng, I will try that today. Are there any implications when reading the Parquet data if there are no summary files present?

Michael

On Sat, Jul 25, 2015 at 2:28 AM, Cheng Lian wrote:
> The time is probably spent by ParquetOutputFormat.commitJob. While
> committing a successful write job, Parquet writes a pair of summary
> files …

Re: Parquet writing gets progressively slower

2015-07-24 Thread Cheng Lian
The time is probably spent by ParquetOutputFormat.commitJob. While committing a successful write job, Parquet writes a pair of summary files (_common_metadata and _metadata), containing metadata like the schema, user-defined key-value metadata, and Parquet row group information. To gather all the necessary information, Parquet scans the footers of all the part-files in the output directory, which can get progressively slower as part-files accumulate.
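[If the summary files aren't needed, the usual workaround from that era was to disable them so commitJob skips the footer scan entirely; presumably this is the kind of suggestion referred to upthread. A minimal sketch assuming a Spark 1.4-style API, with placeholder paths; "parquet.enable.summary-metadata" is the parquet-hadoop setting behind ParquetOutputFormat's job-summary feature:]

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.SQLContext

    val sc = new SparkContext(new SparkConf().setAppName("parquet-no-summary"))
    val sqlContext = new SQLContext(sc)

    // Tell parquet-hadoop not to write _metadata/_common_metadata on commit,
    // so ParquetOutputFormat.commitJob no longer scans every part-file footer.
    sc.hadoopConfiguration.set("parquet.enable.summary-metadata", "false")

    // "/data/events" and "/data/events_parquet" are placeholder paths.
    sqlContext.read.json("/data/events").write.parquet("/data/events_parquet")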