Re: Parquet writing gets progressively slower

2015-07-26 Thread Cheng Lian
Actually, no. In general Spark SQL doesn't trust Parquet summary files, because it's not unusual for a job to fail to write them. For example, Hive never writes summary files for Parquet tables: it uses NullOutputCommitter, which bypasses Parquet's own output committer.
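
Since the reader gets everything it needs from the part-file footers, reading behaves the same with or without the _metadata/_common_metadata summary files. A minimal sketch (Scala, Spark 1.4-era API; the S3 path is hypothetical):

    // Schema and row group info are discovered from part-file footers,
    // not from _metadata/_common_metadata, so this works either way.
    val df = sqlContext.read.parquet("s3n://my-bucket/logs-parquet/")
    df.printSchema()
    df.count()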

Re: Parquet writing gets progressively slower

2015-07-25 Thread Michael Kelly
Thanks for the suggestion, Cheng, I will try that today. Are there any implications when reading the Parquet data if there are no summary files present?

Michael

On Sat, Jul 25, 2015 at 2:28 AM, Cheng Lian wrote:
> The time is probably spent by ParquetOutputFormat.commitJob. While
> committing a successful write job, Parquet writes a pair of summary
> files, …

Re: Parquet writing gets progressively slower

2015-07-24 Thread Cheng Lian
The time is probably spent by ParquetOutputFormat.commitJob. While committing a successful write job, Parquet writes a pair of summary files, containing metadata like the schema, user-defined key-value metadata, and Parquet row group information. To gather all the necessary information, Parquet scans the footers of all the part-files in the destination folder, so the commit gets slower as files accumulate there. You can avoid this by disabling summary files, i.e. setting the Hadoop configuration property parquet.enable.summary-metadata to false.
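
A sketch of one way to apply that setting from a Spark application (Scala; the property name is ParquetOutputFormat.ENABLE_JOB_SUMMARY in parquet-hadoop, and the write path is hypothetical):

    // Disable summary-file generation for all Parquet writes on this context;
    // commitJob then no longer scans every part-file footer.
    sc.hadoopConfiguration.set("parquet.enable.summary-metadata", "false")

    // Subsequent writes produce no _metadata/_common_metadata files.
    df.write.mode("append").parquet("s3n://my-bucket/logs-parquet/")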

Parquet writing gets progressively slower

2015-07-24 Thread Michael Kelly
Hi,

We are converting some CSV log files to Parquet, but the job gets progressively slower the more files we add to the Parquet folder. The Parquet files are being written to S3, we are using a Spark standalone cluster running on EC2, and the Spark version is 1.4.1. The Parquet files are partitioned…
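
For context, a minimal sketch of the kind of job described above (Scala, Spark 1.4 with the spark-csv data source; bucket names and paths are hypothetical):

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.SQLContext

    val sc = new SparkContext(new SparkConf().setAppName("csv-to-parquet"))
    val sqlContext = new SQLContext(sc)

    // Read one batch of CSV logs with the spark-csv package.
    val logs = sqlContext.read
      .format("com.databricks.spark.csv")
      .option("header", "true")
      .load("s3n://my-bucket/raw-logs/2015-07-24/")

    // Append into the shared Parquet folder; each run ends in
    // ParquetOutputFormat.commitJob, which is where the time goes.
    logs.write.mode("append").parquet("s3n://my-bucket/logs-parquet/")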