Actually no. In general, Spark SQL doesn't trust Parquet summary files,
because it's not unusual for a write job to fail to produce them at all.
For example, Hive never writes summary files for Parquet tables: it uses
NullOutputCommitter, which bypasses Parquet's own output committer.
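A minimal sketch (not from the thread), assuming Spark 1.4.x, an existing
SQLContext named sqlContext, and a placeholder S3 path: reading a Parquet
folder looks the same with or without the _metadata / _common_metadata
summary files, since Spark SQL rebuilds the schema from the part-file
footers.

    // Placeholder path; the folder may or may not contain
    // _metadata / _common_metadata -- Spark SQL reads the part-file
    // footers either way.
    val df = sqlContext.read.parquet("s3n://my-bucket/logs.parquet")
    df.printSchema()   // schema reconstructed from the footers
    println(df.count())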
Thanks for the suggestion, Cheng. I will try that today.
Are there any implications when reading the Parquet data if there are no
summary files present?
Michael
On Sat, Jul 25, 2015 at 2:28 AM, Cheng Lian wrote:
The time is probably spent by ParquetOutputFormat.commitJob. While
committing a successful write job, Parquet writes a pair of summary files
(_metadata and _common_metadata) containing metadata like the schema,
user-defined key-value metadata, and Parquet row group information. To
gather all the necessary information, Parquet scans the footers of all the
Parquet part-files in the destination folder, which is why the commit gets
slower as more files accumulate there.
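The rest of Cheng's message is cut off in the archive, and the exact
suggestion Michael refers to above isn't quoted, but a common way to avoid
this commit-time cost is to disable Parquet job summaries before writing.
A minimal sketch, assuming Spark 1.4.x with an existing SparkContext sc,
SQLContext sqlContext, and a DataFrame df; the Hadoop property
parquet.enable.summary-metadata is the switch parquet-mr's output committer
checks before writing _metadata / _common_metadata:

    // Tell ParquetOutputCommitter not to write summary files during commitJob.
    sc.hadoopConfiguration.set("parquet.enable.summary-metadata", "false")

    // Placeholder destination; with summaries disabled, commitJob no longer
    // needs to read every footer in the folder, so appends stop slowing down.
    df.write.mode("append").parquet("s3n://my-bucket/logs.parquet")

The trade-off is that readers can no longer pick up schema or key-value
metadata from a single summary file, which, per Cheng's reply at the top of
the thread, Spark SQL does not rely on anyway.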
Hi,
We are converting some CSV log files to Parquet, but the job is getting
progressively slower the more files we add to the Parquet folder.
The Parquet files are being written to S3. We are using a Spark standalone
cluster running on EC2, and the Spark version is 1.4.1. The Parquet files
are par
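Michael's message is cut off here, but the workflow described is presumably
a loop along the lines of the sketch below (paths and the line-parsing step
are placeholders, not from the thread). Each appended batch triggers its own
ParquetOutputFormat.commitJob over the same destination folder, and that
commit reads one footer per file already present, which matches the
progressive slowdown described above.

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.SQLContext

    val sc = new SparkContext(new SparkConf().setAppName("csv-to-parquet"))
    val sqlContext = new SQLContext(sc)
    import sqlContext.implicits._

    // Hypothetical input batches; one append per batch.
    val batches = Seq(
      "s3n://my-bucket/raw/2015-07-24.csv",
      "s3n://my-bucket/raw/2015-07-25.csv")

    for (path <- batches) {
      // Placeholder parsing: split each CSV line into (date, message).
      val df = sc.textFile(path)
        .map(_.split(","))
        .map(cols => (cols(0), cols(1)))
        .toDF("date", "message")

      // Every append commits against the same folder, so each commitJob
      // re-reads the footers of all previously written part-files.
      df.write.mode("append").parquet("s3n://my-bucket/logs.parquet")
    }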