Hi,
We are converting some CSV log files to Parquet, but the job gets progressively slower the more files we add to the Parquet folder. The Parquet files are written to S3, we are running a Spark standalone cluster on EC2, and the Spark version is 1.4.1.

The Parquet files are partitioned on two columns: first the date, then another column. We write the data one day at a time, and the final size of one day's data written out to Parquet is about 150GB. We coalesce the data before writing, and in total we write 615 partitions/files per day to S3. We use SaveMode.Append since we are always writing to the same directory. This is the command we use to write the data:

  df.coalesce(partitions).write.mode(SaveMode.Append).partitionBy("dt","outcome").parquet("s3n://root/parquet/dir/")

Writing the Parquet data to an empty directory completes almost immediately, whereas after 12 days' worth of data has been written, each Parquet write takes up to 20 minutes (and there are 4 writes per day).

Questions:

1. Is there a more efficient way to write the data? I'm guessing that the update to the Parquet metadata is the issue, and that it happens serially.

2. Is there a way to write the metadata in the partitioned folders, and would this speed things up? Would it have any implications for reading the data back in? (I've put a rough sketch of what I mean at the end of this mail.)

3. I came across DirectParquetOutputCommitter, but its source says it cannot be used with Append mode. Would it be useful here?

4. I came across this issue - https://issues.apache.org/jira/browse/SPARK-8125 - and the corresponding pull request - https://github.com/apache/spark/pull/7396 - but they look geared towards reading Parquet metadata in parallel rather than writing it. Is that the case?

Any help would be much appreciated,
Michael
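P.S. In case it clarifies question 2, here is a rough sketch of the workaround I have in mind: write each day's data directly into its own dt= directory (so the commit only has to touch that day's files) and, if I understand correctly, turn off the Parquet summary files. The day value is a placeholder, sc is assumed to be the SparkContext, and I haven't tested whether parquet.enable.summary-metadata is really the right knob, so treat this as an untested sketch rather than something we run today.

  import org.apache.spark.sql.SaveMode

  // Assumption: skip writing the _metadata/_common_metadata summary files,
  // which (if my reading is right) is what parquet.enable.summary-metadata controls.
  sc.hadoopConfiguration.set("parquet.enable.summary-metadata", "false")

  // Placeholder date for the day currently being loaded.
  val day = "2015-09-01"

  // Write straight into the dt= partition directory; since dt is now encoded
  // in the path itself, we only partitionBy the remaining column.
  df.coalesce(partitions)
    .write
    .mode(SaveMode.Append)
    .partitionBy("outcome")
    .parquet(s"s3n://root/parquet/dir/dt=$day/")

When reading, I assume we would still point at s3n://root/parquet/dir/ and let partition discovery pick up dt and outcome from the directory names, but I'm not sure whether this layout has other downsides.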