Hi,

We are converting some CSV log files to Parquet, but the job gets
progressively slower the more files we add to the Parquet directory.

The Parquet files are written to S3 from a Spark standalone cluster
running on EC2; the Spark version is 1.4.1. The Parquet output is
partitioned on two columns: first the date, then another column. We
write the data one day at a time, and the final size of one day's
data, once written out as Parquet, is about 150GB.

We coalesce the data before it is written out, and in total we write
615 partitions/files per day to S3. We use SaveMode.Append since we
are always writing to the same directory. This is the command we use
to write the data:

df.coalesce(partitions).write.mode(SaveMode.Append).partitionBy("dt","outcome").parquet("s3n://root/parquet/dir/")
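
For context, the surrounding job looks roughly like the sketch below.
The CSV loading (via the spark-csv package), the paths and the
SparkContext setup are illustrative; only the final write call is
exactly what we run.

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.{SQLContext, SaveMode}

val sc = new SparkContext(new SparkConf().setAppName("csv-to-parquet"))
val sqlContext = new SQLContext(sc)

// One day's CSV logs; df must contain the dt and outcome columns
// that we partition on (illustrative path, real schema omitted).
val df = sqlContext.read
  .format("com.databricks.spark.csv")
  .option("header", "true")
  .load("s3n://root/csv/one-day-of-logs/")

// 615 output files per day, appended under dt=/outcome= subfolders.
val partitions = 615
df.coalesce(partitions)
  .write
  .mode(SaveMode.Append)
  .partitionBy("dt", "outcome")
  .parquet("s3n://root/parquet/dir/")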

Writing the Parquet files to an empty directory completes almost
immediately, whereas after 12 days' worth of data has been written,
each Parquet write takes up to 20 minutes (and there are 4 writes per
day).

Questions:

- Is there a more efficient way to write the data? I'm guessing that
  the update to the Parquet metadata is the issue, and that it
  happens serially.
- Is there a way to write the metadata in the partitioned folders
  instead, and would this speed things up? (A sketch of what I have
  in mind follows this list.)
- Would that have any implications for reading the data back in?
- I came across DirectParquetOutputCommitter, but its source says it
  cannot be used with Append mode; would it be useful here?
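
To make the metadata question concrete, this is roughly what I had in
mind. parquet.enable.summary-metadata is Parquet's switch for the job
summary (_metadata / _common_metadata) files; whether Spark 1.4.1
honours it for partitioned appends, and whether those summary files
are really the bottleneck, is exactly what I'm unsure about.

// Guess: skip the Parquet job summary so that each append does not
// have to rewrite summary metadata covering the whole directory tree.
sc.hadoopConfiguration.set("parquet.enable.summary-metadata", "false")

df.coalesce(partitions)
  .write
  .mode(SaveMode.Append)
  .partitionBy("dt", "outcome")
  .parquet("s3n://root/parquet/dir/")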


I came across https://issues.apache.org/jira/browse/SPARK-8125 and
the corresponding pull request
https://github.com/apache/spark/pull/7396, but they look geared
towards reading Parquet metadata in parallel rather than writing it.
Is that the case?

Any help would be much appreciated,

Michael
