Thanks for the feedback, Matt!
Yes, we've also seen other feedback about slow Parquet summary file
generation, especially when appending a small dataset to an existing
large dataset. Disabling it is a reasonable workaround since the summary
files are no longer important after parquet-mr 1.7.
We're planning to turn it off by default in future versions.
Cheng
On 12/15/15 12:27 AM, Matt K wrote:
Thanks Cheng!
I'm running Spark 1.5. After setting the following, I'm no longer seeing
this issue:
sc.hadoopConfiguration.set("parquet.enable.summary-metadata", "false")
Thanks,
-Matt
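
For anyone applying the workaround above, here is a minimal sketch of how
it fits into an append job, assuming the Spark 1.5 Scala API. The app name,
sample data, column names and target path are hypothetical placeholders;
only the configuration key and the write chain come from this thread.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.{SQLContext, SaveMode}

object ParquetAppendSketch {
  def main(args: Array[String]): Unit = {
    // Hypothetical local setup, for illustration only.
    val conf = new SparkConf().setAppName("parquet-append-sketch").setMaster("local[*]")
    val sc = new SparkContext(conf)
    val sqlContext = new SQLContext(sc)
    import sqlContext.implicits._

    // From the thread: skip generating the Parquet summary files
    // (_metadata / _common_metadata) on every write.
    sc.hadoopConfiguration.set("parquet.enable.summary-metadata", "false")

    // Hypothetical data, partition column and output path.
    val df = sc.parallelize(Seq(("2015-12-11", 1), ("2015-12-12", 2))).toDF("day", "value")
    val partitionCols = Seq("day")
    val targetPath = "/tmp/parquet-append-sketch"

    // Same write chain as in the original question, appending to existing data.
    df.write
      .format("parquet")
      .partitionBy(partitionCols: _*)
      .mode(SaveMode.Append)
      .save(targetPath)

    sc.stop()
  }
}
```

The setting lives on the SparkContext's Hadoop configuration, so it should
be applied once before the first write rather than per batch.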
On Fri, Dec 11, 2015 at 1:58 AM, Cheng Lian wrote:
This is probably caused by schema merging. Were you using Spark
1.4 or earlier versions? Could you please try the following
snippet to see whether it helps:
    df.write
      .format("parquet")
      .option("mergeSchema", "false")
      .partitionBy(partitionCols: _*)
      .mode(saveMode)
      .save(targetPath)
In 1.5, we've disabled schema merging by default.
Cheng
On 12/11/15 5:33 AM, Matt K wrote:
Hi all,
I have a process that's continuously saving data as Parquet with
Spark. The bulk of the saving logic simply looks like this:
    df.write
      .format("parquet")
      .partitionBy(partitionCols: _*)
      .mode(saveMode)
      .save(targetPath)
After running for a day or so, my process ran out of memory. I
took a memory dump and saw that a single thread is holding 32,189
org.apache.parquet.hadoop.Footer objects, which in turn hold
ParquetMetadata. This is highly suspicious, since each thread
processes under 1GB of data at a time, and there are usually no
more than 10 files in a single batch (no small-file problem). So
there may be a memory leak somewhere in the saveAsParquet code path.
I've attached a screenshot from Eclipse MemoryAnalyzer showing
the above. Note the 32,189 references.
A shot in the dark, but is there a way to disable ParquetMetadata
file generation?
Thanks,
-Matt