Thanks for the feedback, Matt!

Yes, we've also seen other feedback about slow Parquet summary file generation, especially when appending a small dataset to an existing large dataset. Disabling it is a reasonable workaround since the summary files are no longer important after parquet-mr 1.7.

We're planning to turn it off by default in future versions.
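
For anyone who finds this thread later, here is a minimal sketch of the workaround in context. It assumes the Spark 1.5 DataFrame API; the helper names `disableParquetSummary` and `appendAsParquet` are made up for illustration and are not part of Spark. Only the `parquet.enable.summary-metadata` key and the write chain come from this thread.

```scala
import org.apache.spark.SparkContext
import org.apache.spark.sql.{DataFrame, SaveMode}

// Hypothetical helper: tell parquet-mr not to write summary files
// for Parquet output produced through this SparkContext.
def disableParquetSummary(sc: SparkContext): Unit =
  sc.hadoopConfiguration.set("parquet.enable.summary-metadata", "false")

// Hypothetical helper: same write chain as in the original question,
// with the save mode fixed to Append for the "small append" case.
def appendAsParquet(df: DataFrame, partitionCols: Seq[String], targetPath: String): Unit =
  df.write
    .format("parquet")
    .partitionBy(partitionCols: _*)
    .mode(SaveMode.Append)
    .save(targetPath)
```

Call `disableParquetSummary(sc)` once before the first write. The setting lives on the shared Hadoop configuration, so it should apply to later Parquet writes from the same context.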

Cheng

On 12/15/15 12:27 AM, Matt K wrote:
Thanks Cheng!

I'm running Spark 1.5. After setting the following, I'm no longer seeing this issue:

sc.hadoopConfiguration.set("parquet.enable.summary-metadata", "false")

Thanks,
-Matt

On Fri, Dec 11, 2015 at 1:58 AM, Cheng Lian wrote:

    This is probably caused by schema merging. Were you using Spark
    1.4 or earlier versions? Could you please try the following
    snippet to see whether it helps:

    df.write
      .format("parquet")
      .option("mergeSchema", "false")
      .partitionBy(partitionCols: _*)
      .mode(saveMode)
      .save(targetPath)

    In 1.5, we've disabled schema merging by default.

    Cheng


    On 12/11/15 5:33 AM, Matt K wrote:
    Hi all,

    I have a process that's continuously saving data as Parquet with
    Spark. The bulk of the saving logic simply looks like this:

        df.write
          .format("parquet")
          .partitionBy(partitionCols: _*)
          .mode(saveMode)
          .save(targetPath)

    After running for a day or so, my process ran out of memory. I
    took a heap dump and see that a single thread is holding 32,189
    org.apache.parquet.hadoop.Footer objects, which in turn hold
    ParquetMetadata. This is highly suspicious, since each thread
    processes under 1GB of data at a time, and there are usually no
    more than 10 files in a single batch (no small file problem). So
    there may be a memory leak somewhere in the saveAsParquet code path.

    I've attached a screen-shot from Eclipse MemoryAnalyzer showing
    the above. Note 32,189 references.

    A shot in the dark, but is there a way to disable ParquetMetadata
    file generation?

    Thanks,
    -Matt






