Re: memory leak when saving Parquet files in Spark

Matt K Mon, 14 Dec 2015 08:30:11 -0800

Thanks Cheng!

I'm running Spark 1.5. After setting the following, I no longer see the
issue:


sc.hadoopConfiguration.set("parquet.enable.summary-metadata", "false")
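
For anyone finding this thread later, here is a minimal sketch of where that setting would sit relative to the write from my original post. It assumes the Spark 1.5 API (SQLContext, DataFrameWriter). The object name, the helper `save`, the sample data, and the output path are all made up for illustration. My understanding, which I have not verified in the Parquet source, is that the setting stops the output committer from writing the _metadata / _common_metadata summary files, so it has no reason to collect every file footer at commit time.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.{DataFrame, SaveMode, SQLContext}

// Hypothetical wrapper object; the name is illustrative only.
object ParquetWriteWithoutSummary {

  // Same write as in the original post, wrapped in a helper (assumed name).
  def save(df: DataFrame,
           partitionCols: Seq[String],
           saveMode: SaveMode,
           targetPath: String): Unit =
    df.write
      .format("parquet")
      .partitionBy(partitionCols: _*)
      .mode(saveMode)
      .save(targetPath)

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("parquet-no-summary")
      .setMaster("local[*]") // local master, for the sketch only
    val sc = new SparkContext(conf)

    // Set once, before any Parquet write is issued.
    sc.hadoopConfiguration.set("parquet.enable.summary-metadata", "false")

    val sqlContext = new SQLContext(sc)
    import sqlContext.implicits._

    // Made-up sample data standing in for a real batch.
    val df = sc.parallelize(Seq(("2015-12-14", 1), ("2015-12-14", 2)))
      .toDF("day", "value")

    // Assumed output path; replace with the real target.
    save(df, Seq("day"), SaveMode.Append, "/tmp/parquet-no-summary-demo")

    sc.stop()
  }
}
```

Since the setting lives on the SparkContext's Hadoop configuration, it should apply to every later write from that context, so a long-running process only needs to set it once at startup.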

Thanks,
-Matt

On Fri, Dec 11, 2015 at 1:58 AM, Cheng Lian <[email protected]> wrote:

> This is probably caused by schema merging. Were you using Spark 1.4 or
> earlier versions? Could you please try the following snippet to see whether
> it helps:
>
> df.write
>   .format("parquet")
>   .option("mergeSchema", "false")
>   .partitionBy(partitionCols: _*)
>   .mode(saveMode)
>   .save(targetPath)
>
> In 1.5, we've disabled schema merging by default.
>
> Cheng
>
>
> On 12/11/15 5:33 AM, Matt K wrote:
>
> Hi all,
>
> I have a process that's continuously saving data as Parquet with Spark.
> The bulk of the saving logic simply looks like this:
>
> df.write
>   .format("parquet")
>   .partitionBy(partitionCols: _*)
>   .mode(saveMode)
>   .save(targetPath)
>
> After running for a day or so, my process ran out of memory. I took a
> heap dump and see that a single thread is holding 32,189
> org.apache.parquet.hadoop.Footer objects, which in turn hold
> ParquetMetadata. This is highly suspicious, since each thread processes
> under 1GB of data at a time, and there are usually no more than 10 files in
> a single batch (so no small-files problem). So there may be a memory leak
> somewhere in the saveAsParquet code path.
>
> I've attached a screenshot from Eclipse Memory Analyzer showing the above.
> Note the 32,189 references.
>
> A shot in the dark, but is there a way to disable ParquetMetadata file
> generation?
>
> Thanks,
> -Matt
>
>


-- 
www.calcmachine.com - easy online calculator.
