[jira] [Commented] (FLINK-19005) used metaspace grow on every execution

Chesnay Schepler (Jira) Mon, 24 Aug 2020 15:24:46 -0700


    [ 
https://issues.apache.org/jira/browse/FLINK-19005?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17183623#comment-17183623
 ]


Chesnay Schepler commented on FLINK-19005:
------------------------------------------

My conclusion is that Flink is not leaking anything, and the errors are due to 
unfortunate timings or some JDK issue.

I was able to reproduce the issue when submitting jobs directly after one 
another or with 5 seconds in between, but after increasing the backoff to 1 
minute the OOM no longer occurred. The GC stats also showed that the Metaspace 
usage did not continuously increase; each GC produced a distinct dip, and these 
dips frequently matched or even undercut prior dips.
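
For anyone who wants to watch the same dips without a GC log, here is a minimal 
sketch that polls the Metaspace pool through the standard JMX memory beans. It 
assumes a HotSpot JVM, where the pool is named "Metaspace"; the class name 
{{MetaspaceProbe}} is made up for illustration and is not part of Flink.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryPoolMXBean;
import java.lang.management.MemoryUsage;

// Hypothetical helper, not part of Flink.
final class MetaspaceProbe {

    /** Returns the used Metaspace in bytes, or -1 if no pool named "Metaspace" exists. */
    static long usedMetaspaceBytes() {
        for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
            // Assumption: HotSpot naming; other JVMs may use a different pool name.
            if ("Metaspace".equals(pool.getName())) {
                MemoryUsage usage = pool.getUsage();
                return usage == null ? -1L : usage.getUsed();
            }
        }
        return -1L;
    }

    public static void main(String[] args) throws InterruptedException {
        for (int i = 0; i < 3; i++) {
            System.gc(); // only a hint; the JVM may ignore it
            System.out.printf("used metaspace: %d KiB%n", usedMetaspaceBytes() / 1024);
            Thread.sleep(1000);
        }
    }
}
```

Sampling this between job submissions should show whether the usage after each 
collection keeps climbing or falls back to roughly the same level.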

[Stephan's 
comment|https://issues.apache.org/jira/browse/FLINK-16408?focusedCommentId=17180577&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-17180577]
 appears to apply here, at the very least for all the mentioned cases where 
Wordcounts are frequently run.

As for the original issue by [~gestevez], this looks like a clear case of 
classloaders being leaked. There are (at least) a bunch of 
{{oracle.jdbc.driver.BlockSource.ThreadedCachingBlockSource.BlockReleaser}} 
threads hanging around, which prevents the classloader from being garbage 
collected. So technically, this is a thread leak that is either inherent to 
this library or caused by improper usage of it.
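
To illustrate how such threads pin a classloader, the following sketch lists 
live threads that still reference a given classloader, either through their 
own class or through their context classloader. It is only a rough diagnostic 
under stated assumptions: {{LingeringThreads}} and {{pinning}} are hypothetical 
names, and how the user-code classloader reference is obtained is left open.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical diagnostic helper, not part of Flink or the Oracle driver.
final class LingeringThreads {

    /** Returns live threads whose class or context classloader is the given classloader. */
    static List<Thread> pinning(ClassLoader suspect) {
        List<Thread> hits = new ArrayList<>();
        // getAllStackTraces() returns a snapshot keyed by every live thread.
        for (Thread t : Thread.getAllStackTraces().keySet()) {
            boolean loadedBySuspect = t.getClass().getClassLoader() == suspect;
            boolean contextIsSuspect = t.getContextClassLoader() == suspect;
            if (loadedBySuspect || contextIsSuspect) {
                hits.add(t);
            }
        }
        return hits;
    }
}
```

A thread reported here keeps the classloader, and with it all classes it 
loaded, reachable until the thread terminates. Other references (static 
fields, thread locals) can have the same effect and are not covered by this 
check.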

> used metaspace grow on every execution
> --------------------------------------
>
>                 Key: FLINK-19005
>                 URL: https://issues.apache.org/jira/browse/FLINK-19005
>             Project: Flink
>          Issue Type: Bug
>          Components: Client / Job Submission, Runtime / Configuration, 
> Runtime / Coordination
>    Affects Versions: 1.11.1
>            Reporter: Guillermo Sánchez
>            Assignee: Chesnay Schepler
>            Priority: Major
>         Attachments: heap_dump_after_10_executions.zip, 
> heap_dump_after_1_execution.zip
>
>
> Hi!
> I'm running a Flink 1.11.1 cluster, where I execute batch jobs written with 
> the DataSet API.
> I submit these jobs every day to calculate daily data.
> On every execution, the cluster's used metaspace increases by 7 MB and is 
> never released.
> This ends in an OutOfMemoryError caused by Metaspace every 15 days, and I 
> need to restart the cluster to clear the metaspace.
> taskmanager.memory.jvm-metaspace.size is set to 512mb.
> Any idea what could be causing this metaspace growth and why it is not 
> released?
>  
> ================================================
> === Summary ======================================
> ================================================
> Case 1, reported by [~gestevez]:
> * Flink 1.11.1
> * Java 11
> * Maximum Metaspace size set to 512mb
> * Custom Batch job, submitted daily
> * Requires restart every 15 days after an OOM
> Case 2, reported by [~Echo Lee]:
> * Flink 1.11.0
> * Java 11
> * G1GC
> * WordCount Batch job, submitted every second / every 5 minutes
> * eventually fails TaskExecutor with OOM
> Case 3, reported by [~DaDaShen]:
> * Flink 1.11.0
> * Java 11
> * WordCount Batch job, submitted every 5 seconds
> * growing Metaspace, eventually OOM
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
