[ https://issues.apache.org/jira/browse/FLINK-19005?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17183623#comment-17183623 ]
Chesnay Schepler commented on FLINK-19005:
------------------------------------------

My conclusion is that Flink is not leaking anything, and the errors are due to unfortunate timings or some JDK issue.

I was able to reproduce the issue when submitting jobs directly one after another or with 5 seconds in between, but after increasing the backoff to 1 minute the OOM no longer occurred.
The GC stats also showed that the Metaspace usage did not continuously increase; the GC created distinct dips that frequently managed to match or even undercut prior dips.

[Stephan's comment|https://issues.apache.org/jira/browse/FLINK-16408?focusedCommentId=17180577&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-17180577] appears to apply here, at the very least for all the mentioned cases where WordCount jobs are frequently run.

As for the original issue by [~gestevez], this looks like a clear case of classloaders being leaked. There are (at least) a bunch of {{oracle.jdbc.driver.BlockSource.ThreadedCachingBlockSource.BlockReleaser}} threads hanging around, preventing the garbage collection of those classloaders. So technically, this is a thread leak inherent to this library or caused by improper usage.

> used metaspace grow on every execution
> --------------------------------------
>
>                 Key: FLINK-19005
>                 URL: https://issues.apache.org/jira/browse/FLINK-19005
>             Project: Flink
>          Issue Type: Bug
>          Components: Client / Job Submission, Runtime / Configuration,
>                      Runtime / Coordination
>   Affects Versions: 1.11.1
>            Reporter: Guillermo Sánchez
>            Assignee: Chesnay Schepler
>            Priority: Major
>         Attachments: heap_dump_after_10_executions.zip,
>                      heap_dump_after_1_execution.zip
>
>
> Hi !
> Im running a 1.11.1 flink cluster, where I execute batch jobs made with
> DataSet API.
> I submit these jobs every day to calculate daily data.
> In every execution, cluster's used metaspace increase by 7MB and its never
> released.
> This ends up with an OutOfMemoryError caused by Metaspace every 15 days and i
> need to restart the cluster to clean the metaspace
> taskmanager.memory.jvm-metaspace.size is set to 512mb
> Any idea of what could be causing this metaspace grow and why is it not
> released ?
>
> ================================================
> === Summary ======================================
> ================================================
> Case 1, reported by [~gestevez]:
> * Flink 1.11.1
> * Java 11
> * Maximum Metaspace size set to 512mb
> * Custom Batch job, submitted daily
> * Requires restart every 15 days after an OOM
> Case 2, reported by [~Echo Lee]:
> * Flink 1.11.0
> * Java 11
> * G1GC
> * WordCount Batch job, submitted every second / every 5 minutes
> * eventually fails TaskExecutor with OOM
> Case 3, reported by [~DaDaShen]
> * Flink 1.11.0
> * Java 11
> * WordCount Batch job, submitted every 5 seconds
> * growing Metaspace, eventually OOM



--
This message was sent by Atlassian Jira
(v8.3.4#803005)