[ https://issues.apache.org/jira/browse/FLINK-10317?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16725870#comment-16725870 ]
Nawaid Shamim commented on FLINK-10317:
---------------------------------------

I guess the root cause is a memory leak due to dynamic loading. Limiting Metaspace to a fixed size or throwing more memory at it would simply delay the OOM. Limiting Metaspace still causes an OutOfMemoryError: Metaspace exception, but in this case the task manager dies instead of YARN killing it.

I was able to reproduce the above issue in a relatively small setup - One Master and One Core.
* Start 1 Job Manager (JM).
* Start 2 Task Managers - TM1 and TM2.
* Submit a job with a global parallelism value of two so that the job is scheduled on both TMs.
* Wait for the job to take its first checkpoint.
* Every 4 minutes:
** Take a heap dump of JM, TM1, TM2.
** Restart the TM2 process.

On every restart, TM2's JVM / YARN container is restarted. The JM issues restart and restore RPCs. TM2 is a new process, while TM1 is the old process and will reload duplicate classes (that's where Metaspace is exploding). I think it has something to do with org.apache.flink.runtime.execution.librarycache.FlinkUserCodeClassLoaders$ParentFirstClassLoader#2

> Configure Metaspace size by default
> -----------------------------------
>
>                 Key: FLINK-10317
>                 URL: https://issues.apache.org/jira/browse/FLINK-10317
>             Project: Flink
>          Issue Type: Bug
>          Components: Startup Shell Scripts
>   Affects Versions: 1.5.3, 1.6.0, 1.7.0
>            Reporter: Stephan Ewen
>            Assignee: vinoyang
>            Priority: Major
>             Fix For: 1.6.4, 1.7.2, 1.8.0
>
>         Attachments: Screenshot 2018-12-18 at 12.14.11.png
>
>
> We should set the size of the JVM Metaspace to a sane default, like
> {{-XX:MaxMetaspaceSize=256m}}.
> If not set, the JVM offheap memory will grow indefinitely with repeated
> classloading and Jitting, eventually exceeding allowed memory on docker/yarn
> or similar setups.
> It is hard to come up with a good default, however, I believe the error
> messages one gets when metaspace is too small are easy to understand (and
> easy to take action), while it is very hard to figure out why the memory
> footprint keeps growing steadily and infinitely.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
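
For anyone repeating the reproduction steps above, Metaspace growth can also be sampled from inside the JVM instead of relying only on heap dumps. The following is a minimal sketch, not part of Flink: the class name `MetaspaceProbe` and its helper methods are hypothetical, and it assumes a HotSpot JVM, which exposes a memory pool named "Metaspace" through the standard `java.lang.management` API. A reported maximum of -1 means no limit is defined, which is the situation this issue describes; with a limit such as `-XX:MaxMetaspaceSize=256m` the pool should report that limit as its maximum.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryPoolMXBean;
import java.lang.management.MemoryUsage;
import java.util.Optional;

// Hypothetical helper for illustration; not a Flink class.
public class MetaspaceProbe {

    /** Finds the memory pool named "Metaspace" (HotSpot naming), if this JVM exposes one. */
    public static Optional<MemoryUsage> metaspaceUsage() {
        for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
            if ("Metaspace".equals(pool.getName())) {
                // getUsage() may return null if the pool is no longer valid.
                return Optional.ofNullable(pool.getUsage());
            }
        }
        return Optional.empty();
    }

    /** Formats one sample; a max of -1 means no limit is defined. */
    public static String describe(MemoryUsage usage) {
        String max = usage.getMax() < 0
                ? "unlimited"
                : (usage.getMax() / (1024 * 1024)) + " MiB";
        return String.format("metaspace used=%d KiB committed=%d KiB max=%s",
                usage.getUsed() / 1024, usage.getCommitted() / 1024, max);
    }

    public static void main(String[] args) {
        // Print a single sample; call this periodically to watch the trend.
        System.out.println(metaspaceUsage()
                .map(MetaspaceProbe::describe)
                .orElse("no pool named Metaspace on this JVM"));
    }
}
```

Logging one such sample after each TM2 restart would show whether the used size on TM1 keeps climbing between restarts, which is what the duplicate class loading described in the comment would predict.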