[ 
https://issues.apache.org/jira/browse/FLINK-10317?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16725870#comment-16725870
 ] 

Nawaid Shamim commented on FLINK-10317:
---------------------------------------

I suspect the root cause is a memory leak caused by dynamic class loading. 
Capping Metaspace at a fixed size, or giving it more memory, would only delay 
the OOM. With a limit in place the job still fails with "OutOfMemoryError: 
Metaspace", but in that case the task manager dies on its own instead of being 
killed by YARN.

I was able to reproduce the issue on a relatively small setup with one 
Master and one Core: 
* Start 1 Job Manager (JM).
* Start 2 Task Managers - TM1 and TM2. 
* Submit a job with a global parallelism of two so that it is scheduled 
on both TMs. 
* Wait for the job to take its first checkpoint.
* Every 4 minutes:
** Take heap dumps of JM, TM1 and TM2. 
** Restart the TM2 process. 

On every restart, TM2's JVM / YARN container is restarted and the JM issues 
restart and restore RPCs. TM2 is then a new process, while TM1 is still the old 
process and reloads classes it has already loaded (that is where Metaspace 
is exploding). I think it has something to do with 
org.apache.flink.runtime.execution.librarycache.FlinkUserCodeClassLoaders$ParentFirstClassLoader#2
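
To illustrate the mechanism I suspect, here is a minimal, self-contained sketch. 
It is not Flink code: DuplicateClassSketch and its methods are hypothetical 
names, and java.lang.reflect.Proxy with a plain URLClassLoader only stands in 
for user code being loaded through a new class loader on each job restart. It 
shows that the JVM treats the same class defined through N different loaders 
as N separate classes, each with its own Metaspace footprint, which can only be 
reclaimed once the owning loader becomes unreachable.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryPoolMXBean;
import java.lang.management.MemoryUsage;
import java.lang.reflect.InvocationHandler;
import java.lang.reflect.Proxy;
import java.net.URL;
import java.net.URLClassLoader;
import java.util.HashSet;
import java.util.Set;

// Hypothetical demo class; nothing here is taken from the Flink code base.
class DuplicateClassSketch {

    // Bytes currently used by the HotSpot "Metaspace" memory pool,
    // or -1 if this JVM does not expose a pool with that name.
    static long metaspaceUsedBytes() {
        for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
            MemoryUsage usage = pool.getUsage();
            if (usage != null && pool.getName().contains("Metaspace")) {
                return usage.getUsed();
            }
        }
        return -1L;
    }

    // Simulates `restarts` job restarts inside one long-lived JVM (like TM1).
    // Each restart defines a generated class for Runnable in a class loader
    // and the method returns how many distinct classes the JVM ended up with.
    static int distinctClasses(int restarts, boolean freshLoaderPerRestart) {
        InvocationHandler noOp = (proxy, method, args) -> null;
        ClassLoader shared = new URLClassLoader(new URL[0]);
        // The set keeps every class (and therefore its loader) reachable,
        // which is what a leak would do in a real process.
        Set<Class<?>> classes = new HashSet<>();
        for (int i = 0; i < restarts; i++) {
            ClassLoader loader =
                    freshLoaderPerRestart ? new URLClassLoader(new URL[0]) : shared;
            Object task =
                    Proxy.newProxyInstance(loader, new Class<?>[] {Runnable.class}, noOp);
            classes.add(task.getClass());
        }
        return classes.size();
    }

    public static void main(String[] args) {
        long before = metaspaceUsedBytes();
        System.out.println("fresh loader per restart: "
                + distinctClasses(500, true) + " distinct classes");
        System.out.println("one shared loader: "
                + distinctClasses(500, false) + " distinct classes");
        System.out.println("Metaspace bytes used before/after: "
                + before + " / " + metaspaceUsedBytes());
    }
}
```

If something in TM1 keeps references to the class loaders of earlier job 
attempts, every restart behaves like the "fresh loader" case above and 
Metaspace can only grow. Whether that is what actually happens here still 
needs to be confirmed from the heap dumps.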






> Configure Metaspace size by default
> -----------------------------------
>
>                 Key: FLINK-10317
>                 URL: https://issues.apache.org/jira/browse/FLINK-10317
>             Project: Flink
>          Issue Type: Bug
>          Components: Startup Shell Scripts
>    Affects Versions: 1.5.3, 1.6.0, 1.7.0
>            Reporter: Stephan Ewen
>            Assignee: vinoyang
>            Priority: Major
>             Fix For: 1.6.4, 1.7.2, 1.8.0
>
>         Attachments: Screenshot 2018-12-18 at 12.14.11.png
>
>
> We should set the size of the JVM Metaspace to a sane default, like  
> {{-XX:MaxMetaspaceSize=256m}}.
> If not set, the JVM offheap memory will grow indefinitely with repeated 
> classloading and Jitting, eventually exceeding allowed memory on docker/yarn 
> or similar setups.
> It is hard to come up with a good default, however, I believe the error 
> messages one gets when metaspace is too small are easy to understand (and 
> easy to take action), while it is very hard to figure out why the memory 
> footprint keeps growing steadily and infinitely.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
