[jira] [Commented] (FLINK-11205) Task Manager Metaspace Memory Leak

Guowei Ma (Jira) Sun, 01 Mar 2020 23:46:20 -0800


    [ 
https://issues.apache.org/jira/browse/FLINK-11205?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17048845#comment-17048845
 ]


Guowei Ma commented on FLINK-11205:
-----------------------------------

[~fwiffo] I have a question about the claim that LogFactory caching the class 
loader leads to the class leak.

As far as I know, Flink does not use Apache Commons Logging, so I assume that 
the Apache Commons Logging jar comes from the application. For a failover that 
restarts only a job:
 # If the Apache Commons Logging jar and the user jar are both loaded by the 
system class loader, I think there might be no class leak, because every class 
is loaded by the system class loader. (Only the user class loader object itself 
is leaked.)
 # If the Apache Commons Logging jar and the user jar are both loaded by the 
user class loader, I think there might also be no class leak. The GC would 
release all the classes.
 # If Apache Commons Logging is loaded by the system class loader and the user 
jar is loaded by the user class loader, I think there might be class leaks if 
we do not call LogFactory.release when closing.

 Do you mean the third scenario? Why do you not choose one of the other two 
scenarios? Correct me if I have misunderstood something.
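
To make the third scenario concrete, here is a minimal sketch in plain Java. It 
does not use Apache Commons Logging: FACTORIES, getFactory, release and 
isPinned are hypothetical stand-ins for a cache that lives in a class loaded by 
the system class loader and is keyed by the class loader it was asked about. It 
only illustrates why such an entry would keep a user class loader (and the 
classes it loaded) strongly reachable until the entry is removed; the real 
LogFactory may behave differently.

```java
import java.net.URL;
import java.net.URLClassLoader;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class LoaderCacheSketch {
    // Hypothetical cache owned by a class that the system class loader loaded.
    // Each key is a strong reference to a class loader.
    static final Map<ClassLoader, Object> FACTORIES = new ConcurrentHashMap<>();

    // Look up (or create) the per-loader entry, as a caching factory might.
    static Object getFactory(ClassLoader loader) {
        return FACTORIES.computeIfAbsent(loader, l -> new Object());
    }

    // Drop the entry so the cache no longer references the loader.
    static void release(ClassLoader loader) {
        FACTORIES.remove(loader);
    }

    // True while the cache still holds a strong reference to the loader.
    static boolean isPinned(ClassLoader loader) {
        return FACTORIES.containsKey(loader);
    }

    public static void main(String[] args) throws Exception {
        // Stand-in for the user class loader created for one job attempt.
        try (URLClassLoader userLoader = new URLClassLoader(
                new URL[0], LoaderCacheSketch.class.getClassLoader())) {
            getFactory(userLoader);
            System.out.println("pinned after lookup: " + isPinned(userLoader));

            // Without this call the entry would outlive the job attempt.
            release(userLoader);
            System.out.println("pinned after release: " + isPinned(userLoader));
        }
    }
}
```

In scenarios 1 and 2 the cache and the user classes share one class loader, so 
no reference crosses from a long-lived loader to a short-lived one.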

> Task Manager Metaspace Memory Leak 
> -----------------------------------
>
>                 Key: FLINK-11205
>                 URL: https://issues.apache.org/jira/browse/FLINK-11205
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / Coordination
>    Affects Versions: 1.5.5, 1.6.2, 1.7.0
>            Reporter: NS
>            Priority: Critical
>         Attachments: Screenshot 2018-12-18 at 12.14.11.png, Screenshot 
> 2018-12-18 at 15.47.55.png
>
>
> Job restarts cause the task manager to dynamically load duplicate classes. 
> Metaspace is unbounded and grows with every restart. YARN aggressively kills 
> such containers, but the effect is immediately seen on a different task 
> manager, which results in a death spiral.
> The Task Manager uses dynamic class loading as described in 
> [https://ci.apache.org/projects/flink/flink-docs-stable/monitoring/debugging_classloading.html]
> {quote}
> *YARN*
> YARN classloading differs between single job deployments and sessions:
>  * When submitting a Flink job/application directly to YARN (via {{bin/flink 
> run -m yarn-cluster ...}}), dedicated TaskManagers and JobManagers are 
> started for that job. Those JVMs have both Flink framework classes and user 
> code classes in the Java classpath. That means that there is _no dynamic 
> classloading_ involved in that case.
>  * When starting a YARN session, the JobManagers and TaskManagers are started 
> with the Flink framework classes in the classpath. The classes from all jobs 
> that are submitted against the session are loaded dynamically.
> {quote}
> The above is not entirely true, especially when you set {{-yD 
> classloader.resolve-order=parent-first}}. We also observed the above 
> behaviour when submitting a Flink job/application directly to YARN (via 
> {{bin/flink run -m yarn-cluster ...}}).
> !Screenshot 2018-12-18 at 12.14.11.png!



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
