RE: metaspace out-of-memory & error while retrieving the leader gateway

Zhou, Brian Mon, 21 Sep 2020 19:58:47 -0700

Hi Xintong and Claude,

In our internal tests, we also encountered these two issues and spent a lot of 
time debugging them. There are two points I would like to confirm to see 
whether we share the same problem.


  1.  Your job is using the default restart strategy, which restarts every second.
  2.  The CPU resources on your jobmanager might be small.

Here are some findings I want to share.
## Metaspace OOM
Due to https://issues.apache.org/jira/browse/FLINK-15467 , when a job 
restarts, some threads from the sourceFunction can be left hanging, which 
prevents the class loader from being closed. Each new restart loads new 
classes and expands the metaspace, until OOM finally happens.
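
The accumulation described above can be illustrated with a rough back-of-the-envelope model. This is only a sketch: the function name and the footprint numbers (60 MB of base classes, 20 MB per job attempt) are assumptions chosen for illustration, not measurements from any real job. It also suggests why raising `taskmanager.memory.jvm-metaspace.size` would only delay the error if old class loaders are never released.

```python
def restarts_until_metaspace_oom(limit_mb, base_mb, per_restart_mb):
    """Count how many restarts fit before metaspace is exhausted.

    Assumes every restart loads `per_restart_mb` of new class metadata and
    that nothing from earlier attempts is ever unloaded (the leak case).
    """
    if per_restart_mb <= 0:
        raise ValueError("per_restart_mb must be positive")
    used_mb = base_mb
    restarts = 0
    while used_mb + per_restart_mb <= limit_mb:
        used_mb += per_restart_mb
        restarts += 1
    return restarts


# Hypothetical numbers: 60 MB of framework classes, 20 MB per job attempt.
for limit_mb in (256, 512):
    n = restarts_until_metaspace_oom(limit_mb, base_mb=60, per_restart_mb=20)
    # With one restart per second, n is also roughly the seconds until OOM.
    print(f"limit={limit_mb} MB -> metaspace full after about {n} restarts")
```

Under these assumed numbers, doubling the limit roughly doubles the number of restarts that fit, but the error still arrives.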

## Leader retrieving
Constant restarts may be heavy for the jobmanager. If the JM does not have 
enough CPU resources, the thread for leader retrieving may get stuck.

Best Regards,
Brian

From: Xintong Song <tonysong...@gmail.com>
Sent: Tuesday, September 22, 2020 10:16
To: Claude M; user
Subject: Re: metaspace out-of-memory & error while retrieving the leader gateway

## Metaspace OOM
As the error message already suggests, the metaspace OOM you encountered is 
likely caused by a class loading leak. I think you are on the right track 
trying to look into the heap dump to find out where the leak comes from. IIUC, 
after removing the ZK folder, you are now able to run Flink with the heap dump 
options.

The problem did not occur in previous versions because Flink only started 
setting a metaspace limit in the 1.10 release. The class loading leak might 
have already been there but was never discovered. Such a leak could lead to 
unpredictable stability and performance issues, which is why Flink updated its 
memory model and explicitly set the metaspace limit in the 1.10 release.

## Leader retrieving
The command looks good to me. If this problem happened only once, it could be 
unrelated to adding the options. If it does not block you from getting the 
heap dump, we can look into it later.


Thank you~

Xintong Song


On Mon, Sep 21, 2020 at 9:37 PM Claude M <claudemur...@gmail.com> wrote:
Hi Xintong,

Thanks for your reply.  Here is the command output w/ the java.opts:

/usr/local/openjdk-8/bin/java -Xms768m -Xmx768m -XX:+UseG1GC 
-XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/opt/flink/log 
-Dlog4j.configuration=file:/opt/flink/conf/log4j-console.properties 
-Dlogback.configurationFile=file:/opt/flink/conf/logback-console.xml -classpath 
/opt/flink/lib/flink-metrics-datadog-statsd-2.11-0.1.jar:/opt/flink/lib/flink-shaded-hadoop-2-uber-2.7.5-10.0.jar:/opt/flink/lib/flink-table-blink_2.11-1.10.2.jar:/opt/flink/lib/flink-table_2.11-1.10.2.jar:/opt/flink/lib/log4j-1.2.17.jar:/opt/flink/lib/slf4j-log4j12-1.7.15.jar:/opt/flink/lib/flink-dist_2.11-1.10.2.jar::/etc/hadoop/conf:
 org.apache.flink.runtime.entrypoint.StandaloneSessionClusterEntrypoint 
--configDir /opt/flink/conf --executionMode cluster

To answer your questions:

  *   Correct, in order for the pod to start up, I have to remove the flink app 
folder from zookeeper.  I only have to delete it once after applying the 
java.opts arguments.  It doesn't make sense, though, that I should have to do 
this just from adding a parameter.
  *   I'm using the standalone deployment.
  *   I'm using job cluster mode.

A higher priority issue I'm trying to solve is the metaspace out-of-memory 
error that is occurring in the task managers.  This was not happening before I 
upgraded to Flink 1.10.2.  Even after increasing the memory, I'm still 
encountering the problem.  That is when I added the java.opts argument to see 
if I could get more information about the problem, and that is when I ran 
across the second issue w/ the job manager pod not starting up.


Thanks


On Sun, Sep 20, 2020 at 10:23 PM Xintong Song <tonysong...@gmail.com> wrote:
Hi Claude,

IIUC, in your case the leader retrieving problem is triggered by adding the 
`java.opts`? Then could you try to find and post the complete command for 
launching the JVM process? You can try logging into the pod and executing 
`ps -ef | grep <PID>`.
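
Once the full command line is available, checking it for the expected options can be scripted. The following is a minimal sketch, assuming the `ps` output has been copied into a string. The helper names `jvm_flags` and `has_heap_dump_options` are made up for illustration, and the sample command line is a shortened example rather than real output.

```python
import shlex


def jvm_flags(command_line):
    """Return the JVM options (-X..., -XX:..., -D...) found in a command line."""
    tokens = shlex.split(command_line)
    return [t for t in tokens if t.startswith(("-X", "-D"))]


def has_heap_dump_options(command_line):
    """True if both heap dump options are present on the command line."""
    flags = jvm_flags(command_line)
    return "-XX:+HeapDumpOnOutOfMemoryError" in flags and any(
        f.startswith("-XX:HeapDumpPath=") for f in flags
    )


# Shortened, illustrative command line (classpath reduced to a single jar).
sample = (
    "/usr/local/openjdk-8/bin/java -Xms768m -Xmx768m -XX:+UseG1GC "
    "-XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/opt/flink/log "
    "-classpath /opt/flink/lib/flink-dist_2.11-1.10.2.jar "
    "org.apache.flink.runtime.entrypoint.StandaloneSessionClusterEntrypoint "
    "--configDir /opt/flink/conf"
)

for flag in jvm_flags(sample):
    print(flag)
print("heap dump options present:", has_heap_dump_options(sample))
```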

A few more questions:
- What do you mean by "resolve this"? Does the jobmanager pod get stuck there, 
and recover when you remove the folder from ZK? Do you have to do the removal 
every time you submit to Kubernetes?
> The only way I can resolve this is to delete the folder from zookeeper which I 
> shouldn't have to do.
- Which Flink Kubernetes deployment are you using? The standalone or the native 
Kubernetes one?
- Which cluster mode are you using? Job cluster, session cluster, or the 
application mode?


Thank you~

Xintong Song


On Sat, Sep 19, 2020 at 1:22 AM Claude M <claudemur...@gmail.com> wrote:
Hello,

I upgraded from Flink 1.7.2 to 1.10.2.  One of the jobs running on the task 
managers is periodically crashing w/ the following error:

java.lang.OutOfMemoryError: Metaspace. The metaspace out-of-memory error has 
occurred. This can mean two things: either the job requires a larger size of 
JVM metaspace to load classes or there is a class loading leak. In the first 
case 'taskmanager.memory.jvm-metaspace.size' configuration option should be 
increased. If the error persists (usually in cluster after several job 
(re-)submissions) then there is probably a class loading leak which has to be 
investigated and fixed. The task executor has to be shutdown.

I found this issue regarding it:
https://issues.apache.org/jira/browse/FLINK-16406

I have tried increasing taskmanager.memory.jvm-metaspace.size to 256M & 
512M and was still having the problem.

I then added the following to flink-conf.yaml to try to get more information 
about the error:
env.java.opts: -XX:+UseG1GC -XX:+HeapDumpOnOutOfMemoryError 
-XX:HeapDumpPath=/opt/flink/log
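
With these options, HotSpot writes a `.hprof` file when an OutOfMemoryError is thrown. When `HeapDumpPath` points to a directory, the file is named `java_pid<pid>.hprof`. A small sketch for locating the most recent dumps is shown here. The helper name `newest_heap_dumps` is hypothetical, and it assumes the script runs somewhere the dump directory is readable, for example inside the pod.

```python
from pathlib import Path


def newest_heap_dumps(dump_dir, limit=5):
    """Return up to `limit` .hprof files in `dump_dir`, newest first."""
    directory = Path(dump_dir)
    if not directory.is_dir():
        return []
    dumps = sorted(
        directory.glob("*.hprof"),
        key=lambda p: p.stat().st_mtime,
        reverse=True,
    )
    return dumps[:limit]


# Directory taken from the HeapDumpPath option above.
for dump in newest_heap_dumps("/opt/flink/log"):
    size_mb = dump.stat().st_size / (1024 * 1024)
    print(f"{dump.name}: {size_mb:.1f} MB")
```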

When I deployed the change to the Kubernetes cluster, the jobmanager pod 
failed to start up and the following message showed repeatedly:

2020-09-18 17:03:46,255 WARN  
org.apache.flink.runtime.webmonitor.retriever.impl.RpcGatewayRetriever  - Error 
while retrieving the leader gateway. Retrying to connect to 
akka.tcp://flink@flink-jobmanager:50010/user/dispatcher.

The only way I can resolve this is to delete the folder from zookeeper which I 
shouldn't have to do.

Any ideas on these issues?
