It was mentioned that this issue may be fixed in 1.10.3, but there is no 1.10.3 Docker image here: https://hub.docker.com/_/flink
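In the meantime, a custom image could be built with the apache/flink-docker tooling once a 1.10.3 distribution is available to download. A rough sketch follows; the add-custom.sh flags, the download URL, and the generated directory name are assumptions to verify against that repository's README:

# Clone the repo the official Docker Hub images are built from
git clone https://github.com/apache/flink-docker.git
cd flink-docker

# Generate a Dockerfile for a custom Flink distribution URL
# (flag names and output location are assumptions; check the README first)
./add-custom.sh -u https://archive.apache.org/dist/flink/flink-1.10.3/flink-1.10.3-bin-scala_2.11.tgz -n flink-1.10.3

# Build and tag the image from whatever directory the script generated under dev/
docker build -t <your-registry>/flink:1.10.3-custom dev/flink-1.10.3-debian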
On Wed, Sep 23, 2020 at 7:14 AM Claude M <claudemur...@gmail.com> wrote:

In regards to the metaspace memory issue, I was able to get a heap dump and the following is the output:

Problem Suspect 1
One instance of "java.lang.ref.Finalizer" loaded by "<system class loader>" occupies 4,112,624 (11.67%) bytes. The instance is referenced by sun.misc.Cleaner @ 0xb5d6b520, loaded by "<system class loader>". The memory is accumulated in one instance of "java.lang.Object[]" loaded by "<system class loader>".

Problem Suspect 2
33 instances of "org.apache.flink.util.ChildFirstClassLoader", loaded by "sun.misc.Launcher$AppClassLoader @ 0xb4068680", occupy 6,615,416 (18.76%) bytes.

Based on this, I'm not clear on what needs to be done to solve this.

On Tue, Sep 22, 2020 at 3:10 PM Claude M <claudemur...@gmail.com> wrote:

Thanks for your responses.
1. There were no job restarts prior to the metaspace OOM.
2. I tried increasing the CPU request and still encountered the problem. Any configuration change I make to the job manager, whether it's in the flink-conf.yaml or an increase to the pod's CPU/memory request, results in this problem.

On Tue, Sep 22, 2020 at 12:04 AM Xintong Song <tonysong...@gmail.com> wrote:

Thanks for the input, Brian.

This looks like what we are looking for. The issue is fixed in 1.10.3, which also matches the fact that this problem occurred in 1.10.2.

Maybe Claude can further confirm it.

Thank you~
Xintong Song

On Tue, Sep 22, 2020 at 10:57 AM Zhou, Brian <b.z...@dell.com> wrote:

Hi Xintong and Claude,

In our internal tests we also encountered these two issues and spent much time debugging them. There are two points I need to confirm to see whether we share the same problem:
1. Your job is using the default restart strategy, which restarts every second.
2. Your CPU resource on the jobmanager might be small.

Here are some findings I want to share.

## Metaspace OOM
Due to https://issues.apache.org/jira/browse/FLINK-15467, when there are job restarts, some threads from the source function keep hanging, so their class loader cannot be closed. Each new restart loads new classes, the metaspace keeps growing, and finally the OOM happens.

## Leader retrieving
Constant restarts can be heavy for the jobmanager; if the JM CPU resources are not enough, the thread for leader retrieving may get stuck.

Best Regards,
Brian

From: Xintong Song <tonysong...@gmail.com>
Sent: Tuesday, September 22, 2020 10:16
To: Claude M; user
Subject: Re: metaspace out-of-memory & error while retrieving the leader gateway

## Metaspace OOM
As the error message already suggests, the metaspace OOM you encountered is likely caused by a class loading leak. I think you are on the right track trying to look into the heap dump and find out where the leak comes from. IIUC, after removing the ZK folder, you are now able to run Flink with the heap dump options.

The problem does not occur in previous versions because Flink only started setting a metaspace limit in the 1.10 release. The class loading leak might have already been there, but was never discovered.
This could lead to unpredictable stability and performance issues. That's why Flink updated its memory model and explicitly set the metaspace limit in the 1.10 release.

## Leader retrieving
The command looks good to me. If this problem happens only once, it could be irrelevant to adding the options. If it does not block you from getting the heap dump, we can look into it later.

Thank you~
Xintong Song

On Mon, Sep 21, 2020 at 9:37 PM Claude M <claudemur...@gmail.com> wrote:

Hi Xintong,

Thanks for your reply. Here is the command output w/ the java.opts:

/usr/local/openjdk-8/bin/java -Xms768m -Xmx768m -XX:+UseG1GC -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/opt/flink/log -Dlog4j.configuration=file:/opt/flink/conf/log4j-console.properties -Dlogback.configurationFile=file:/opt/flink/conf/logback-console.xml -classpath /opt/flink/lib/flink-metrics-datadog-statsd-2.11-0.1.jar:/opt/flink/lib/flink-shaded-hadoop-2-uber-2.7.5-10.0.jar:/opt/flink/lib/flink-table-blink_2.11-1.10.2.jar:/opt/flink/lib/flink-table_2.11-1.10.2.jar:/opt/flink/lib/log4j-1.2.17.jar:/opt/flink/lib/slf4j-log4j12-1.7.15.jar:/opt/flink/lib/flink-dist_2.11-1.10.2.jar::/etc/hadoop/conf: org.apache.flink.runtime.entrypoint.StandaloneSessionClusterEntrypoint --configDir /opt/flink/conf --executionMode cluster

To answer your questions:
- Correct, in order for the pod to start up, I have to remove the flink app folder from zookeeper. I only have to delete it once after applying the java.opts arguments. It doesn't make sense, though, that I should have to do this just from adding a parameter.
- I'm using the standalone deployment.
- I'm using job cluster mode.

A higher-priority issue I'm trying to solve is the metaspace out-of-memory that is occurring in the task managers. This was not happening before I upgraded to Flink 1.10.2. Even after increasing the memory, I'm still encountering the problem. That is when I added the java.opts argument to see if I can get more information about the problem, and that is when I ran across the second issue w/ the job manager pod not starting up.

Thanks

On Sun, Sep 20, 2020 at 10:23 PM Xintong Song <tonysong...@gmail.com> wrote:

Hi Claude,

IIUC, in your case the leader retrieving problem is triggered by adding the `java.opts`? Then could you try to find and post the complete command for launching the JVM process? You can try logging into the pod and executing `ps -ef | grep <PID>`.

A few more questions:
- What do you mean by "resolve this"? Does the jobmanager pod get stuck there, and recover when you remove the folder from ZK? Do you have to do the removal every time you submit to Kubernetes?

> The only way I can resolve this is to delete the folder from zookeeper which I shouldn't have to do.

- Which Flink Kubernetes deployment are you using? Standalone or native Kubernetes?
- Which cluster mode are you using? Job cluster, session cluster, or the application mode?
Thank you~
Xintong Song

On Sat, Sep 19, 2020 at 1:22 AM Claude M <claudemur...@gmail.com> wrote:

Hello,

I upgraded from Flink 1.7.2 to 1.10.2. One of the jobs running on the task managers is periodically crashing w/ the following error:

java.lang.OutOfMemoryError: Metaspace. The metaspace out-of-memory error has occurred. This can mean two things: either the job requires a larger size of JVM metaspace to load classes or there is a class loading leak. In the first case 'taskmanager.memory.jvm-metaspace.size' configuration option should be increased. If the error persists (usually in cluster after several job (re-)submissions) then there is probably a class loading leak which has to be investigated and fixed. The task executor has to be shutdown.

I found this issue regarding it:
https://issues.apache.org/jira/browse/FLINK-16406

I have tried increasing taskmanager.memory.jvm-metaspace.size to 256M and 512M and was still having the problem.

I then added the following to flink-conf.yaml to try to get more information about the error:

env.java.opts: -XX:+UseG1GC -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/opt/flink/log

When I deployed the change, which is in a Kubernetes cluster, the jobmanager pod fails to start up and the following message shows repeatedly:

2020-09-18 17:03:46,255 WARN org.apache.flink.runtime.webmonitor.retriever.impl.RpcGatewayRetriever - Error while retrieving the leader gateway. Retrying to connect to akka.tcp://flink@flink-jobmanager:50010/user/dispatcher.

The only way I can resolve this is to delete the folder from zookeeper, which I shouldn't have to do.

Any ideas on these issues?
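For anyone landing on this thread later, here is a minimal sketch of the knobs discussed above for a standalone Flink 1.10.x setup: raising the metaspace limit while keeping the heap dump options, slowing down restarts so leaked class loaders accumulate less quickly (Brian's point), and checking from inside a task manager pod whether job class loaders keep piling up. The config keys are the 1.10 option names; the values, the restart-strategy numbers, and the <pid> placeholder are only illustrative assumptions, not recommendations.

# flink-conf.yaml additions (example values only):
taskmanager.memory.jvm-metaspace.size: 512m
env.java.opts: -XX:+UseG1GC -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/opt/flink/log
# A slower restart strategy limits how fast leaked ChildFirstClassLoaders can build up:
restart-strategy: fixed-delay
restart-strategy.fixed-delay.attempts: 10
restart-strategy.fixed-delay.delay: 30 s

# Inside a task manager pod, JDK 8 tooling can show whether class loaders / loaded classes keep growing
# (<pid> is the task manager JVM process id):
jmap -clstats <pid>
jcmd <pid> GC.class_histogram | head -n 30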