I have 35 task managers with 1 slot each. I'm running a total of 7 jobs in the cluster, and all the slots are occupied. When you say that 33 instances of the ChildFirstClassLoader does not sound right, what should I be expecting? Could the number of jobs running in the cluster contribute to the out-of-memory errors? I used to have 26 task managers in this cluster w/ 5 jobs. I added 9 additional task managers and 2 jobs, and I noticed this problem started occurring after I made these additions. If this is the cause of the problem, how can it be resolved?
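One way to tell whether those 33 ChildFirstClassLoader instances are genuinely leaked, rather than just waiting to be collected, is to force a full GC and count the live instances on a running task manager. A minimal sketch on JDK 8, where <pid> is a placeholder for the task manager's JVM process id:

    # Find the task manager JVM inside the pod (standalone deployments
    # run org.apache.flink.runtime.taskexecutor.TaskManagerRunner)
    ps -ef | grep TaskManagerRunner

    # -histo:live triggers a full GC first, so any instances still counted
    # afterwards are strongly reachable, i.e. actually leaked
    jmap -histo:live <pid> | grep ChildFirstClassLoader

If the count settles at roughly one loader per currently running job, the loaders are being released correctly; if it keeps growing with each job submission or restart, something is pinning them.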
On Thu, Sep 24, 2020 at 1:06 AM Xintong Song <tonysong...@gmail.com> wrote:

> How many slots do you have on each task manager?
>
> Flink uses ChildFirstClassLoader for loading user code, to avoid dependency conflicts between user code and Flink's framework. Ideally, after a slot is freed and reassigned to a new job, the user class loaders of the previous job should be unloaded. 33 instances of them does not sound right. It might be worth looking into where the references that keep these instances alive come from.
>
> Flink 1.10.3 is not released yet. If you want to try the unreleased version, you would need to download the sources [1], build the Flink distribution [2], and build your custom image (start from the 1.10.2 image and replace the Flink distribution with the one you built).
>
> Thank you~
>
> Xintong Song
>
> [1] https://github.com/apache/flink/tree/release-1.10
> [2] https://ci.apache.org/projects/flink/flink-docs-release-1.10/flinkDev/building.html
>
> On Wed, Sep 23, 2020 at 8:29 PM Claude M <claudemur...@gmail.com> wrote:
>
>> It was mentioned that this issue may be fixed in 1.10.3, but there is no 1.10.3 docker image here: https://hub.docker.com/_/flink
>>
>> On Wed, Sep 23, 2020 at 7:14 AM Claude M <claudemur...@gmail.com> wrote:
>>
>>> Regarding the metaspace memory issue, I was able to get a heap dump and the following is the output:
>>>
>>> Problem Suspect 1
>>> One instance of "java.lang.ref.Finalizer" loaded by "<system class loader>" occupies 4,112,624 (11.67%) bytes. The instance is referenced by sun.misc.Cleaner @ 0xb5d6b520, loaded by "<system class loader>". The memory is accumulated in one instance of "java.lang.Object[]" loaded by "<system class loader>".
>>>
>>> Problem Suspect 2
>>> 33 instances of "org.apache.flink.util.ChildFirstClassLoader", loaded by "sun.misc.Launcher$AppClassLoader @ 0xb4068680", occupy 6,615,416 (18.76%) bytes.
>>>
>>> Based on this, I'm not clear on what needs to be done to solve this.
>>>
>>> On Tue, Sep 22, 2020 at 3:10 PM Claude M <claudemur...@gmail.com> wrote:
>>>
>>>> Thanks for your responses.
>>>> 1. There were no job restarts prior to the metaspace OOM.
>>>> 2. I tried increasing the CPU request and still encountered the problem. Any configuration change I make to the job manager, whether it's in the flink-conf.yaml or in the pod's CPU/memory requests, results in this problem.
>>>>
>>>> On Tue, Sep 22, 2020 at 12:04 AM Xintong Song <tonysong...@gmail.com> wrote:
>>>>
>>>>> Thanks for the input, Brian.
>>>>>
>>>>> This looks like what we are looking for. The issue is fixed in 1.10.3, which also matches the fact that this problem occurred in 1.10.2.
>>>>>
>>>>> Maybe Claude can further confirm it.
>>>>>
>>>>> Thank you~
>>>>>
>>>>> Xintong Song
>>>>>
>>>>> On Tue, Sep 22, 2020 at 10:57 AM Zhou, Brian <b.z...@dell.com> wrote:
>>>>>
>>>>>> Hi Xintong and Claude,
>>>>>>
>>>>>> In our internal tests, we also encountered these two issues and spent much time debugging them. There are two points I need to confirm to see if we share the same problem:
>>>>>>
>>>>>> 1. Your job is using the default restart strategy, which restarts every second (a config sketch follows this list).
>>>>>> 2. Your CPU resource on the jobmanager might be small.
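As a reference for Brian's first point: with checkpointing enabled and no strategy configured, Flink falls back to a fixed-delay restart with a one-second delay, so a persistently failing job reloads its classes roughly once per second. A hedged flink-conf.yaml sketch that slows this down; the values are illustrative, not settings taken from this thread:

    restart-strategy: fixed-delay
    restart-strategy.fixed-delay.attempts: 10
    restart-strategy.fixed-delay.delay: 30 s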
>>>>>> Here are some findings I want to share.
>>>>>>
>>>>>> ## Metaspace OOM
>>>>>>
>>>>>> Due to https://issues.apache.org/jira/browse/FLINK-15467, when there are job restarts, some threads from the source function can hang, so the class loader cannot be closed. Each restart then loads new classes, the metaspace keeps expanding, and finally the OOM happens.
>>>>>>
>>>>>> ## Leader retrieving
>>>>>>
>>>>>> Constant restarts may be heavy for the jobmanager; if the JM's CPU resources are not enough, the thread for leader retrieval may get stuck.
>>>>>>
>>>>>> Best Regards,
>>>>>> Brian
>>>>>>
>>>>>> From: Xintong Song <tonysong...@gmail.com>
>>>>>> Sent: Tuesday, September 22, 2020 10:16
>>>>>> To: Claude M; user
>>>>>> Subject: Re: metaspace out-of-memory & error while retrieving the leader gateway
>>>>>>
>>>>>> ## Metaspace OOM
>>>>>>
>>>>>> As the error message already suggests, the metaspace OOM you encountered is likely caused by a class loading leak. I think you are headed in the right direction, trying to look into the heap dump and find out where the leak comes from. IIUC, after removing the ZK folder, you are now able to run Flink with the heap dump options.
>>>>>>
>>>>>> The problem does not occur in previous versions because Flink only started setting a metaspace limit in the 1.10 release. The class loading leak might have already been there, but was never discovered. This could lead to unpredictable stability and performance issues. That's why Flink updated its memory model and explicitly set the metaspace limit in the 1.10 release.
>>>>>>
>>>>>> ## Leader retrieving
>>>>>>
>>>>>> The command looks good to me. If this problem happens only once, it could be unrelated to adding the options. If it does not block you from getting the heap dump, we can look into it later.
>>>>>>
>>>>>> Thank you~
>>>>>>
>>>>>> Xintong Song
>>>>>>
>>>>>> On Mon, Sep 21, 2020 at 9:37 PM Claude M <claudemur...@gmail.com> wrote:
>>>>>>
>>>>>> Hi Xintong,
>>>>>>
>>>>>> Thanks for your reply. Here is the command output w/ the java.opts:
>>>>>>
>>>>>> /usr/local/openjdk-8/bin/java -Xms768m -Xmx768m -XX:+UseG1GC -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/opt/flink/log -Dlog4j.configuration=file:/opt/flink/conf/log4j-console.properties -Dlogback.configurationFile=file:/opt/flink/conf/logback-console.xml -classpath /opt/flink/lib/flink-metrics-datadog-statsd-2.11-0.1.jar:/opt/flink/lib/flink-shaded-hadoop-2-uber-2.7.5-10.0.jar:/opt/flink/lib/flink-table-blink_2.11-1.10.2.jar:/opt/flink/lib/flink-table_2.11-1.10.2.jar:/opt/flink/lib/log4j-1.2.17.jar:/opt/flink/lib/slf4j-log4j12-1.7.15.jar:/opt/flink/lib/flink-dist_2.11-1.10.2.jar::/etc/hadoop/conf: org.apache.flink.runtime.entrypoint.StandaloneSessionClusterEntrypoint --configDir /opt/flink/conf --executionMode cluster
>>>>>>
>>>>>> To answer your questions:
>>>>>>
>>>>>> - Correct, in order for the pod to start up, I have to remove the flink app folder from zookeeper. I only have to delete it once after applying the java.opts arguments. It doesn't make sense, though, that I should have to do this just from adding a parameter.
>>>>>> - I'm using the standalone deployment.
>>>>>> - I'm using job cluster mode.
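Besides the heap dump, the suspected leak can be watched live. The launch command above shows the cluster runs on JDK 8, where jstat reports metaspace capacity and usage in the MC/MU columns; again, <pid> is a placeholder for the task manager JVM:

    # Sample GC and metaspace stats every 5 seconds (values in KB).
    # MU climbing steadily across job restarts, without ever dropping
    # after a full GC, is the signature of a class loading leak.
    jstat -gc <pid> 5s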
>>>>>> A higher priority issue I'm trying to solve is the metaspace out-of-memory error that is occurring in the task managers. This was not happening before I upgraded to Flink 1.10.2. Even after increasing the memory, I'm still encountering the problem. That is when I added the java.opts argument to see if I could get more information about the problem, and when I ran across the second issue w/ the job manager pod not starting up.
>>>>>>
>>>>>> Thanks
>>>>>>
>>>>>> On Sun, Sep 20, 2020 at 10:23 PM Xintong Song <tonysong...@gmail.com> wrote:
>>>>>>
>>>>>> Hi Claude,
>>>>>>
>>>>>> IIUC, in your case the leader retrieving problem is triggered by adding the `java.opts`? Then could you try to find and post the complete command for launching the JVM process? You can try logging into the pod and executing `ps -ef | grep <PID>`.
>>>>>>
>>>>>> A few more questions:
>>>>>>
>>>>>> - What do you mean by "resolve this"? Does the jobmanager pod get stuck there, and recover when you remove the folder from ZK? Do you have to do the removal every time you submit to Kubernetes?
>>>>>>
>>>>>>   "The only way I can resolve this is to delete the folder from zookeeper which I shouldn't have to do."
>>>>>>
>>>>>> - Which Flink Kubernetes deployment are you using? Standalone or native Kubernetes?
>>>>>>
>>>>>> - Which cluster mode are you using? Job cluster, session cluster, or the application mode?
>>>>>>
>>>>>> Thank you~
>>>>>>
>>>>>> Xintong Song
>>>>>>
>>>>>> On Sat, Sep 19, 2020 at 1:22 AM Claude M <claudemur...@gmail.com> wrote:
>>>>>>
>>>>>> Hello,
>>>>>>
>>>>>> I upgraded from Flink 1.7.2 to 1.10.2. One of the jobs running on the task managers is periodically crashing w/ the following error:
>>>>>>
>>>>>> java.lang.OutOfMemoryError: Metaspace. The metaspace out-of-memory error has occurred. This can mean two things: either the job requires a larger size of JVM metaspace to load classes or there is a class loading leak. In the first case 'taskmanager.memory.jvm-metaspace.size' configuration option should be increased. If the error persists (usually in cluster after several job (re-)submissions) then there is probably a class loading leak which has to be investigated and fixed. The task executor has to be shutdown.
>>>>>>
>>>>>> I found this issue regarding it: https://issues.apache.org/jira/browse/FLINK-16406
>>>>>>
>>>>>> I have tried increasing taskmanager.memory.jvm-metaspace.size to 256M & 512M and was still having the problem.
>>>>>>
>>>>>> I then added the following to flink-conf.yaml to try to get more information about the error:
>>>>>>
>>>>>> env.java.opts: -XX:+UseG1GC -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/opt/flink/log
>>>>>>
>>>>>> When I deployed the change, which is in a Kubernetes cluster, the jobmanager pod fails to start up and the following message shows repeatedly:
>>>>>>
>>>>>> 2020-09-18 17:03:46,255 WARN org.apache.flink.runtime.webmonitor.retriever.impl.RpcGatewayRetriever - Error while retrieving the leader gateway. Retrying to connect to akka.tcp://flink@flink-jobmanager:50010/user/dispatcher.
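When the dispatcher gateway cannot be retrieved like this, it can help to inspect what leader information ZooKeeper actually holds. A rough sketch with the ZooKeeper CLI; the /flink root is the default high-availability.zookeeper.path.root, and the exact znode layout varies across Flink versions, so treat the paths as assumptions:

    # From any host that can reach the ZooKeeper quorum
    zkCli.sh -server <zk-host:2181>

    # Inside the CLI: list the HA roots and the leader entries
    ls /flink
    ls /flink/<cluster-id>/leader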
>>>>>> The only way I can resolve this is to delete the folder from zookeeper, which I shouldn't have to do.
>>>>>>
>>>>>> Any ideas on these issues?
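For completeness, the manual cleanup Claude describes (removing the stale Flink folder from ZooKeeper so the jobmanager can start) would look roughly like this. Note that deleteall requires a ZooKeeper 3.5+ CLI (on 3.4 the equivalent command is rmr), and the path is an assumption based on the default HA root:

    zkCli.sh -server <zk-host:2181>

    # Inside the CLI: remove the cluster's HA subtree; Flink will
    # recreate it on the next jobmanager startup
    deleteall /flink/<cluster-id>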