I have 35 task managers with 1 slot each. I'm running a total of 7 jobs in the cluster, and all the slots are occupied. When you say that 33 instances of the ChildFirstClassLoader does not sound right, what should I be expecting? Could the number of jobs running in the cluster contribute to the out-of-memory errors? I used to have 26 task managers in this cluster w/ 5 jobs. I added 9 additional task managers and 2 jobs, and I noticed this problem started occurring after I made these additions. If this is the cause of the problem, how can it be resolved?
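One way to tell whether those 33 ChildFirstClassLoader instances are genuinely leaked, rather than just waiting to be collected, is to force a full GC and count the live instances on a running task manager. A minimal sketch on JDK 8, where <pid> is a placeholder for the task manager's JVM process id:

    # Find the task manager JVM inside the pod (standalone deployments
    # run org.apache.flink.runtime.taskexecutor.TaskManagerRunner)
    ps -ef | grep TaskManagerRunner

    # -histo:live triggers a full GC first, so any instances still counted
    # afterwards are strongly reachable, i.e. actually leaked
    jmap -histo:live <pid> | grep ChildFirstClassLoader

If the count settles at roughly one loader per currently running job, the loaders are being released correctly; if it keeps growing with each job submission or restart, something is pinning them.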
On Thu, Sep 24, 2020 at 1:06 AM Xintong Song <tonysong...@gmail.com> wrote:

> How many slots do you have on each task manager?
>
> Flink uses ChildFirstClassLoader for loading user code, to avoid dependency conflicts between user code and Flink's framework. Ideally, after a slot is freed and reassigned to a new job, the user class loaders of the previous job should be unloaded. 33 instances of them does not sound right. It might be worth looking into where the references that keep these instances alive come from.
>
> Flink 1.10.3 is not released yet. If you want to try the unreleased version, you would need to download the sources [1], build the Flink distribution [2], and build your custom image (start from the 1.10.2 image and replace the Flink distribution with the one you built).
>
> Thank you~
>
> Xintong Song
>
> [1] https://github.com/apache/flink/tree/release-1.10
> [2] https://ci.apache.org/projects/flink/flink-docs-release-1.10/flinkDev/building.html
>
> On Wed, Sep 23, 2020 at 8:29 PM Claude M <claudemur...@gmail.com> wrote:
>
>> It was mentioned that this issue may be fixed in 1.10.3, but there is no 1.10.3 docker image here: https://hub.docker.com/_/flink
>>
>> On Wed, Sep 23, 2020 at 7:14 AM Claude M <claudemur...@gmail.com> wrote:
>>
>>> Regarding the metaspace memory issue, I was able to get a heap dump and the following is the output:
>>>
>>> Problem Suspect 1
>>> One instance of "java.lang.ref.Finalizer" loaded by "<system class loader>" occupies 4,112,624 (11.67%) bytes. The instance is referenced by sun.misc.Cleaner @ 0xb5d6b520, loaded by "<system class loader>". The memory is accumulated in one instance of "java.lang.Object[]" loaded by "<system class loader>".
>>>
>>> Problem Suspect 2
>>> 33 instances of "org.apache.flink.util.ChildFirstClassLoader", loaded by "sun.misc.Launcher$AppClassLoader @ 0xb4068680", occupy 6,615,416 (18.76%) bytes.
>>>
>>> Based on this, I'm not clear on what needs to be done to solve this.
>>>
>>> On Tue, Sep 22, 2020 at 3:10 PM Claude M <claudemur...@gmail.com> wrote:
>>>
>>>> Thanks for your responses.
>>>> 1. There were no job restarts prior to the metaspace OOM.
>>>> 2. I tried increasing the CPU request and still encountered the problem. Any configuration change I make to the job manager, whether it's in the flink-conf.yaml or in the pod's CPU/memory requests, results in this problem.
>>>>
>>>> On Tue, Sep 22, 2020 at 12:04 AM Xintong Song <tonysong...@gmail.com> wrote:
>>>>
>>>>> Thanks for the input, Brian.
>>>>>
>>>>> This looks like what we are looking for. The issue is fixed in 1.10.3, which also matches the fact that this problem occurred in 1.10.2.
>>>>>
>>>>> Maybe Claude can further confirm it.
>>>>>
>>>>> Thank you~
>>>>>
>>>>> Xintong Song
>>>>>
>>>>> On Tue, Sep 22, 2020 at 10:57 AM Zhou, Brian <b.z...@dell.com> wrote:
>>>>>
>>>>>> Hi Xintong and Claude,
>>>>>>
>>>>>> In our internal tests, we also encountered these two issues and spent much time debugging them. There are two points I need to confirm to see if we share the same problem:
>>>>>>
>>>>>> 1. Your job is using the default restart strategy, which restarts every second (a config sketch follows this list).
>>>>>> 2. Your CPU resource on the jobmanager might be small.
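As a reference for Brian's first point: with checkpointing enabled and no strategy configured, Flink falls back to a fixed-delay restart with a one-second delay, so a persistently failing job reloads its classes roughly once per second. A hedged flink-conf.yaml sketch that slows this down; the values are illustrative, not settings taken from this thread:

    restart-strategy: fixed-delay
    restart-strategy.fixed-delay.attempts: 10
    restart-strategy.fixed-delay.delay: 30 s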
>>>>>> Here are some findings I want to share.
>>>>>>
>>>>>> ## Metaspace OOM
>>>>>>
>>>>>> Due to https://issues.apache.org/jira/browse/FLINK-15467, when there are job restarts, some threads from the source function can hang, so the class loader cannot be closed. Each restart then loads new classes, the metaspace keeps expanding, and finally the OOM happens.
>>>>>>
>>>>>> ## Leader retrieving
>>>>>>
>>>>>> Constant restarts may be heavy for the jobmanager; if the JM's CPU resources are not enough, the thread for leader retrieval may get stuck.
>>>>>>
>>>>>> Best Regards,
>>>>>> Brian
>>>>>>
>>>>>> From: Xintong Song <tonysong...@gmail.com>
>>>>>> Sent: Tuesday, September 22, 2020 10:16
>>>>>> To: Claude M; user
>>>>>> Subject: Re: metaspace out-of-memory & error while retrieving the leader gateway
>>>>>>
>>>>>> ## Metaspace OOM
>>>>>>
>>>>>> As the error message already suggests, the metaspace OOM you encountered is likely caused by a class loading leak. I think you are headed in the right direction, trying to look into the heap dump and find out where the leak comes from. IIUC, after removing the ZK folder, you are now able to run Flink with the heap dump options.
>>>>>>
>>>>>> The problem does not occur in previous versions because Flink only started setting a metaspace limit in the 1.10 release. The class loading leak might have already been there, but was never discovered. This could lead to unpredictable stability and performance issues. That's why Flink updated its memory model and explicitly set the metaspace limit in the 1.10 release.
>>>>>>
>>>>>> ## Leader retrieving
>>>>>>
>>>>>> The command looks good to me. If this problem happens only once, it could be unrelated to adding the options. If it does not block you from getting the heap dump, we can look into it later.
>>>>>>
>>>>>> Thank you~
>>>>>>
>>>>>> Xintong Song
>>>>>>
>>>>>> On Mon, Sep 21, 2020 at 9:37 PM Claude M <claudemur...@gmail.com> wrote:
>>>>>>
>>>>>> Hi Xintong,
>>>>>>
>>>>>> Thanks for your reply. Here is the command output w/ the java.opts:
>>>>>>
>>>>>> /usr/local/openjdk-8/bin/java -Xms768m -Xmx768m -XX:+UseG1GC -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/opt/flink/log -Dlog4j.configuration=file:/opt/flink/conf/log4j-console.properties -Dlogback.configurationFile=file:/opt/flink/conf/logback-console.xml -classpath /opt/flink/lib/flink-metrics-datadog-statsd-2.11-0.1.jar:/opt/flink/lib/flink-shaded-hadoop-2-uber-2.7.5-10.0.jar:/opt/flink/lib/flink-table-blink_2.11-1.10.2.jar:/opt/flink/lib/flink-table_2.11-1.10.2.jar:/opt/flink/lib/log4j-1.2.17.jar:/opt/flink/lib/slf4j-log4j12-1.7.15.jar:/opt/flink/lib/flink-dist_2.11-1.10.2.jar::/etc/hadoop/conf: org.apache.flink.runtime.entrypoint.StandaloneSessionClusterEntrypoint --configDir /opt/flink/conf --executionMode cluster
>>>>>>
>>>>>> To answer your questions:
>>>>>>
>>>>>> - Correct, in order for the pod to start up, I have to remove the flink app folder from zookeeper. I only have to delete it once after applying the java.opts arguments. It doesn't make sense, though, that I should have to do this just from adding a parameter.
>>>>>> - I'm using the standalone deployment.
>>>>>> - I'm using job cluster mode.
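Besides the heap dump, the suspected leak can be watched live. The launch command above shows the cluster runs on JDK 8, where jstat reports metaspace capacity and usage in the MC/MU columns; again, <pid> is a placeholder for the task manager JVM:

    # Sample GC and metaspace stats every 5 seconds (values in KB).
    # MU climbing steadily across job restarts, without ever dropping
    # after a full GC, is the signature of a class loading leak.
    jstat -gc <pid> 5s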
>>>>>> A higher priority issue I'm trying to solve is the metaspace out-of-memory error that is occurring in the task managers. This was not happening before I upgraded to Flink 1.10.2. Even after increasing the memory, I'm still encountering the problem. That is when I added the java.opts argument to see if I could get more information about the problem, and when I ran across the second issue w/ the job manager pod not starting up.
>>>>>>
>>>>>> Thanks
>>>>>>
>>>>>> On Sun, Sep 20, 2020 at 10:23 PM Xintong Song <tonysong...@gmail.com> wrote:
>>>>>>
>>>>>> Hi Claude,
>>>>>>
>>>>>> IIUC, in your case the leader retrieving problem is triggered by adding the `java.opts`? Then could you try to find and post the complete command for launching the JVM process? You can try logging into the pod and executing `ps -ef | grep <PID>`.
>>>>>>
>>>>>> A few more questions:
>>>>>>
>>>>>> - What do you mean by "resolve this"? Does the jobmanager pod get stuck there, and recover when you remove the folder from ZK? Do you have to do the removal every time you submit to Kubernetes?
>>>>>>
>>>>>>   "The only way I can resolve this is to delete the folder from zookeeper which I shouldn't have to do."
>>>>>>
>>>>>> - Which Flink Kubernetes deployment are you using? Standalone or native Kubernetes?
>>>>>>
>>>>>> - Which cluster mode are you using? Job cluster, session cluster, or the application mode?
>>>>>>
>>>>>> Thank you~
>>>>>>
>>>>>> Xintong Song
>>>>>>
>>>>>> On Sat, Sep 19, 2020 at 1:22 AM Claude M <claudemur...@gmail.com> wrote:
>>>>>>
>>>>>> Hello,
>>>>>>
>>>>>> I upgraded from Flink 1.7.2 to 1.10.2. One of the jobs running on the task managers is periodically crashing w/ the following error:
>>>>>>
>>>>>> java.lang.OutOfMemoryError: Metaspace. The metaspace out-of-memory error has occurred. This can mean two things: either the job requires a larger size of JVM metaspace to load classes or there is a class loading leak. In the first case 'taskmanager.memory.jvm-metaspace.size' configuration option should be increased. If the error persists (usually in cluster after several job (re-)submissions) then there is probably a class loading leak which has to be investigated and fixed. The task executor has to be shutdown.
>>>>>>
>>>>>> I found this issue regarding it: https://issues.apache.org/jira/browse/FLINK-16406
>>>>>>
>>>>>> I have tried increasing taskmanager.memory.jvm-metaspace.size to 256M & 512M and was still having the problem.
>>>>>>
>>>>>> I then added the following to flink-conf.yaml to try to get more information about the error:
>>>>>>
>>>>>> env.java.opts: -XX:+UseG1GC -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/opt/flink/log
>>>>>>
>>>>>> When I deployed the change, which is in a Kubernetes cluster, the jobmanager pod fails to start up and the following message shows repeatedly:
>>>>>>
>>>>>> 2020-09-18 17:03:46,255 WARN org.apache.flink.runtime.webmonitor.retriever.impl.RpcGatewayRetriever - Error while retrieving the leader gateway. Retrying to connect to akka.tcp://flink@flink-jobmanager:50010/user/dispatcher.
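When the dispatcher gateway cannot be retrieved like this, it can help to inspect what leader information ZooKeeper actually holds. A rough sketch with the ZooKeeper CLI; the /flink root is the default high-availability.zookeeper.path.root, and the exact znode layout varies across Flink versions, so treat the paths as assumptions:

    # From any host that can reach the ZooKeeper quorum
    zkCli.sh -server <zk-host:2181>

    # Inside the CLI: list the HA roots and the leader entries
    ls /flink
    ls /flink/<cluster-id>/leader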
>>>>>> The only way I can resolve this is to delete the folder from zookeeper, which I shouldn't have to do.
>>>>>>
>>>>>> Any ideas on these issues?
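For completeness, the manual cleanup Claude describes (removing the stale Flink folder from ZooKeeper so the jobmanager can start) would look roughly like this. Note that deleteall requires a ZooKeeper 3.5+ CLI (on 3.4 the equivalent command is rmr), and the path is an assumption based on the default HA root:

    zkCli.sh -server <zk-host:2181>

    # Inside the CLI: remove the cluster's HA subtree; Flink will
    # recreate it on the next jobmanager startup
    deleteall /flink/<cluster-id>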