It was mentioned that this issue may be fixed in 1.10.3, but there is no 1.10.3 Docker image here: https://hub.docker.com/_/flink
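In the meantime, a custom image could be built with the apache/flink-docker tooling once a 1.10.3 distribution is available to download. A rough sketch follows; the add-custom.sh flags, the download URL, and the generated directory name are assumptions to verify against that repository's README:

# Clone the repo the official Docker Hub images are built from
git clone https://github.com/apache/flink-docker.git
cd flink-docker

# Generate a Dockerfile for a custom Flink distribution URL
# (flag names and output location are assumptions; check the README first)
./add-custom.sh -u https://archive.apache.org/dist/flink/flink-1.10.3/flink-1.10.3-bin-scala_2.11.tgz -n flink-1.10.3

# Build and tag the image from whatever directory the script generated under dev/
docker build -t <your-registry>/flink:1.10.3-custom dev/flink-1.10.3-debian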
On Wed, Sep 23, 2020 at 7:14 AM Claude M <claudemur...@gmail.com> wrote:

In regards to the metaspace memory issue, I was able to get a heap dump and the following is the output:

Problem Suspect 1
One instance of "java.lang.ref.Finalizer" loaded by "<system class loader>" occupies 4,112,624 (11.67%) bytes. The instance is referenced by sun.misc.Cleaner @ 0xb5d6b520, loaded by "<system class loader>". The memory is accumulated in one instance of "java.lang.Object[]" loaded by "<system class loader>".

Problem Suspect 2
33 instances of "org.apache.flink.util.ChildFirstClassLoader", loaded by "sun.misc.Launcher$AppClassLoader @ 0xb4068680", occupy 6,615,416 (18.76%) bytes.

Based on this, I'm not clear on what needs to be done to solve this.

On Tue, Sep 22, 2020 at 3:10 PM Claude M <claudemur...@gmail.com> wrote:

Thanks for your responses.
1. There were no job restarts prior to the metaspace OOM.
2. I tried increasing the CPU request and still encountered the problem. Any configuration change I make to the job manager, whether it's in the flink-conf.yaml or an increase to the pod's CPU/memory request, results in this problem.

On Tue, Sep 22, 2020 at 12:04 AM Xintong Song <tonysong...@gmail.com> wrote:

Thanks for the input, Brian.

This looks like what we are looking for. The issue is fixed in 1.10.3, which also matches the fact that this problem occurred in 1.10.2.

Maybe Claude can further confirm it.

Thank you~
Xintong Song

On Tue, Sep 22, 2020 at 10:57 AM Zhou, Brian <b.z...@dell.com> wrote:

Hi Xintong and Claude,

In our internal tests we also encountered these two issues and spent much time debugging them. There are two points I need to confirm to see whether we share the same problem:
1. Your job is using the default restart strategy, which restarts every second.
2. Your CPU resource on the jobmanager might be small.

Here are some findings I want to share.

## Metaspace OOM
Due to https://issues.apache.org/jira/browse/FLINK-15467, when there are job restarts, some threads from the source function keep hanging, so their class loader cannot be closed. Each new restart loads new classes, the metaspace keeps growing, and finally the OOM happens.

## Leader retrieving
Constant restarts can be heavy for the jobmanager; if the JM CPU resources are not enough, the thread for leader retrieving may get stuck.

Best Regards,
Brian

From: Xintong Song <tonysong...@gmail.com>
Sent: Tuesday, September 22, 2020 10:16
To: Claude M; user
Subject: Re: metaspace out-of-memory & error while retrieving the leader gateway

## Metaspace OOM
As the error message already suggests, the metaspace OOM you encountered is likely caused by a class loading leak. I think you are on the right track trying to look into the heap dump and find out where the leak comes from. IIUC, after removing the ZK folder, you are now able to run Flink with the heap dump options.

The problem does not occur in previous versions because Flink only started setting a metaspace limit in the 1.10 release. The class loading leak might have already been there, but was never discovered.
This could lead to unpredictable stability and performance issues. That's why Flink updated its memory model and explicitly set the metaspace limit in the 1.10 release.

## Leader retrieving
The command looks good to me. If this problem happens only once, it could be irrelevant to adding the options. If it does not block you from getting the heap dump, we can look into it later.

Thank you~
Xintong Song

On Mon, Sep 21, 2020 at 9:37 PM Claude M <claudemur...@gmail.com> wrote:

Hi Xintong,

Thanks for your reply. Here is the command output w/ the java.opts:

/usr/local/openjdk-8/bin/java -Xms768m -Xmx768m -XX:+UseG1GC -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/opt/flink/log -Dlog4j.configuration=file:/opt/flink/conf/log4j-console.properties -Dlogback.configurationFile=file:/opt/flink/conf/logback-console.xml -classpath /opt/flink/lib/flink-metrics-datadog-statsd-2.11-0.1.jar:/opt/flink/lib/flink-shaded-hadoop-2-uber-2.7.5-10.0.jar:/opt/flink/lib/flink-table-blink_2.11-1.10.2.jar:/opt/flink/lib/flink-table_2.11-1.10.2.jar:/opt/flink/lib/log4j-1.2.17.jar:/opt/flink/lib/slf4j-log4j12-1.7.15.jar:/opt/flink/lib/flink-dist_2.11-1.10.2.jar::/etc/hadoop/conf: org.apache.flink.runtime.entrypoint.StandaloneSessionClusterEntrypoint --configDir /opt/flink/conf --executionMode cluster

To answer your questions:
- Correct, in order for the pod to start up, I have to remove the flink app folder from zookeeper. I only have to delete it once after applying the java.opts arguments. It doesn't make sense, though, that I should have to do this just from adding a parameter.
- I'm using the standalone deployment.
- I'm using job cluster mode.

A higher-priority issue I'm trying to solve is the metaspace out-of-memory that is occurring in the task managers. This was not happening before I upgraded to Flink 1.10.2. Even after increasing the memory, I'm still encountering the problem. That is when I added the java.opts argument to see if I can get more information about the problem, and that is when I ran across the second issue w/ the job manager pod not starting up.

Thanks

On Sun, Sep 20, 2020 at 10:23 PM Xintong Song <tonysong...@gmail.com> wrote:

Hi Claude,

IIUC, in your case the leader retrieving problem is triggered by adding the `java.opts`? Then could you try to find and post the complete command for launching the JVM process? You can try logging into the pod and executing `ps -ef | grep <PID>`.

A few more questions:
- What do you mean by "resolve this"? Does the jobmanager pod get stuck there, and recover when you remove the folder from ZK? Do you have to do the removal every time you submit to Kubernetes?

> The only way I can resolve this is to delete the folder from zookeeper which I shouldn't have to do.

- Which Flink Kubernetes deployment are you using? Standalone or native Kubernetes?
- Which cluster mode are you using? Job cluster, session cluster, or the application mode?
Thank you~
Xintong Song

On Sat, Sep 19, 2020 at 1:22 AM Claude M <claudemur...@gmail.com> wrote:

Hello,

I upgraded from Flink 1.7.2 to 1.10.2. One of the jobs running on the task managers is periodically crashing w/ the following error:

java.lang.OutOfMemoryError: Metaspace. The metaspace out-of-memory error has occurred. This can mean two things: either the job requires a larger size of JVM metaspace to load classes or there is a class loading leak. In the first case 'taskmanager.memory.jvm-metaspace.size' configuration option should be increased. If the error persists (usually in cluster after several job (re-)submissions) then there is probably a class loading leak which has to be investigated and fixed. The task executor has to be shutdown.

I found this issue regarding it:
https://issues.apache.org/jira/browse/FLINK-16406

I have tried increasing taskmanager.memory.jvm-metaspace.size to 256M and 512M and was still having the problem.

I then added the following to flink-conf.yaml to try to get more information about the error:

env.java.opts: -XX:+UseG1GC -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/opt/flink/log

When I deployed the change, which is in a Kubernetes cluster, the jobmanager pod fails to start up and the following message shows repeatedly:

2020-09-18 17:03:46,255 WARN org.apache.flink.runtime.webmonitor.retriever.impl.RpcGatewayRetriever - Error while retrieving the leader gateway. Retrying to connect to akka.tcp://flink@flink-jobmanager:50010/user/dispatcher.

The only way I can resolve this is to delete the folder from zookeeper, which I shouldn't have to do.

Any ideas on these issues?
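For anyone landing on this thread later, here is a minimal sketch of the knobs discussed above for a standalone Flink 1.10.x setup: raising the metaspace limit while keeping the heap dump options, slowing down restarts so leaked class loaders accumulate less quickly (Brian's point), and checking from inside a task manager pod whether job class loaders keep piling up. The config keys are the 1.10 option names; the values, the restart-strategy numbers, and the <pid> placeholder are only illustrative assumptions, not recommendations.

# flink-conf.yaml additions (example values only):
taskmanager.memory.jvm-metaspace.size: 512m
env.java.opts: -XX:+UseG1GC -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/opt/flink/log
# A slower restart strategy limits how fast leaked ChildFirstClassLoaders can build up:
restart-strategy: fixed-delay
restart-strategy.fixed-delay.attempts: 10
restart-strategy.fixed-delay.delay: 30 s

# Inside a task manager pod, JDK 8 tooling can show whether class loaders / loaded classes keep growing
# (<pid> is the task manager JVM process id):
jmap -clstats <pid>
jcmd <pid> GC.class_histogram | head -n 30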