## Metaspace OOM

As the error message already suggests, the metaspace OOM you encountered is likely caused by a class loading leak. I think you are on the right track trying to look into the heap dump and find out where the leak comes from. IIUC, after removing the ZK folder, you are now able to run Flink with the heap dump options.
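While you wait for an OOM-triggered dump, you can also take a quick look at the running TaskManager with the standard JDK 8 tools. A rough sketch (`<tm-pid>` is a placeholder for the TaskManager JVM process id inside the pod):

```sh
# Per-classloader statistics; a number of loaders / loaded classes that keeps
# growing across job (re-)submissions usually points to the leak.
jmap -clstats <tm-pid>

# Histogram of loaded classes, useful for spotting duplicated user classes.
jcmd <tm-pid> GC.class_histogram > class_histogram.txt

# Take a heap dump on demand (in addition to -XX:+HeapDumpOnOutOfMemoryError)
# and inspect the classloaders offline, e.g. with Eclipse MAT.
jcmd <tm-pid> GC.heap_dump /opt/flink/log/taskmanager.hprof
```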
The problem does not occur in previous versions because Flink only started setting an explicit metaspace limit in the 1.10 release. The class loading leak might have already been there but was never discovered, which could lead to unpredictable stability and performance issues. That's why Flink updated its memory model and explicitly sets the metaspace limit since the 1.10 release.

## Leader retrieving

The command looks good to me. If this problem happened only once, it may be unrelated to adding the options. If it does not block you from getting the heap dump, we can look into it later.

Thank you~

Xintong Song

On Mon, Sep 21, 2020 at 9:37 PM Claude M <claudemur...@gmail.com> wrote:

> Hi Xintong,
>
> Thanks for your reply. Here is the command output w/ the java.opts:
>
> /usr/local/openjdk-8/bin/java -Xms768m -Xmx768m -XX:+UseG1GC
> -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/opt/flink/log
> -Dlog4j.configuration=file:/opt/flink/conf/log4j-console.properties
> -Dlogback.configurationFile=file:/opt/flink/conf/logback-console.xml
> -classpath
> /opt/flink/lib/flink-metrics-datadog-statsd-2.11-0.1.jar:/opt/flink/lib/flink-shaded-hadoop-2-uber-2.7.5-10.0.jar:/opt/flink/lib/flink-table-blink_2.11-1.10.2.jar:/opt/flink/lib/flink-table_2.11-1.10.2.jar:/opt/flink/lib/log4j-1.2.17.jar:/opt/flink/lib/slf4j-log4j12-1.7.15.jar:/opt/flink/lib/flink-dist_2.11-1.10.2.jar::/etc/hadoop/conf:
> org.apache.flink.runtime.entrypoint.StandaloneSessionClusterEntrypoint
> --configDir /opt/flink/conf --executionMode cluster
>
> To answer your questions:
>
> - Correct, in order for the pod to start up, I have to remove the flink app folder from zookeeper. I only have to delete it once, after applying the java.opts arguments. It doesn't make sense, though, that I should have to do this just from adding a parameter.
> - I'm using the standalone deployment.
> - I'm using job cluster mode.
>
> A higher-priority issue I'm trying to solve is the metaspace out-of-memory error that is occurring in the task managers. This was not happening before I upgraded to Flink 1.10.2. Even after increasing the memory, I'm still encountering the problem. That is when I added the java.opts argument to see if I can get more information about the problem, and when I ran across the second issue w/ the job manager pod not starting up.
>
> Thanks
>
> On Sun, Sep 20, 2020 at 10:23 PM Xintong Song <tonysong...@gmail.com> wrote:
>
>> Hi Claude,
>>
>> IIUC, in your case the leader retrieving problem is triggered by adding the `java.opts`? Then could you try to find and post the complete command for launching the JVM process? You can log into the pod and execute `ps -ef | grep <PID>`.
>>
>> A few more questions:
>> - What do you mean by "resolve this"? Does the jobmanager pod get stuck there, and recover when you remove the folder from ZK? Do you have to do the removal every time you submit to Kubernetes?
>>
>>> The only way I can resolve this is to delete the folder from zookeeper which I shouldn't have to do.
>>>
>> - Which of Flink's Kubernetes deployments are you using, standalone or native Kubernetes?
>> - Which cluster mode are you using? Job cluster, session cluster, or the application mode?
>>
>> Thank you~
>>
>> Xintong Song
>>
>> On Sat, Sep 19, 2020 at 1:22 AM Claude M <claudemur...@gmail.com> wrote:
>>
>>> Hello,
>>>
>>> I upgraded from Flink 1.7.2 to 1.10.2. One of the jobs running on the task managers is periodically crashing w/ the following error:
>>>
>>> java.lang.OutOfMemoryError: Metaspace. The metaspace out-of-memory error has occurred. This can mean two things: either the job requires a larger size of JVM metaspace to load classes or there is a class loading leak. In the first case 'taskmanager.memory.jvm-metaspace.size' configuration option should be increased. If the error persists (usually in cluster after several job (re-)submissions) then there is probably a class loading leak which has to be investigated and fixed. The task executor has to be shutdown.
>>>
>>> I found this issue regarding it: https://issues.apache.org/jira/browse/FLINK-16406
>>>
>>> I have tried increasing the taskmanager.memory.jvm-metaspace.size to 256M & 512M and was still having the problem.
>>>
>>> I then added the following to the flink.conf to try to get more information about the error:
>>> env.java.opts: -XX:+UseG1GC -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/opt/flink/log
>>>
>>> When I deployed the change, which is in a Kubernetes cluster, the jobmanager pod fails to start up and the following message shows repeatedly:
>>>
>>> 2020-09-18 17:03:46,255 WARN org.apache.flink.runtime.webmonitor.retriever.impl.RpcGatewayRetriever - Error while retrieving the leader gateway. Retrying to connect to akka.tcp://flink@flink-jobmanager:50010/user/dispatcher.
>>>
>>> The only way I can resolve this is to delete the folder from zookeeper, which I shouldn't have to do.
>>>
>>> Any ideas on these issues?
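For reference, the two settings we have been discussing sit together in flink-conf.yaml. A minimal sketch with the values you mentioned (purely illustrative, adjust to your workload):

```yaml
# flink-conf.yaml (sketch; option names and values taken from this thread)

# Metaspace budget of the TaskManagers. Since 1.10, Flink enforces this
# limit via -XX:MaxMetaspaceSize, which is what surfaces the OOM error.
taskmanager.memory.jvm-metaspace.size: 512m

# Extra JVM options, e.g. to capture a heap dump when the OOM occurs.
env.java.opts: -XX:+UseG1GC -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/opt/flink/log
```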