Hello,

I upgraded from Flink 1.7.2 to 1.10.2. One of the jobs running on the task
managers is periodically crashing with the following error:

java.lang.OutOfMemoryError: Metaspace. The metaspace out-of-memory error
has occurred. This can mean two things: either the job requires a larger
size of JVM metaspace to load classes or there is a class loading leak. In
the first case 'taskmanager.memory.jvm-metaspace.size' configuration option
should be increased. If the error persists (usually in cluster after
several job (re-)submissions) then there is probably a class loading leak
which has to be investigated and fixed. The task executor has to be
shutdown.

I found this issue regarding it:
https://issues.apache.org/jira/browse/FLINK-16406

I tried increasing taskmanager.memory.jvm-metaspace.size to 256M and then
to 512M, but the problem persisted.

I then added the following to flink-conf.yaml to try to get more
information about the error:
env.java.opts: -XX:+UseG1GC -XX:+HeapDumpOnOutOfMemoryError
-XX:HeapDumpPath=/opt/flink/log
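
For reference, here is a rough sketch of how these settings sit together in
flink-conf.yaml. The values are the ones mentioned above. The commented-out
env.java.opts.taskmanager line is an assumption on my part and untested; it
is shown only as a possible way to limit the JVM flags to the task managers,
since env.java.opts applies to every Flink JVM, including the jobmanager.

```yaml
# Metaspace limit for the task manager JVMs (256m was tried as well).
taskmanager.memory.jvm-metaspace.size: 512m

# JVM options, kept on a single line here.
# env.java.opts applies to all Flink JVMs (jobmanager and task managers).
env.java.opts: -XX:+UseG1GC -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/opt/flink/log

# Untested alternative (assumption): restrict the flags to the task managers.
# env.java.opts.taskmanager: -XX:+UseG1GC -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/opt/flink/log
```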

When I deployed the change to our Kubernetes cluster, the jobmanager pod
failed to start up and logged the following message repeatedly:

2020-09-18 17:03:46,255 WARN
 org.apache.flink.runtime.webmonitor.retriever.impl.RpcGatewayRetriever  -
Error while retrieving the leader gateway. Retrying to connect to
akka.tcp://flink@flink-jobmanager:50010/user/dispatcher.

The only way I can resolve this is to delete the folder from ZooKeeper,
which I shouldn't have to do.

Any ideas on these issues?
