Hello, I upgraded from Flink 1.7.2 to 1.10.2. One of the jobs running on the task managers is periodically crashing with the following error:

java.lang.OutOfMemoryError: Metaspace. The metaspace out-of-memory error has occurred. This can mean two things: either the job requires a larger size of JVM metaspace to load classes or there is a class loading leak. In the first case 'taskmanager.memory.jvm-metaspace.size' configuration option should be increased. If the error persists (usually in cluster after several job (re-)submissions) then there is probably a class loading leak which has to be investigated and fixed. The task executor has to be shutdown.

I found this issue regarding it: https://issues.apache.org/jira/browse/FLINK-16406

I have tried increasing taskmanager.memory.jvm-metaspace.size to 256M and then 512M, but the problem persisted.

I then added the following to flink-conf.yaml to try to get more information about the error:

env.java.opts: -XX:+UseG1GC -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/opt/flink/log

When I deployed the change (the cluster runs in Kubernetes), the jobmanager pod failed to start up, and the following message is logged repeatedly:

2020-09-18 17:03:46,255 WARN org.apache.flink.runtime.webmonitor.retriever.impl.RpcGatewayRetriever - Error while retrieving the leader gateway. Retrying to connect to akka.tcp://flink@flink-jobmanager:50010/user/dispatcher.

The only way I can resolve this is to delete the folder from ZooKeeper, which I shouldn't have to do.

Any ideas on these issues?
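
For reference, the two settings mentioned above would sit together in the configuration file roughly as follows. This is only a sketch of the relevant fragment, assuming the 512M metaspace value from my second attempt; everything else in the file is omitted.

```yaml
# Sketch of the relevant fragment only; all other settings are omitted.

# Metaspace limit for the task manager JVMs (256m was tried first, then 512m).
taskmanager.memory.jvm-metaspace.size: 512m

# JVM options; env.java.opts is passed to every Flink JVM,
# so it affects the jobmanager as well as the task managers.
env.java.opts: -XX:+UseG1GC -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/opt/flink/log
```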