Hi Claude,

IIUC, in your case the leader retrieval problem is triggered by adding `env.java.opts`? If so, could you find and post the complete command used to launch the JVM process? You can log in to the pod and run `ps -ef | grep <PID>`.
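If `ps` is not installed in the image or cuts off the long JVM command line, the same information can be read from `/proc` on Linux. A minimal sketch, assuming a Linux-based container; `full_cmdline` is a hypothetical helper name, not something Flink provides:

```shell
# Print the full, untruncated command line of a process.
# /proc/<PID>/cmdline stores the arguments separated by NUL bytes,
# so they are translated to spaces for readability.
# Usage (inside the pod): full_cmdline <PID>
full_cmdline() {
  tr '\0' ' ' < "/proc/$1/cmdline"
  echo
}
```

Run it with the PID of the Flink JVM process; the output should contain any `-XX:` options that were actually passed to the JVM.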
A few more questions:

- What do you mean by "resolve this"? Does the jobmanager pod get stuck there and recover when you remove the folder from ZK? Do you have to do the removal every time you submit to Kubernetes?

> The only way I can resolve this is to delete the folder from zookeeper
> which I shouldn't have to do.

- Which Flink Kubernetes deployment are you using: standalone or native Kubernetes?
- Which cluster mode are you using: job cluster, session cluster, or application mode?

Thank you~

Xintong Song

On Sat, Sep 19, 2020 at 1:22 AM Claude M <claudemur...@gmail.com> wrote:

> Hello,
>
> I upgraded from Flink 1.7.2 to 1.10.2. One of the jobs running on the
> task managers is periodically crashing w/ the following error:
>
> java.lang.OutOfMemoryError: Metaspace. The metaspace out-of-memory error
> has occurred. This can mean two things: either the job requires a larger
> size of JVM metaspace to load classes or there is a class loading leak. In
> the first case 'taskmanager.memory.jvm-metaspace.size' configuration option
> should be increased. If the error persists (usually in cluster after
> several job (re-)submissions) then there is probably a class loading leak
> which has to be investigated and fixed. The task executor has to be
> shutdown.
>
> I found this issue regarding it:
> https://issues.apache.org/jira/browse/FLINK-16406
>
> I have tried increasing the taskmanager.memory.jvm-metaspace.size to 256M
> & 512M and still was having the problem.
>
> I then added the following to the flink.conf to try to get more
> information about the error:
> env.java.opts: -XX:+UseG1GC -XX:+HeapDumpOnOutOfMemoryError
> -XX:HeapDumpPath=/opt/flink/log
>
> When I deployed the change which is in a Kubernetes cluster, the
> jobmanager pod fails to start up and the following message shows
> repeatedly:
>
> 2020-09-18 17:03:46,255 WARN
> org.apache.flink.runtime.webmonitor.retriever.impl.RpcGatewayRetriever -
> Error while retrieving the leader gateway. Retrying to connect to
> akka.tcp://flink@flink-jobmanager:50010/user/dispatcher.
>
> The only way I can resolve this is to delete the folder from zookeeper
> which I shouldn't have to do.
>
> Any ideas on these issues?