I am seeing an issue where class loaders are not being GCed and the Metaspace eventually runs out of memory (OOM). Here is my setup:
- Flink 1.13.1 on EMR using JDK 8, in session mode
- The job manager is a long-running YARN session
- New jobs are submitted every 5 minutes (and typically run for less than 5 minutes)

After a few hours the job manager gets killed with a Metaspace OOM. I tried increasing the Metaspace size for the job manager, but that only delays the OOM.

I did some debugging using jcmd and noticed that the size of the loaded classes is always increasing. Next I took a heap dump and found that instances of org.apache.flink.util.ChildFirstClassLoader are still present long after the jobs complete. Checking the GC roots, I found that they are referenced from java.io.ObjectStreamClass$Caches. This seems to be the following JDK issue: https://bugs.openjdk.java.net/browse/JDK-8277072

Are there any workarounds for this situation?
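
For reference, the class-growth check described above can also be done from inside a JVM using the standard java.lang.management beans. This is only a minimal sketch: the class name ClassLoadingSnapshot is made up for illustration, the pool name "Metaspace" is what HotSpot on JDK 8 and later reports, and to watch the job manager it would have to run inside that JVM or be adapted to a remote JMX connection.

```java
import java.lang.management.ClassLoadingMXBean;
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryPoolMXBean;
import java.lang.management.MemoryUsage;

// Hypothetical helper, not part of Flink or the JDK.
class ClassLoadingSnapshot {

    /** Returns {currently loaded, total loaded since JVM start, unloaded} class counts. */
    static long[] classCounts() {
        ClassLoadingMXBean bean = ManagementFactory.getClassLoadingMXBean();
        return new long[] {
            bean.getLoadedClassCount(),
            bean.getTotalLoadedClassCount(),
            bean.getUnloadedClassCount()
        };
    }

    /** Returns used Metaspace bytes, or -1 if no memory pool named "Metaspace" is found. */
    static long metaspaceUsedBytes() {
        for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
            if ("Metaspace".equals(pool.getName())) {
                MemoryUsage usage = pool.getUsage();
                return usage == null ? -1L : usage.getUsed();
            }
        }
        return -1L;
    }

    public static void main(String[] args) {
        long[] c = classCounts();
        System.out.printf("loaded=%d total=%d unloaded=%d metaspaceUsedBytes=%d%n",
                c[0], c[1], c[2], metaspaceUsedBytes());
    }
}
```

If the total keeps growing across job submissions while the unloaded count stays flat, the job class loaders are not being collected, which would match what the heap dump shows.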