Hello everybody,

I was writing this email when a similar thread appeared on this mailing list. The difference is that the other problem seems to be related to Flink 1.10 on YARN and does not output anything helpful for debugging the cause of the problem.
In my case I use Flink 1.11.0 on a standalone session cluster (the job is submitted to the cluster using the CLI client). The problem arises when I submit the same job about 20 times (this number is unfortunately not deterministic and can vary a little). The error reported by the Task Executor is related to the ever-growing Metaspace, and the error message seems to be pretty detailed [1].

I found the same issue in some previous threads on this mailing list and I've tried to figure out the cause of the problem. The issue is that, looking at the allocated objects, I don't really get an idea of the source of the problem, because the types of objects consuming the memory are general purpose (i.e. byte arrays, int arrays and Strings). These are my top memory consumers according to the output of jmap -histo <PID>:

At run 0:

 num     #instances         #bytes  class name (module)
-------------------------------------------------------
   1:         46238       13224056  [B (java.base@11.0.9.1)
   2:          3736        6536672  [I (java.base@11.0.9.1)
   3:         38081         913944  java.lang.String (java.base@11.0.9.1)
   4:            26         852384  [Lakka.dispatch.forkjoin.ForkJoinTask;
   5:          7146         844984  java.lang.Class (java.base@11.0.9.1)

At run 1:

   1:        77.608     25.317.496  [B (java.base@11.0.9.1)
   2:         7.004      9.088.360  [I (java.base@11.0.9.1)
   3:        15.814      1.887.256  java.lang.Class (java.base@11.0.9.1)
   4:        67.381      1.617.144  java.lang.String (java.base@11.0.9.1)
   5:         3.906      1.422.960  [Ljava.util.HashMap$Node; (java.base@11.0.9.1)

At run 6:

   1:        81.408     25.375.400  [B (java.base@11.0.9.1)
   2:        12.479      7.249.392  [I (java.base@11.0.9.1)
   3:        29.090      3.496.168  java.lang.Class (java.base@11.0.9.1)
   4:         4.347      2.813.416  [Ljava.util.HashMap$Node; (java.base@11.0.9.1)
   5:        71.584      1.718.016  java.lang.String (java.base@11.0.9.1)

At run 8:

   1:       985.979    127.193.256  [B (java.base@11.0.9.1)
   2:        35.400     13.702.112  [I (java.base@11.0.9.1)
   3:       260.387      6.249.288  java.lang.String (java.base@11.0.9.1)
   4:       148.836      5.953.440  java.util.HashMap$KeyIterator (java.base@11.0.9.1)
   5:        17.641      5.222.344  [Ljava.util.HashMap$Node; (java.base@11.0.9.1)

Thanks in advance for any help,
Flavio

[1]
--------------------------------------------------------------------------------------------------
java.lang.OutOfMemoryError: Metaspace. The metaspace out-of-memory error has occurred. This can mean two things: either the job requires a larger size of JVM metaspace to load classes or there is a class loading leak. In the first case 'taskmanager.memory.jvm-metaspace.size' configuration option should be increased. If the error persists (usually in cluster after several job (re-)submissions) then there is probably a class loading leak in user code or some of its dependencies which has to be investigated and fixed. The task executor has to be shutdown...
    at java.lang.ClassLoader.defineClass1(Native Method) ~[?:?]
    at java.lang.ClassLoader.defineClass(ClassLoader.java:1017) ~[?:?]
    at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:174) ~[?:?]
    at java.net.URLClassLoader.defineClass(URLClassLoader.java:550) ~[?:?]
    at java.net.URLClassLoader$1.run(URLClassLoader.java:458) ~[?:?]
    at java.net.URLClassLoader$1.run(URLClassLoader.java:452) ~[?:?]
    at java.security.AccessController.doPrivileged(Native Method) ~[?:?]
    at java.net.URLClassLoader.findClass(URLClassLoader.java:451) ~[?:?]
    at org.apache.flink.util.ChildFirstClassLoader.loadClassWithoutExceptionHandling(ChildFirstClassLoader.java:71) ~[flink-dist_2.12-1.11.0.jar:1.11.0]
    at org.apache.flink.util.FlinkUserCodeClassLoader.loadClass(FlinkUserCodeClassLoader.java:48) [flink-dist_2.12-1.11.0.jar:1.11.0]
    at java.lang.ClassLoader.loadClass(ClassLoader.java:522) [?:?]
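
The one row in the histograms above that relates directly to class loading is java.lang.Class, since each loaded class has one such instance on the heap. A small helper along the following lines could track that row across runs. It is only a sketch: the class name HistoClassCount and the method instancesOf are hypothetical, the rows are copied by hand from the jmap output above, and the parser assumes the column layout shown there (including the dot used as thousands separator in the later runs).

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class HistoClassCount {

    /**
     * Parses one "jmap -histo" row such as
     * "5: 7146 844984 java.lang.Class (java.base@11.0.9.1)".
     * Returns the instance count if the row belongs to className, otherwise -1.
     */
    public static long instancesOf(String row, String className) {
        String[] fields = row.trim().split("\\s+");
        if (fields.length < 4 || !fields[3].equals(className)) {
            return -1;
        }
        // Later runs were printed with '.' as thousands separator; strip it.
        return Long.parseLong(fields[1].replace(".", ""));
    }

    public static void main(String[] args) {
        // Rows copied from the histograms in this email (run 8 has no
        // java.lang.Class row in its top five, so it is left out).
        Map<String, String> rows = new LinkedHashMap<>();
        rows.put("run 0", "5: 7146 844984 java.lang.Class (java.base@11.0.9.1)");
        rows.put("run 1", "3: 15.814 1.887.256 java.lang.Class (java.base@11.0.9.1)");
        rows.put("run 6", "3: 29.090 3.496.168 java.lang.Class (java.base@11.0.9.1)");

        long previous = -1;
        for (Map.Entry<String, String> entry : rows.entrySet()) {
            long count = instancesOf(entry.getValue(), "java.lang.Class");
            String delta = previous < 0 ? "" : " (+" + (count - previous) + ")";
            System.out.println(entry.getKey() + ": " + count
                    + " java.lang.Class instances" + delta);
            previous = count;
        }
    }
}
```

With the figures above, the count roughly doubles from run 0 to run 1 and nearly doubles again by run 6, which would be consistent with classes from earlier submissions not being unloaded, although the histogram alone does not show which class loader is holding them.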