Hello everybody,
I was writing this email when a similar thread appeared on this mailing
list. The difference is that the other problem seems to be related to Flink
1.10 on YARN and does not output anything helpful for debugging the cause of
the problem.

In my use case I run Flink 1.11.0 on a standalone session cluster (the job
is submitted to the cluster using the CLI client).
The problem arises after I submit the same job about 20 times (this number
unfortunately is not deterministic and varies a little). The error reported
by the Task Executor is related to the ever-growing Metaspace, and the error
message is pretty detailed [1].
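
One way to watch the Metaspace grow from one submission to the next would be
to sample `jstat -gc <PID>` on the Task Executor after each run and keep only
the MU (Metaspace used, in KB) column. The following is a minimal sketch, not
something I have run against this cluster: the function names are made up for
illustration, and it assumes the usual two-line `jstat -gc` layout (one header
row, one value row).

```python
from typing import Dict


def parse_jstat_gc(output: str) -> Dict[str, float]:
    """Pair the header row of `jstat -gc` output with the value row below it."""
    lines = [ln for ln in output.splitlines() if ln.strip()]
    header, values = lines[0].split(), lines[1].split()
    parsed = {}
    for name, raw in zip(header, values):
        try:
            # Some locales print decimals with a comma; normalise first.
            parsed[name] = float(raw.replace(",", "."))
        except ValueError:
            # jstat prints "-" for columns that do not apply; skip those.
            continue
    return parsed


def metaspace_used_kb(output: str) -> float:
    """Return the MU column (Metaspace used, in KB) from `jstat -gc` output."""
    return parse_jstat_gc(output)["MU"]
```

Calling `metaspace_used_kb` on the output captured after each submission
gives one number per run, which makes a steady climb easy to spot.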

I found the same issue in some previous threads on this mailing list and
I've tried to figure out the cause of the problem. The issue is that,
looking at the allocated objects, I don't really get an idea of the source
of the problem, because the objects consuming the memory are of
general-purpose types (i.e. byte arrays, int arrays and Strings). These are
my "top" memory consumers according to the output of jmap -histo <PID>:

At run 0:

 num     #instances         #bytes  class name (module)
-------------------------------------------------------
   1:         46238       13224056  [B ([email protected])
   2:          3736        6536672  [I ([email protected])
   3:         38081         913944  java.lang.String ([email protected])
   4:            26         852384  [Lakka.dispatch.forkjoin.ForkJoinTask;
   5:          7146         844984  java.lang.Class ([email protected])

At run 1:

   1:         77.608       25.317.496  [B ([email protected])
   2:          7.004        9.088.360  [I ([email protected])
   3:         15.814        1.887.256  java.lang.Class ([email protected])
   4:         67.381        1.617.144  java.lang.String ([email protected])
   5:          3.906        1.422.960  [Ljava.util.HashMap$Node; ([email protected])

At run 6:

   1:         81.408       25.375.400  [B ([email protected])
   2:         12.479        7.249.392  [I ([email protected])
   3:         29.090        3.496.168  java.lang.Class ([email protected])
   4:          4.347        2.813.416  [Ljava.util.HashMap$Node; ([email protected])
   5:         71.584        1.718.016  java.lang.String ([email protected])

At run 8:

   1:        985.979      127.193.256  [B ([email protected])
   2:         35.400       13.702.112  [I ([email protected])
   3:        260.387        6.249.288  java.lang.String ([email protected])
   4:        148.836        5.953.440  java.util.HashMap$KeyIterator ([email protected])
   5:         17.641        5.222.344  [Ljava.util.HashMap$Node; ([email protected])
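
To compare the histograms above between runs, a small script can compute the
per-class growth instead of eyeballing the top entries. This is only a
sketch: `parse_histo` and `growth` are names invented here, and it assumes
each row fits on one line and that "." or "," in the numeric columns are
thousands separators (as in the outputs from run 1 onwards).

```python
import re
from typing import Dict, List, Tuple

# Matches rows such as "   3:   15.814   1.887.256  java.lang.Class (module)".
ROW = re.compile(r"^\s*\d+:\s+([\d.,]+)\s+([\d.,]+)\s+(\S+)")


def _to_int(field: str) -> int:
    """Drop thousands separators ('.' or ',') and convert to int."""
    return int(field.replace(".", "").replace(",", ""))


def parse_histo(text: str) -> Dict[str, Tuple[int, int]]:
    """Map class name -> (instances, bytes) for each row of a jmap histogram."""
    rows = {}
    for line in text.splitlines():
        match = ROW.match(line)
        if match:
            rows[match.group(3)] = (_to_int(match.group(1)), _to_int(match.group(2)))
    return rows


def growth(
    before: Dict[str, Tuple[int, int]], after: Dict[str, Tuple[int, int]]
) -> List[Tuple[str, int, int]]:
    """Return (class, instance delta, byte delta), largest byte growth first."""
    deltas = []
    for name, (instances, size) in after.items():
        old_instances, old_size = before.get(name, (0, 0))
        deltas.append((name, instances - old_instances, size - old_size))
    return sorted(deltas, key=lambda d: d[2], reverse=True)
```

Feeding it the full `jmap -histo` output of two runs (not just the top five
rows) should show which classes keep accumulating, including the
`java.lang.Class` count that is relevant for a class loading leak.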

Thanks in advance for any help,
Flavio

[1]
--------------------------------------------------------------------------------------------------
java.lang.OutOfMemoryError: Metaspace. The metaspace out-of-memory error
has occurred. This can mean two things: either the job requires a larger
size of JVM metaspace to load classes or there is a class loading leak. In
the first case 'taskmanager.memory.jvm-metaspace.size' configuration option
should be increased. If the error persists (usually in cluster after
several job (re-)submissions) then there is probably a class loading leak
in user code or some of its dependencies which has to be investigated and
fixed. The task executor has to be shutdown...
        at java.lang.ClassLoader.defineClass1(Native Method) ~[?:?]
        at java.lang.ClassLoader.defineClass(ClassLoader.java:1017) ~[?:?]
        at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:174) ~[?:?]
        at java.net.URLClassLoader.defineClass(URLClassLoader.java:550) ~[?:?]
        at java.net.URLClassLoader$1.run(URLClassLoader.java:458) ~[?:?]
        at java.net.URLClassLoader$1.run(URLClassLoader.java:452) ~[?:?]
        at java.security.AccessController.doPrivileged(Native Method) ~[?:?]
        at java.net.URLClassLoader.findClass(URLClassLoader.java:451) ~[?:?]
        at org.apache.flink.util.ChildFirstClassLoader.loadClassWithoutExceptionHandling(ChildFirstClassLoader.java:71) ~[flink-dist_2.12-1.11.0.jar:1.11.0]
        at org.apache.flink.util.FlinkUserCodeClassLoader.loadClass(FlinkUserCodeClassLoader.java:48) [flink-dist_2.12-1.11.0.jar:1.11.0]
        at java.lang.ClassLoader.loadClass(ClassLoader.java:522) [?:?]
