Hello! The JVM metaspace is where the classes themselves (not class instances or objects) get loaded. jmap -histo shows heap usage, not metaspace usage.

You can inspect what is happening in the metaspace with jcmd (e.g., jcmd JPID VM.native_memory summary) after restarting the cluster with -XX:NativeMemoryTracking=summary.

As the error message suggests, you may need to increase taskmanager.memory.jvm-metaspace.size, but you need to be slightly careful when specifying the memory parameters in flink-conf.yaml in Flink 1.10 (there is a known issue with a confusing error message there).

Another thing to keep in mind is that you may want to avoid dynamic classloading (https://ci.apache.org/projects/flink/flink-docs-release-1.10/monitoring/debugging_classloading.html#avoiding-dynamic-classloading-for-user-code): when a job keeps failing for some temporary issue, the same class files are loaded into the metaspace over and over and can exceed whatever limit you set.
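Roughly, assuming a standalone cluster where you can edit flink-conf.yaml and reach the TaskManager JVM, a minimal sketch of the above (the 512m value, <tm-pid>, my-job.jar and FLINK_HOME are only placeholders):

    # flink-conf.yaml
    taskmanager.memory.jvm-metaspace.size: 512m
    env.java.opts.taskmanager: "-XX:NativeMemoryTracking=summary"

    # after restarting, on the TaskManager host
    jcmd <tm-pid> VM.native_memory summary | grep -A 2 Class

    # to avoid dynamic classloading, put the job jar on the cluster's classpath
    # (e.g. copy it into Flink's lib/ directory and restart) instead of shipping
    # it with each submission
    cp my-job.jar "$FLINK_HOME/lib/"

The "Class" section of the NMT summary should be the part that corresponds to the metaspace, so you can watch it grow across submissions.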
-K

On Mon, Nov 16, 2020 at 12:39 PM Jan Lukavský <je...@seznam.cz> wrote:

> The exclusions should not have any impact on that, because what defines
> which classloader will load which class is not the presence of a particular
> class in a specific jar, but the configuration of parent-first-patterns [1].
>
> If you don't use any flink-internal imports, then it still might be the
> case that a class in one of the packages defined by the parent-first-patterns
> holds a reference to your user-code classes, which would cause the leak.
> I'd recommend inspecting the heap dump after several restarts of the
> application and looking for references to Class objects from the root set.
>
> Jan
>
> [1]
> https://ci.apache.org/projects/flink/flink-docs-stable/ops/config.html#class-loading
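For reference, the class-loading knobs that [1] describes are plain flink-conf.yaml entries; a rough sketch (the extra pattern below is purely illustrative, not a recommendation):

    classloader.parent-first-patterns.additional: com.example.shared.
    # appended to the built-in defaults (java.;scala.;org.apache.flink.;...)

Classes whose fully qualified names match one of these prefixes are loaded by the parent (TaskManager) classloader rather than the per-job child-first classloader, so any static state they keep survives job restarts.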
> On 11/16/20 5:34 PM, Flavio Pompermaier wrote:
>
> I've tried to remove all possible imports of classes not contained in the
> fat jar but I still face the same problem.
> I've also tried to reduce as much as possible the excludes in the shade
> section of the maven plugin (I took the one at [1]), so now I exclude only
> a few dependencies..could it be that I should include org.slf4j:* if I use
> a static import of it?
>
> <artifactSet>
>   <excludes>
>     <exclude>com.google.code.findbugs:jsr305</exclude>
>     <exclude>org.slf4j:*</exclude>
>     <exclude>log4j:*</exclude>
>   </excludes>
> </artifactSet>
>
> [1]
> https://ci.apache.org/projects/flink/flink-docs-master/dev/project-configuration.html#appendix-template-for-building-a-jar-with-dependencies
>
> On Mon, Nov 16, 2020 at 3:29 PM Jan Lukavský <je...@seznam.cz> wrote:
>
>> Yes, that could definitely cause this. You should probably avoid using
>> these flink-internal shaded classes and ship your own versions (not shaded).
>>
>> Best,
>>
>> Jan
>>
>> On 11/16/20 3:22 PM, Flavio Pompermaier wrote:
>>
>> Thank you Jan for your valuable feedback.
>> Could it be that I should not import shaded-jackson classes in my
>> user code? For example import
>> org.apache.flink.shaded.jackson2.com.fasterxml.jackson.databind.ObjectMapper?
>>
>> Best,
>> Flavio
>>
>> On Mon, Nov 16, 2020 at 3:15 PM Jan Lukavský <je...@seznam.cz> wrote:
>>
>>> Hi Flavio,
>>>
>>> when I encountered a quite similar problem to the one you describe, it was
>>> related to static storage located in a class that was loaded
>>> "parent-first". In my case it was java.lang.ClassValue, but it might (and
>>> probably will) be different in your case. The problem is that if user code
>>> registers something in some (static) storage located in a class loaded with
>>> the parent (TaskManager) classloader, then its associated classes will never
>>> be GC'd and the Metaspace will grow. A good starting point would be not to
>>> focus on the biggest consumers of heap (in general), but to look at where
>>> the 15k objects of type Class are referenced from. That might help you
>>> figure this out. I'm not sure if there is something that can be done in
>>> general to prevent this type of leak. That would probably be a question
>>> for the dev@ mailing list.
>>>
>>> Best,
>>>
>>> Jan
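A rough sketch of how to collect what Jan suggests looking at (the pid and file name are placeholders; a heap analyzer such as Eclipse MAT can then show the paths from GC roots to the java.lang.Class instances):

    # class histogram, as pasted below
    jmap -histo <tm-pid> | head -n 20

    # full heap dump after several job resubmissions, for root-set analysis
    jmap -dump:live,format=b,file=taskmanager.hprof <tm-pid>

The live option triggers a GC before dumping, so any Class objects that remain are ones something is still holding on to.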
>>> On 11/16/20 2:27 PM, Flavio Pompermaier wrote:
>>>
>>> Hello everybody,
>>> I was writing this email when a similar thread on this mailing list
>>> appeared.. The difference is that the other problem seems to be related to
>>> Flink 1.10 on YARN and does not output anything helpful for debugging the
>>> cause of the problem.
>>>
>>> Indeed, in my use case I use Flink 1.11.0 on a standalone session cluster
>>> (the job is submitted to the cluster using the CLI client). The problem
>>> arises when I submit the same job about 20 times (this number unfortunately
>>> is not deterministic and can change a little bit). The error reported by the
>>> Task Executor is related to the ever-growing Metaspace..the error seems to
>>> be pretty detailed [1].
>>>
>>> I found the same issue in some previous threads on this mailing list and
>>> I've tried to figure out the cause of the problem. The issue is that,
>>> looking at the objects allocated, I don't really get an idea of the source
>>> of the problem because the types of objects that are consuming the memory
>>> are general purpose (i.e. Bytes, Integers and Strings)...these are my "top"
>>> memory consumers if looking at the output of jmap -histo <PID>:
>>>
>>> At run 0:
>>>
>>>  num     #instances         #bytes  class name (module)
>>> -------------------------------------------------------
>>>    1:         46238       13224056  [B (java.base@11.0.9.1)
>>>    2:          3736        6536672  [I (java.base@11.0.9.1)
>>>    3:         38081         913944  java.lang.String (java.base@11.0.9.1)
>>>    4:            26         852384  [Lakka.dispatch.forkjoin.ForkJoinTask;
>>>    5:          7146         844984  java.lang.Class (java.base@11.0.9.1)
>>>
>>> At run 1:
>>>
>>>    1:        77.608     25.317.496  [B (java.base@11.0.9.1)
>>>    2:         7.004      9.088.360  [I (java.base@11.0.9.1)
>>>    3:        15.814      1.887.256  java.lang.Class (java.base@11.0.9.1)
>>>    4:        67.381      1.617.144  java.lang.String (java.base@11.0.9.1)
>>>    5:         3.906      1.422.960  [Ljava.util.HashMap$Node; (java.base@11.0.9.1)
>>>
>>> At run 6:
>>>
>>>    1:        81.408     25.375.400  [B (java.base@11.0.9.1)
>>>    2:        12.479      7.249.392  [I (java.base@11.0.9.1)
>>>    3:        29.090      3.496.168  java.lang.Class (java.base@11.0.9.1)
>>>    4:         4.347      2.813.416  [Ljava.util.HashMap$Node; (java.base@11.0.9.1)
>>>    5:        71.584      1.718.016  java.lang.String (java.base@11.0.9.1)
>>>
>>> At run 8:
>>>
>>>    1:       985.979    127.193.256  [B (java.base@11.0.9.1)
>>>    2:        35.400     13.702.112  [I (java.base@11.0.9.1)
>>>    3:       260.387      6.249.288  java.lang.String (java.base@11.0.9.1)
>>>    4:       148.836      5.953.440  java.util.HashMap$KeyIterator (java.base@11.0.9.1)
>>>    5:        17.641      5.222.344  [Ljava.util.HashMap$Node; (java.base@11.0.9.1)
>>>
>>> Thanks in advance for any help,
>>> Flavio
>>>
>>> [1]
>>> --------------------------------------------------------------------------------------------------
>>> java.lang.OutOfMemoryError: Metaspace. The metaspace out-of-memory error
>>> has occurred. This can mean two things: either the job requires a larger
>>> size of JVM metaspace to load classes or there is a class loading leak. In
>>> the first case 'taskmanager.memory.jvm-metaspace.size' configuration option
>>> should be increased. If the error persists (usually in cluster after
>>> several job (re-)submissions) then there is probably a class loading leak
>>> in user code or some of its dependencies which has to be investigated and
>>> fixed. The task executor has to be shutdown...
>>>     at java.lang.ClassLoader.defineClass1(Native Method) ~[?:?]
>>>     at java.lang.ClassLoader.defineClass(ClassLoader.java:1017) ~[?:?]
>>>     at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:174) ~[?:?]
>>>     at java.net.URLClassLoader.defineClass(URLClassLoader.java:550) ~[?:?]
>>>     at java.net.URLClassLoader$1.run(URLClassLoader.java:458) ~[?:?]
>>>     at java.net.URLClassLoader$1.run(URLClassLoader.java:452) ~[?:?]
>>>     at java.security.AccessController.doPrivileged(Native Method) ~[?:?]
>>>     at java.net.URLClassLoader.findClass(URLClassLoader.java:451) ~[?:?]
>>>     at org.apache.flink.util.ChildFirstClassLoader.loadClassWithoutExceptionHandling(ChildFirstClassLoader.java:71) ~[flink-dist_2.12-1.11.0.jar:1.11.0]
>>>     at org.apache.flink.util.FlinkUserCodeClassLoader.loadClass(FlinkUserCodeClassLoader.java:48) [flink-dist_2.12-1.11.0.jar:1.11.0]
>>>     at java.lang.ClassLoader.loadClass(ClassLoader.java:522) [?:?]