Thank you Kye for your insights. In my mind, if the job runs without problems one or more times, the heap size, and thus the metaspace size, is big enough and I should not increase it (on the same data, of course). So I'll try to understand who is leaking what. The advice to avoid dynamic class loading is just a workaround to me: there's something wrong going on, and tomorrow I'll try to find the root cause of the problem using -XX:NativeMemoryTracking=summary as you suggested.
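For reference, this is roughly how I plan to set it up on the standalone cluster (a sketch; the config key env.java.opts.taskmanager and the placeholder pid are my assumptions, while the jcmd subcommands are standard HotSpot NMT):

    # flink-conf.yaml: pass the NMT flag to the TaskManager JVM
    env.java.opts.taskmanager: "-XX:NativeMemoryTracking=summary"

    # after restarting the cluster, take a baseline, then diff it
    # across job submissions to see whether Metaspace keeps growing
    jcmd <taskmanager-pid> VM.native_memory baseline
    # ... submit the job a few times ...
    jcmd <taskmanager-pid> VM.native_memory summary.diff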
I'll keep you up to date with my findings.

Best,
Flavio

On Mon, Nov 16, 2020 at 8:22 PM Kye Bae <kye....@capitalone.com> wrote:

Hello!

The JVM metaspace is where all the classes (not class instances or objects) get loaded. jmap -histo is going to show you the heap space usage info, not the metaspace.

You can inspect what is happening in the metaspace by using jcmd (e.g., jcmd JPID VM.native_memory summary) after restarting the cluster with "-XX:NativeMemoryTracking=summary".

As the error message suggests, you may need to increase taskmanager.memory.jvm-metaspace.size, but you need to be slightly careful when specifying the memory parameters in flink-conf.yaml in Flink 1.10 (they have an issue with a confusing error message).

Another thing to keep in mind is that you may want to avoid using dynamic classloading (https://ci.apache.org/projects/flink/flink-docs-release-1.10/monitoring/debugging_classloading.html#avoiding-dynamic-classloading-for-user-code): when the job continuously fails for some temporary issue, it will load the same class files into the metaspace multiple times and could exceed whatever limit you set.

-K

On Mon, Nov 16, 2020 at 12:39 PM Jan Lukavský <je...@seznam.cz> wrote:

The exclusions should not have any impact on that, because what defines which classloader will load which class is not the presence of a particular class in a specific jar, but the configuration of parent-first-patterns [1].

If you don't use any Flink-internal imports, then it still might be the case that a class in one of the packages defined by the parent-first patterns holds a reference to your user-code classes, which would cause the leak. I'd recommend inspecting the heap dump after several restarts of the application and looking for references to Class objects from the root set.

Jan

[1] https://ci.apache.org/projects/flink/flink-docs-stable/ops/config.html#class-loading

On 11/16/20 5:34 PM, Flavio Pompermaier wrote:

I've tried to remove all possible imports of classes not contained in the fat jar, but I still face the same problem. I've also tried to reduce as much as possible the excludes in the shade section of the Maven plugin (I took the one at [1]), so now I exclude only a few dependencies. Could it be that I should include org.slf4j:* if I use a static import of it?

    <artifactSet>
      <excludes>
        <exclude>com.google.code.findbugs:jsr305</exclude>
        <exclude>org.slf4j:*</exclude>
        <exclude>log4j:*</exclude>
      </excludes>
    </artifactSet>

[1] https://ci.apache.org/projects/flink/flink-docs-master/dev/project-configuration.html#appendix-template-for-building-a-jar-with-dependencies

On Mon, Nov 16, 2020 at 3:29 PM Jan Lukavský <je...@seznam.cz> wrote:

Yes, that could definitely cause this. You should probably avoid using these Flink-internal shaded classes and ship your own versions (not shaded).
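For example, a minimal sketch of what that could look like in the job's pom.xml (the version is only an example), bundling plain Jackson into the fat jar and importing com.fasterxml.jackson.databind.ObjectMapper instead of the flink-shaded one:

    <dependency>
      <groupId>com.fasterxml.jackson.core</groupId>
      <artifactId>jackson-databind</artifactId>
      <version>2.11.3</version>
    </dependency>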
Best,

Jan

On 11/16/20 3:22 PM, Flavio Pompermaier wrote:

Thank you Jan for your valuable feedback.
Could it be that I should not import shaded Jackson classes in my user code? For example, import org.apache.flink.shaded.jackson2.com.fasterxml.jackson.databind.ObjectMapper?

Best,
Flavio

On Mon, Nov 16, 2020 at 3:15 PM Jan Lukavský <je...@seznam.cz> wrote:

Hi Flavio,

when I encountered a problem quite similar to the one you describe, it was related to static storage located in a class that was loaded "parent-first". In my case it was java.lang.ClassValue, but it might (and probably will) be different in your case. The problem is that if user code registers something in some (static) storage located in a class loaded with the parent (TaskManager) classloader, then its associated classes will never be GC'd and the Metaspace will grow. A good starting point would be not to focus on the biggest consumers of heap (in general), but to look at where the 15k objects of type Class are referenced from. That might help you figure this out. I'm not sure if there is something that can be done in general to prevent this type of leak. That would probably be a question for the dev@ mailing list.
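To illustrate the pattern, a minimal sketch (the class names are made up, and nothing here is Flink API) of how such a leak arises:

    // Imagine this class sits on the TaskManager's system classpath,
    // so it is loaded once by the parent classloader and never unloaded.
    public final class GlobalRegistry {
        // Static storage in a parent-loaded class: entries outlive jobs.
        private static final java.util.Map<String, Object> CACHE =
                new java.util.concurrent.ConcurrentHashMap<>();

        public static void register(String key, Object value) {
            CACHE.put(key, value);
        }
    }

    // Imagine this class is part of the user jar, loaded by Flink's
    // ChildFirstClassLoader on every job submission.
    class UserFunction {
        void open() {
            // The registered instance references UserFunction.class, which
            // references its classloader, which references every class it
            // loaded -- so the job's whole Metaspace footprint stays pinned
            // for as long as the entry is never removed (it never is here).
            GlobalRegistry.register("user-function", this);
        }
    }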
Best,

Jan

On 11/16/20 2:27 PM, Flavio Pompermaier wrote:

Hello everybody,
I was writing this email when a similar thread appeared on this mailing list. The difference is that the other problem seems to be related to Flink 1.10 on YARN and does not output anything helpful for debugging the cause of the problem.

Indeed, in my use case I use Flink 1.11.0 on a standalone session cluster (the job is submitted to the cluster using the CLI client). The problem arises when I submit the same job about 20 times (this number unfortunately is not deterministic and can change a little bit). The error reported by the Task Executor is related to the ever-growing Metaspace; the error seems to be pretty detailed [1].

I found the same issue in some previous threads on this mailing list and I've tried to figure out the cause of the problem. The issue is that, looking at the objects allocated, I don't really get an idea of the source of the problem, because the types of objects that are consuming the memory are general-purpose ones (i.e. Bytes, Integers and Strings). These are my "top" memory consumers according to the output of jmap -histo <PID>:

At run 0:

     num     #instances         #bytes  class name (module)
    -------------------------------------------------------
       1:         46238       13224056  [B (java.base@11.0.9.1)
       2:          3736        6536672  [I (java.base@11.0.9.1)
       3:         38081         913944  java.lang.String (java.base@11.0.9.1)
       4:            26         852384  [Lakka.dispatch.forkjoin.ForkJoinTask;
       5:          7146         844984  java.lang.Class (java.base@11.0.9.1)

At run 1:

       1:        77.608     25.317.496  [B (java.base@11.0.9.1)
       2:         7.004      9.088.360  [I (java.base@11.0.9.1)
       3:        15.814      1.887.256  java.lang.Class (java.base@11.0.9.1)
       4:        67.381      1.617.144  java.lang.String (java.base@11.0.9.1)
       5:         3.906      1.422.960  [Ljava.util.HashMap$Node; (java.base@11.0.9.1)

At run 6:

       1:        81.408     25.375.400  [B (java.base@11.0.9.1)
       2:        12.479      7.249.392  [I (java.base@11.0.9.1)
       3:        29.090      3.496.168  java.lang.Class (java.base@11.0.9.1)
       4:         4.347      2.813.416  [Ljava.util.HashMap$Node; (java.base@11.0.9.1)
       5:        71.584      1.718.016  java.lang.String (java.base@11.0.9.1)

At run 8:

       1:       985.979    127.193.256  [B (java.base@11.0.9.1)
       2:        35.400     13.702.112  [I (java.base@11.0.9.1)
       3:       260.387      6.249.288  java.lang.String (java.base@11.0.9.1)
       4:       148.836      5.953.440  java.util.HashMap$KeyIterator (java.base@11.0.9.1)
       5:        17.641      5.222.344  [Ljava.util.HashMap$Node; (java.base@11.0.9.1)

Thanks in advance for any help,
Flavio

[1]
--------------------------------------------------------------------------------------------------
java.lang.OutOfMemoryError: Metaspace. The metaspace out-of-memory error has occurred. This can mean two things: either the job requires a larger size of JVM metaspace to load classes or there is a class loading leak. In the first case 'taskmanager.memory.jvm-metaspace.size' configuration option should be increased. If the error persists (usually in cluster after several job (re-)submissions) then there is probably a class loading leak in user code or some of its dependencies which has to be investigated and fixed. The task executor has to be shutdown...
        at java.lang.ClassLoader.defineClass1(Native Method) ~[?:?]
        at java.lang.ClassLoader.defineClass(ClassLoader.java:1017) ~[?:?]
        at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:174) ~[?:?]
        at java.net.URLClassLoader.defineClass(URLClassLoader.java:550) ~[?:?]
        at java.net.URLClassLoader$1.run(URLClassLoader.java:458) ~[?:?]
        at java.net.URLClassLoader$1.run(URLClassLoader.java:452) ~[?:?]
        at java.security.AccessController.doPrivileged(Native Method) ~[?:?]
        at java.net.URLClassLoader.findClass(URLClassLoader.java:451) ~[?:?]
        at org.apache.flink.util.ChildFirstClassLoader.loadClassWithoutExceptionHandling(ChildFirstClassLoader.java:71) ~[flink-dist_2.12-1.11.0.jar:1.11.0]
        at org.apache.flink.util.FlinkUserCodeClassLoader.loadClass(FlinkUserCodeClassLoader.java:48) [flink-dist_2.12-1.11.0.jar:1.11.0]
        at java.lang.ClassLoader.loadClass(ClassLoader.java:522) [?:?]
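For completeness, these are the commands I'm using to collect the histograms and a full heap dump between submissions (a sketch; the pid and dump path are placeholders):

    # histogram of heap objects after each job submission
    jmap -histo <taskmanager-pid> | head -n 20

    # the same information via jcmd
    jcmd <taskmanager-pid> GC.class_histogram | head -n 20

    # full heap dump for inspecting what references the Class objects
    # (e.g., "Path to GC Roots" on the duplicated classes in Eclipse MAT)
    jmap -dump:live,format=b,file=/tmp/taskmanager.hprof <taskmanager-pid>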