Thank you Kye for your insights. In my mind, if the job runs without problems one or more times, the heap size, and thus the metaspace size, is big enough and I should not increase it (on the same data, of course). So I'll try to understand who is leaking what. The advice to avoid dynamic class loading is just a workaround to me: there's something wrong going on, and tomorrow I'll try to find the root cause of the problem using -XX:NativeMemoryTracking=summary as you suggested.
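For reference, this is roughly how I plan to set it up on the standalone cluster (a sketch; the config key env.java.opts.taskmanager and the placeholder pid are my assumptions, while the jcmd subcommands are standard HotSpot NMT):

    # flink-conf.yaml: pass the NMT flag to the TaskManager JVM
    env.java.opts.taskmanager: "-XX:NativeMemoryTracking=summary"

    # after restarting the cluster, take a baseline, then diff it
    # across job submissions to see whether Metaspace keeps growing
    jcmd <taskmanager-pid> VM.native_memory baseline
    # ... submit the job a few times ...
    jcmd <taskmanager-pid> VM.native_memory summary.diff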
I'll keep you up to date with my findings.

Best,
Flavio

On Mon, Nov 16, 2020 at 8:22 PM Kye Bae <kye....@capitalone.com> wrote:

Hello!

The JVM metaspace is where all the classes (not class instances or objects) get loaded. jmap -histo is going to show you the heap space usage info, not the metaspace.

You can inspect what is happening in the metaspace by using jcmd (e.g., jcmd JPID VM.native_memory summary) after restarting the cluster with "-XX:NativeMemoryTracking=summary".

As the error message suggests, you may need to increase taskmanager.memory.jvm-metaspace.size, but you need to be slightly careful when specifying the memory parameters in flink-conf.yaml in Flink 1.10 (they have an issue with a confusing error message).

Another thing to keep in mind is that you may want to avoid using dynamic classloading (https://ci.apache.org/projects/flink/flink-docs-release-1.10/monitoring/debugging_classloading.html#avoiding-dynamic-classloading-for-user-code): when the job continuously fails for some temporary issue, it will load the same class files into the metaspace multiple times and could exceed whatever limit you set.

-K

On Mon, Nov 16, 2020 at 12:39 PM Jan Lukavský <je...@seznam.cz> wrote:

The exclusions should not have any impact on that, because what defines which classloader will load which class is not the presence of a particular class in a specific jar, but the configuration of parent-first-patterns [1].

If you don't use any Flink-internal imports, then it still might be the case that a class in one of the packages defined by the parent-first patterns holds a reference to your user-code classes, which would cause the leak. I'd recommend inspecting the heap dump after several restarts of the application and looking for references to Class objects from the root set.

Jan

[1] https://ci.apache.org/projects/flink/flink-docs-stable/ops/config.html#class-loading

On 11/16/20 5:34 PM, Flavio Pompermaier wrote:

I've tried to remove all possible imports of classes not contained in the fat jar, but I still face the same problem. I've also tried to reduce as much as possible the excludes in the shade section of the Maven plugin (I took the one at [1]), so now I exclude only a few dependencies. Could it be that I should include org.slf4j:* if I use a static import of it?

    <artifactSet>
      <excludes>
        <exclude>com.google.code.findbugs:jsr305</exclude>
        <exclude>org.slf4j:*</exclude>
        <exclude>log4j:*</exclude>
      </excludes>
    </artifactSet>

[1] https://ci.apache.org/projects/flink/flink-docs-master/dev/project-configuration.html#appendix-template-for-building-a-jar-with-dependencies

On Mon, Nov 16, 2020 at 3:29 PM Jan Lukavský <je...@seznam.cz> wrote:

Yes, that could definitely cause this. You should probably avoid using these Flink-internal shaded classes and ship your own versions (not shaded).
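For example, a minimal sketch of what that could look like in the job's pom.xml (the version is only an example), bundling plain Jackson into the fat jar and importing com.fasterxml.jackson.databind.ObjectMapper instead of the flink-shaded one:

    <dependency>
      <groupId>com.fasterxml.jackson.core</groupId>
      <artifactId>jackson-databind</artifactId>
      <version>2.11.3</version>
    </dependency>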
Best,

Jan

On 11/16/20 3:22 PM, Flavio Pompermaier wrote:

Thank you Jan for your valuable feedback.
Could it be that I should not import shaded Jackson classes in my user code? For example, import org.apache.flink.shaded.jackson2.com.fasterxml.jackson.databind.ObjectMapper?

Best,
Flavio

On Mon, Nov 16, 2020 at 3:15 PM Jan Lukavský <je...@seznam.cz> wrote:

Hi Flavio,

when I encountered a problem quite similar to the one you describe, it was related to static storage located in a class that was loaded "parent-first". In my case it was java.lang.ClassValue, but it might (and probably will) be different in your case. The problem is that if user code registers something in some (static) storage located in a class loaded with the parent (TaskManager) classloader, then its associated classes will never be GC'd and the Metaspace will grow. A good starting point would be not to focus on the biggest consumers of heap (in general), but to look at where the 15k objects of type Class are referenced from. That might help you figure this out. I'm not sure if there is something that can be done in general to prevent this type of leak. That would probably be a question for the dev@ mailing list.
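To illustrate the pattern, a minimal sketch (the class names are made up, and nothing here is Flink API) of how such a leak arises:

    // Imagine this class sits on the TaskManager's system classpath,
    // so it is loaded once by the parent classloader and never unloaded.
    public final class GlobalRegistry {
        // Static storage in a parent-loaded class: entries outlive jobs.
        private static final java.util.Map<String, Object> CACHE =
                new java.util.concurrent.ConcurrentHashMap<>();

        public static void register(String key, Object value) {
            CACHE.put(key, value);
        }
    }

    // Imagine this class is part of the user jar, loaded by Flink's
    // ChildFirstClassLoader on every job submission.
    class UserFunction {
        void open() {
            // The registered instance references UserFunction.class, which
            // references its classloader, which references every class it
            // loaded -- so the job's whole Metaspace footprint stays pinned
            // for as long as the entry is never removed (it never is here).
            GlobalRegistry.register("user-function", this);
        }
    }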
Best,

Jan

On 11/16/20 2:27 PM, Flavio Pompermaier wrote:

Hello everybody,
I was writing this email when a similar thread appeared on this mailing list. The difference is that the other problem seems to be related to Flink 1.10 on YARN and does not output anything helpful for debugging the cause of the problem.

Indeed, in my use case I use Flink 1.11.0 on a standalone session cluster (the job is submitted to the cluster using the CLI client). The problem arises when I submit the same job about 20 times (this number unfortunately is not deterministic and can change a little bit). The error reported by the Task Executor is related to the ever-growing Metaspace; the error seems to be pretty detailed [1].

I found the same issue in some previous threads on this mailing list and I've tried to figure out the cause of the problem. The issue is that, looking at the objects allocated, I don't really get an idea of the source of the problem, because the types of objects that are consuming the memory are general-purpose ones (i.e. Bytes, Integers and Strings). These are my "top" memory consumers according to the output of jmap -histo <PID>:

At run 0:

     num     #instances         #bytes  class name (module)
    -------------------------------------------------------
       1:         46238       13224056  [B (java.base@11.0.9.1)
       2:          3736        6536672  [I (java.base@11.0.9.1)
       3:         38081         913944  java.lang.String (java.base@11.0.9.1)
       4:            26         852384  [Lakka.dispatch.forkjoin.ForkJoinTask;
       5:          7146         844984  java.lang.Class (java.base@11.0.9.1)

At run 1:

       1:        77.608     25.317.496  [B (java.base@11.0.9.1)
       2:         7.004      9.088.360  [I (java.base@11.0.9.1)
       3:        15.814      1.887.256  java.lang.Class (java.base@11.0.9.1)
       4:        67.381      1.617.144  java.lang.String (java.base@11.0.9.1)
       5:         3.906      1.422.960  [Ljava.util.HashMap$Node; (java.base@11.0.9.1)

At run 6:

       1:        81.408     25.375.400  [B (java.base@11.0.9.1)
       2:        12.479      7.249.392  [I (java.base@11.0.9.1)
       3:        29.090      3.496.168  java.lang.Class (java.base@11.0.9.1)
       4:         4.347      2.813.416  [Ljava.util.HashMap$Node; (java.base@11.0.9.1)
       5:        71.584      1.718.016  java.lang.String (java.base@11.0.9.1)

At run 8:

       1:       985.979    127.193.256  [B (java.base@11.0.9.1)
       2:        35.400     13.702.112  [I (java.base@11.0.9.1)
       3:       260.387      6.249.288  java.lang.String (java.base@11.0.9.1)
       4:       148.836      5.953.440  java.util.HashMap$KeyIterator (java.base@11.0.9.1)
       5:        17.641      5.222.344  [Ljava.util.HashMap$Node; (java.base@11.0.9.1)

Thanks in advance for any help,
Flavio

[1]
--------------------------------------------------------------------------------------------------
java.lang.OutOfMemoryError: Metaspace. The metaspace out-of-memory error has occurred. This can mean two things: either the job requires a larger size of JVM metaspace to load classes or there is a class loading leak. In the first case 'taskmanager.memory.jvm-metaspace.size' configuration option should be increased. If the error persists (usually in cluster after several job (re-)submissions) then there is probably a class loading leak in user code or some of its dependencies which has to be investigated and fixed. The task executor has to be shutdown...
        at java.lang.ClassLoader.defineClass1(Native Method) ~[?:?]
        at java.lang.ClassLoader.defineClass(ClassLoader.java:1017) ~[?:?]
        at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:174) ~[?:?]
        at java.net.URLClassLoader.defineClass(URLClassLoader.java:550) ~[?:?]
        at java.net.URLClassLoader$1.run(URLClassLoader.java:458) ~[?:?]
        at java.net.URLClassLoader$1.run(URLClassLoader.java:452) ~[?:?]
        at java.security.AccessController.doPrivileged(Native Method) ~[?:?]
        at java.net.URLClassLoader.findClass(URLClassLoader.java:451) ~[?:?]
        at org.apache.flink.util.ChildFirstClassLoader.loadClassWithoutExceptionHandling(ChildFirstClassLoader.java:71) ~[flink-dist_2.12-1.11.0.jar:1.11.0]
        at org.apache.flink.util.FlinkUserCodeClassLoader.loadClass(FlinkUserCodeClassLoader.java:48) [flink-dist_2.12-1.11.0.jar:1.11.0]
        at java.lang.ClassLoader.loadClass(ClassLoader.java:522) [?:?]
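For completeness, these are the commands I'm using to collect the histograms and a full heap dump between submissions (a sketch; the pid and dump path are placeholders):

    # histogram of heap objects after each job submission
    jmap -histo <taskmanager-pid> | head -n 20

    # the same information via jcmd
    jcmd <taskmanager-pid> GC.class_histogram | head -n 20

    # full heap dump for inspecting what references the Class objects
    # (e.g., "Path to GC Roots" on the duplicated classes in Eclipse MAT)
    jmap -dump:live,format=b,file=/tmp/taskmanager.hprof <taskmanager-pid>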