Hello! The JVM metaspace is where the classes themselves (not class instances or objects) get loaded. jmap -histo shows heap usage, not metaspace usage.

You can inspect what is happening in the metaspace with jcmd (e.g., jcmd JPID VM.native_memory summary) after restarting the cluster with -XX:NativeMemoryTracking=summary.

As the error message suggests, you may need to increase taskmanager.memory.jvm-metaspace.size, but you need to be slightly careful when specifying the memory parameters in flink-conf.yaml in Flink 1.10 (there is a known issue with a confusing error message there).

Another thing to keep in mind is that you may want to avoid dynamic classloading (https://ci.apache.org/projects/flink/flink-docs-release-1.10/monitoring/debugging_classloading.html#avoiding-dynamic-classloading-for-user-code): when a job keeps failing for some temporary issue, the same class files are loaded into the metaspace over and over and can exceed whatever limit you set.
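Roughly, assuming a standalone cluster where you can edit flink-conf.yaml and reach the TaskManager JVM, a minimal sketch of the above (the 512m value, <tm-pid>, my-job.jar and FLINK_HOME are only placeholders):

    # flink-conf.yaml
    taskmanager.memory.jvm-metaspace.size: 512m
    env.java.opts.taskmanager: "-XX:NativeMemoryTracking=summary"

    # after restarting, on the TaskManager host
    jcmd <tm-pid> VM.native_memory summary | grep -A 2 Class

    # to avoid dynamic classloading, put the job jar on the cluster's classpath
    # (e.g. copy it into Flink's lib/ directory and restart) instead of shipping
    # it with each submission
    cp my-job.jar "$FLINK_HOME/lib/"

The "Class" section of the NMT summary should be the part that corresponds to the metaspace, so you can watch it grow across submissions.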
-K

On Mon, Nov 16, 2020 at 12:39 PM Jan Lukavský <je...@seznam.cz> wrote:

> The exclusions should not have any impact on that, because what defines
> which classloader will load which class is not the presence of a particular
> class in a specific jar, but the configuration of parent-first-patterns [1].
>
> If you don't use any flink-internal imports, then it still might be the
> case that a class in one of the packages defined by the parent-first-patterns
> holds a reference to your user-code classes, which would cause the leak.
> I'd recommend inspecting the heap dump after several restarts of the
> application and looking for references to Class objects from the root set.
>
> Jan
>
> [1]
> https://ci.apache.org/projects/flink/flink-docs-stable/ops/config.html#class-loading
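For reference, the class-loading knobs that [1] describes are plain flink-conf.yaml entries; a rough sketch (the extra pattern below is purely illustrative, not a recommendation):

    classloader.parent-first-patterns.additional: com.example.shared.
    # appended to the built-in defaults (java.;scala.;org.apache.flink.;...)

Classes whose fully qualified names match one of these prefixes are loaded by the parent (TaskManager) classloader rather than the per-job child-first classloader, so any static state they keep survives job restarts.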
> On 11/16/20 5:34 PM, Flavio Pompermaier wrote:
>
> I've tried to remove all possible imports of classes not contained in the
> fat jar but I still face the same problem.
> I've also tried to reduce as much as possible the excludes in the shade
> section of the maven plugin (I took the one at [1]), so now I exclude only
> a few dependencies..could it be that I should include org.slf4j:* if I use
> a static import of it?
>
> <artifactSet>
>   <excludes>
>     <exclude>com.google.code.findbugs:jsr305</exclude>
>     <exclude>org.slf4j:*</exclude>
>     <exclude>log4j:*</exclude>
>   </excludes>
> </artifactSet>
>
> [1]
> https://ci.apache.org/projects/flink/flink-docs-master/dev/project-configuration.html#appendix-template-for-building-a-jar-with-dependencies
>
> On Mon, Nov 16, 2020 at 3:29 PM Jan Lukavský <je...@seznam.cz> wrote:
>
>> Yes, that could definitely cause this. You should probably avoid using
>> these flink-internal shaded classes and ship your own versions (not shaded).
>>
>> Best,
>>
>> Jan
>>
>> On 11/16/20 3:22 PM, Flavio Pompermaier wrote:
>>
>> Thank you Jan for your valuable feedback.
>> Could it be that I should not import shaded-jackson classes in my
>> user code? For example import
>> org.apache.flink.shaded.jackson2.com.fasterxml.jackson.databind.ObjectMapper?
>>
>> Best,
>> Flavio
>>
>> On Mon, Nov 16, 2020 at 3:15 PM Jan Lukavský <je...@seznam.cz> wrote:
>>
>>> Hi Flavio,
>>>
>>> when I encountered a quite similar problem to the one you describe, it was
>>> related to static storage located in a class that was loaded
>>> "parent-first". In my case it was java.lang.ClassValue, but it might (and
>>> probably will) be different in your case. The problem is that if user code
>>> registers something in some (static) storage located in a class loaded with
>>> the parent (TaskManager) classloader, then its associated classes will never
>>> be GC'd and the Metaspace will grow. A good starting point would be not to
>>> focus on the biggest consumers of heap (in general), but to look at where
>>> the 15k objects of type Class are referenced from. That might help you
>>> figure this out. I'm not sure if there is something that can be done in
>>> general to prevent this type of leak. That would probably be a question
>>> for the dev@ mailing list.
>>>
>>> Best,
>>>
>>> Jan
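A rough sketch of how to collect what Jan suggests looking at (the pid and file name are placeholders; a heap analyzer such as Eclipse MAT can then show the paths from GC roots to the java.lang.Class instances):

    # class histogram, as pasted below
    jmap -histo <tm-pid> | head -n 20

    # full heap dump after several job resubmissions, for root-set analysis
    jmap -dump:live,format=b,file=taskmanager.hprof <tm-pid>

The live option triggers a GC before dumping, so any Class objects that remain are ones something is still holding on to.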
>>> On 11/16/20 2:27 PM, Flavio Pompermaier wrote:
>>>
>>> Hello everybody,
>>> I was writing this email when a similar thread on this mailing list
>>> appeared.. The difference is that the other problem seems to be related to
>>> Flink 1.10 on YARN and does not output anything helpful for debugging the
>>> cause of the problem.
>>>
>>> Indeed, in my use case I use Flink 1.11.0 on a standalone session cluster
>>> (the job is submitted to the cluster using the CLI client). The problem
>>> arises when I submit the same job about 20 times (this number unfortunately
>>> is not deterministic and can change a little bit). The error reported by the
>>> Task Executor is related to the ever-growing Metaspace..the error seems to
>>> be pretty detailed [1].
>>>
>>> I found the same issue in some previous threads on this mailing list and
>>> I've tried to figure out the cause of the problem. The issue is that,
>>> looking at the objects allocated, I don't really get an idea of the source
>>> of the problem because the types of objects that are consuming the memory
>>> are general purpose (i.e. Bytes, Integers and Strings)...these are my "top"
>>> memory consumers if looking at the output of jmap -histo <PID>:
>>>
>>> At run 0:
>>>
>>>  num     #instances         #bytes  class name (module)
>>> -------------------------------------------------------
>>>    1:         46238       13224056  [B (java.base@11.0.9.1)
>>>    2:          3736        6536672  [I (java.base@11.0.9.1)
>>>    3:         38081         913944  java.lang.String (java.base@11.0.9.1)
>>>    4:            26         852384  [Lakka.dispatch.forkjoin.ForkJoinTask;
>>>    5:          7146         844984  java.lang.Class (java.base@11.0.9.1)
>>>
>>> At run 1:
>>>
>>>    1:        77.608     25.317.496  [B (java.base@11.0.9.1)
>>>    2:         7.004      9.088.360  [I (java.base@11.0.9.1)
>>>    3:        15.814      1.887.256  java.lang.Class (java.base@11.0.9.1)
>>>    4:        67.381      1.617.144  java.lang.String (java.base@11.0.9.1)
>>>    5:         3.906      1.422.960  [Ljava.util.HashMap$Node; (java.base@11.0.9.1)
>>>
>>> At run 6:
>>>
>>>    1:        81.408     25.375.400  [B (java.base@11.0.9.1)
>>>    2:        12.479      7.249.392  [I (java.base@11.0.9.1)
>>>    3:        29.090      3.496.168  java.lang.Class (java.base@11.0.9.1)
>>>    4:         4.347      2.813.416  [Ljava.util.HashMap$Node; (java.base@11.0.9.1)
>>>    5:        71.584      1.718.016  java.lang.String (java.base@11.0.9.1)
>>>
>>> At run 8:
>>>
>>>    1:       985.979    127.193.256  [B (java.base@11.0.9.1)
>>>    2:        35.400     13.702.112  [I (java.base@11.0.9.1)
>>>    3:       260.387      6.249.288  java.lang.String (java.base@11.0.9.1)
>>>    4:       148.836      5.953.440  java.util.HashMap$KeyIterator (java.base@11.0.9.1)
>>>    5:        17.641      5.222.344  [Ljava.util.HashMap$Node; (java.base@11.0.9.1)
>>>
>>> Thanks in advance for any help,
>>> Flavio
>>>
>>> [1]
>>> --------------------------------------------------------------------------------------------------
>>> java.lang.OutOfMemoryError: Metaspace. The metaspace out-of-memory error
>>> has occurred. This can mean two things: either the job requires a larger
>>> size of JVM metaspace to load classes or there is a class loading leak. In
>>> the first case 'taskmanager.memory.jvm-metaspace.size' configuration option
>>> should be increased. If the error persists (usually in cluster after
>>> several job (re-)submissions) then there is probably a class loading leak
>>> in user code or some of its dependencies which has to be investigated and
>>> fixed. The task executor has to be shutdown...
>>>     at java.lang.ClassLoader.defineClass1(Native Method) ~[?:?]
>>>     at java.lang.ClassLoader.defineClass(ClassLoader.java:1017) ~[?:?]
>>>     at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:174) ~[?:?]
>>>     at java.net.URLClassLoader.defineClass(URLClassLoader.java:550) ~[?:?]
>>>     at java.net.URLClassLoader$1.run(URLClassLoader.java:458) ~[?:?]
>>>     at java.net.URLClassLoader$1.run(URLClassLoader.java:452) ~[?:?]
>>>     at java.security.AccessController.doPrivileged(Native Method) ~[?:?]
>>>     at java.net.URLClassLoader.findClass(URLClassLoader.java:451) ~[?:?]
>>>     at org.apache.flink.util.ChildFirstClassLoader.loadClassWithoutExceptionHandling(ChildFirstClassLoader.java:71) ~[flink-dist_2.12-1.11.0.jar:1.11.0]
>>>     at org.apache.flink.util.FlinkUserCodeClassLoader.loadClass(FlinkUserCodeClassLoader.java:48) [flink-dist_2.12-1.11.0.jar:1.11.0]
>>>     at java.lang.ClassLoader.loadClass(ClassLoader.java:522) [?:?]