You sure?
- *JDBC*: JDBC drivers leak references outside the user code classloader. To
ensure that these classes are only loaded once you should either add the
driver jars to Flink's lib/ folder, or add the driver classes to the list of
parent-first loaded classes via classloader.parent-first-patterns-additional
<https://nightlies.apache.org/flink/flink-docs-master/docs/deployment/config/#classloader-parent-first-patterns-additional>.

It says either/or.

On Wed, Apr 27, 2022 at 3:44 AM Chesnay Schepler <ches...@apache.org> wrote:

> You're misinterpreting the docs.
>
> The parent/child-first classloading controls where Flink looks for a class
> *first*, specifically whether we first load from /lib or the user-jar.
> It does not allow you to load something from the user-jar in the parent
> classloader. That's just not how it works.
>
> It must be in /lib.
>
> On 27/04/2022 04:59, John Smith wrote:
>
> Hi Chesnay, as per the docs...
> https://nightlies.apache.org/flink/flink-docs-master/docs/ops/debugging/debugging_classloading/
>
> You can either put the jars in the task manager lib folder or use
> classloader.parent-first-patterns-additional
> <https://nightlies.apache.org/flink/flink-docs-master/docs/deployment/config/#classloader-parent-first-patterns-additional>
>
> I prefer the latter: this way the dependency stays with the user-jar and
> not on the task manager.
>
> On Tue, Apr 26, 2022 at 9:52 PM John Smith <java.dev....@gmail.com> wrote:
>
>> Ok, so I should put the Apache Ignite and my Microsoft drivers in the lib
>> folders of my task managers?
>>
>> And then in my job jar only include them as compile-time dependencies?
>>
>> On Tue, Apr 26, 2022 at 10:42 AM Chesnay Schepler <ches...@apache.org>
>> wrote:
>>
>>> JDBC drivers are well known for leaking classloaders, unfortunately.
>>>
>>> You have correctly identified your alternatives.
>>>
>>> You must put the JDBC driver into /lib instead. Setting only the
>>> parent-first pattern shouldn't affect anything. That is only relevant if
>>> something is in both /lib and the user-jar, telling Flink to prioritize
>>> what is in lib.
>>>
>>> On 26/04/2022 15:35, John Smith wrote:
>>>
>>> So I put classloader.parent-first-patterns.additional:
>>> "org.apache.ignite." in the task manager config and so far I don't think
>>> I'm getting "java.lang.OutOfMemoryError: Metaspace" any more.
>>>
>>> Or it's too early to tell.
>>>
>>> Though now, the task managers are shutting down due to some other
>>> failures.
>>>
>>> So maybe because tasks were failing and reloading often, the task manager
>>> was running out of Metaspace. But now maybe it's just cleanly shutting
>>> down.
>>>
>>> On Wed, Apr 20, 2022 at 11:35 AM John Smith <java.dev....@gmail.com>
>>> wrote:
>>>
>>>> Or I can put in the config to treat org.apache.ignite. classes as
>>>> parent-first?
>>>>
>>>> On Tue, Apr 19, 2022 at 10:18 PM John Smith <java.dev....@gmail.com>
>>>> wrote:
>>>>
>>>>> Ok, so I loaded the dump into Eclipse MAT and followed:
>>>>> https://cwiki.apache.org/confluence/display/FLINK/Debugging+ClassLoader+leaks
>>>>>
>>>>> - On the Histogram, I got over 30 entries for: ChildFirstClassLoader
>>>>> - Then I clicked on one of them, chose "Merge Shortest Paths to GC
>>>>>   Roots..." and picked "Exclude all phantom/weak/soft references"
>>>>> - Which then gave me: SqlDriverManager > Apache Ignite JdbcThin Driver
>>>>>
>>>>> So I'm guessing anything JDBC-based. I should copy the drivers into the
>>>>> task manager lib folder and have my jobs declare the dependencies as
>>>>> compile-only?
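To spell out why that MAT path points at a leak: java.sql.DriverManager is loaded by the JVM's application classloader (Flink's parent classloader), while the Ignite and Microsoft Driver instances it registers were loaded from the user-jar, so DriverManager's static registry keeps each job's ChildFirstClassLoader, and everything it loaded, reachable across restarts. Below is a rough, illustrative sketch of that relationship plus a deregistration workaround some jobs apply from a RichFunction's close(); the class and method names are made up for the example, and this is not what the thread settles on as the primary fix:

    import java.sql.Driver;
    import java.sql.DriverManager;
    import java.sql.SQLException;
    import java.util.Enumeration;

    public class JdbcDriverLeakWorkaround {

        // DriverManager keeps a static, JVM-wide list of registered drivers.
        // The Driver classes themselves were loaded by the job's
        // ChildFirstClassLoader, so every entry in that list pins the job's
        // classloader (and all classes it loaded) in metaspace.

        // Best-effort cleanup, e.g. called from a RichFunction#close() as
        // deregisterDriversLoadedBy(getClass().getClassLoader()):
        // deregister any driver that this job's classloader loaded.
        public static void deregisterDriversLoadedBy(ClassLoader jobClassLoader) {
            Enumeration<Driver> drivers = DriverManager.getDrivers();
            while (drivers.hasMoreElements()) {
                Driver driver = drivers.nextElement();
                if (driver.getClass().getClassLoader() == jobClassLoader) {
                    try {
                        DriverManager.deregisterDriver(driver);
                    } catch (SQLException e) {
                        // Nothing useful to do here; cleanup is best effort.
                    }
                }
            }
        }
    }

As Chesnay says above, moving the driver jar into /lib (and keeping it as a compile-only dependency in the job jar) is the more reliable fix; the sketch only covers drivers that the job itself registered.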
>>>>> On Tue, Apr 19, 2022 at 12:18 PM Yaroslav Tkachenko <yaros...@goldsky.io>
>>>>> wrote:
>>>>>
>>>>>> Also https://shopify.engineering/optimizing-apache-flink-applications-tips
>>>>>> might be helpful (it has a section on profiling, as well as classloading).
>>>>>>
>>>>>> On Tue, Apr 19, 2022 at 4:35 AM Chesnay Schepler <ches...@apache.org>
>>>>>> wrote:
>>>>>>
>>>>>>> We have a very rough "guide" in the wiki (it's just the specific steps
>>>>>>> I took to debug another leak):
>>>>>>> https://cwiki.apache.org/confluence/display/FLINK/Debugging+ClassLoader+leaks
>>>>>>>
>>>>>>> On 19/04/2022 12:01, huweihua wrote:
>>>>>>>
>>>>>>> Hi, John
>>>>>>>
>>>>>>> Sorry for the late reply. You can use MAT [1] to analyze the dump file
>>>>>>> and check whether too many classes have been loaded.
>>>>>>>
>>>>>>> [1] https://www.eclipse.org/mat/
>>>>>>>
>>>>>>> On April 18, 2022 at 9:55 PM, John Smith <java.dev....@gmail.com> wrote:
>>>>>>>
>>>>>>> Hi, can anyone help with this? I have never looked at a dump file
>>>>>>> before.
>>>>>>>
>>>>>>> On Thu, Apr 14, 2022 at 11:59 AM John Smith <java.dev....@gmail.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Hi, so I have a dump file. What do I look for?
>>>>>>>>
>>>>>>>> On Thu, Mar 31, 2022 at 3:28 PM John Smith <java.dev....@gmail.com>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> Ok, so if there's a leak and I manually stop the job and restart it
>>>>>>>>> from the UI multiple times, I won't see the issue because the classes
>>>>>>>>> are unloaded correctly?
>>>>>>>>>
>>>>>>>>> On Thu, Mar 31, 2022 at 9:20 AM huweihua <huweihua....@gmail.com>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>> The difference is that manually canceling the job stops the
>>>>>>>>>> JobMaster, but automatic failover keeps the JobMaster running.
>>>>>>>>>> Looking at the TaskManager, though, it doesn't make much difference.
>>>>>>>>>>
>>>>>>>>>> On March 31, 2022 at 4:01 AM, John Smith <java.dev....@gmail.com>
>>>>>>>>>> wrote:
>>>>>>>>>>
>>>>>>>>>> Also, if I manually cancel and restart the same job over and over,
>>>>>>>>>> is it the same as if Flink was restarting a job due to failure?
>>>>>>>>>>
>>>>>>>>>> I.e. when I click "Cancel Job" in the UI, is the job completely
>>>>>>>>>> unloaded, versus when the job scheduler restarts a job for whatever
>>>>>>>>>> reason?
>>>>>>>>>>
>>>>>>>>>> Like this I'll stop and restart the job a few times, or maybe I can
>>>>>>>>>> trick my job into failing and have the scheduler restart it. Ok, let
>>>>>>>>>> me think about this...
>>>>>>>>>>
>>>>>>>>>> On Wed, Mar 30, 2022 at 10:24 AM 胡伟华 <huweihua....@gmail.com>
>>>>>>>>>> wrote:
>>>>>>>>>>
>>>>>>>>>>> So if I run the same jobs in my dev env will I still be able to
>>>>>>>>>>> see a similar dump?
>>>>>>>>>>>
>>>>>>>>>>> I think running the same job in dev should be reproducible; maybe
>>>>>>>>>>> you can have a try.
>>>>>>>>>>>
>>>>>>>>>>> If not I would have to wait for a low-volume time to do it on
>>>>>>>>>>> production. Also, if I recall, the dump is as big as the JVM
>>>>>>>>>>> memory, right? So if I have 10GB configured for the JVM the dump
>>>>>>>>>>> will be a 10GB file?
>>>>>>>>>>>
>>>>>>>>>>> Yes, jmap will pause the JVM; the length of the pause depends on
>>>>>>>>>>> the size of the dump. You can use "jmap -dump:live" to dump only
>>>>>>>>>>> the reachable objects, which takes only a brief pause.
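A side note on taking the dump huweihua describes just above: if attaching jmap on the production box is awkward, the same live-objects dump can be triggered from inside the JVM through the HotSpotDiagnostic MXBean. This is only a sketch under that assumption (the file path is illustrative), not something mentioned in the thread:

    import com.sun.management.HotSpotDiagnosticMXBean;
    import java.lang.management.ManagementFactory;

    public class HeapDumpHelper {

        // Rough in-process equivalent of "jmap -dump:live,format=b,file=<path> <pid>".
        // The boolean argument restricts the dump to live (reachable) objects,
        // which keeps the file smaller and the pause shorter.
        public static void dumpHeap(String path) throws Exception {
            HotSpotDiagnosticMXBean bean = ManagementFactory.newPlatformMXBeanProxy(
                    ManagementFactory.getPlatformMBeanServer(),
                    "com.sun.management:type=HotSpotDiagnostic",
                    HotSpotDiagnosticMXBean.class);
            bean.dumpHeap(path, true);
        }

        public static void main(String[] args) throws Exception {
            // The target file must not already exist; the path is illustrative.
            dumpHeap("/tmp/taskmanager.hprof");
        }
    }

The resulting .hprof file can be opened in MAT just like a jmap dump.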
>>>>>>>>>>> On March 30, 2022 at 9:47 PM, John Smith <java.dev....@gmail.com>
>>>>>>>>>>> wrote:
>>>>>>>>>>>
>>>>>>>>>>> I have 3 task managers (see config below). There is a total of 10
>>>>>>>>>>> jobs with 25 slots being used.
>>>>>>>>>>> The jobs are 100% ETL, i.e. they load JSON, transform it and push
>>>>>>>>>>> it to JDBC; only 1 job of the 10 is pushing to an Apache Ignite
>>>>>>>>>>> cluster.
>>>>>>>>>>>
>>>>>>>>>>> FOR JMAP: I know that it will pause the task manager. So if I run
>>>>>>>>>>> the same jobs in my dev env will I still be able to see a similar
>>>>>>>>>>> dump? I assume so. If not I would have to wait for a low-volume
>>>>>>>>>>> time to do it on production. Also, if I recall, the dump is as big
>>>>>>>>>>> as the JVM memory, right? So if I have 10GB configured for the JVM
>>>>>>>>>>> the dump will be a 10GB file?
>>>>>>>>>>>
>>>>>>>>>>> # Operating system has 16GB total.
>>>>>>>>>>> env.ssh.opts: -l flink -oStrictHostKeyChecking=no
>>>>>>>>>>>
>>>>>>>>>>> cluster.evenly-spread-out-slots: true
>>>>>>>>>>>
>>>>>>>>>>> taskmanager.memory.flink.size: 10240m
>>>>>>>>>>> taskmanager.memory.jvm-metaspace.size: 2048m
>>>>>>>>>>> taskmanager.numberOfTaskSlots: 16
>>>>>>>>>>> parallelism.default: 1
>>>>>>>>>>>
>>>>>>>>>>> high-availability: zookeeper
>>>>>>>>>>> high-availability.storageDir: file:///mnt/flink/ha/flink_1_14/
>>>>>>>>>>> high-availability.zookeeper.quorum: ...
>>>>>>>>>>> high-availability.zookeeper.path.root: /flink_1_14
>>>>>>>>>>> high-availability.cluster-id: /flink_1_14_cluster_0001
>>>>>>>>>>>
>>>>>>>>>>> web.upload.dir: /mnt/flink/uploads/flink_1_14
>>>>>>>>>>>
>>>>>>>>>>> state.backend: rocksdb
>>>>>>>>>>> state.backend.incremental: true
>>>>>>>>>>> state.checkpoints.dir: file:///mnt/flink/checkpoints/flink_1_14
>>>>>>>>>>> state.savepoints.dir: file:///mnt/flink/savepoints/flink_1_14
>>>>>>>>>>>
>>>>>>>>>>> On Wed, Mar 30, 2022 at 2:16 AM 胡伟华 <huweihua....@gmail.com>
>>>>>>>>>>> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Hi, John
>>>>>>>>>>>>
>>>>>>>>>>>> Could you tell us your application scenario? Is it a Flink
>>>>>>>>>>>> session cluster with a lot of jobs?
>>>>>>>>>>>>
>>>>>>>>>>>> Maybe you can try to dump the memory with jmap and use tools such
>>>>>>>>>>>> as MAT to analyze whether there are abnormal classes and
>>>>>>>>>>>> classloaders.
>>>>>>>>>>>>
>>>>>>>>>>>> On March 30, 2022 at 6:09 AM, John Smith <java.dev....@gmail.com>
>>>>>>>>>>>> wrote:
>>>>>>>>>>>> >
>>>>>>>>>>>> > Hi, running 1.14.4.
>>>>>>>>>>>> >
>>>>>>>>>>>> > My task managers still fail with java.lang.OutOfMemoryError:
>>>>>>>>>>>> Metaspace. The metaspace out-of-memory error has occurred. This
>>>>>>>>>>>> can mean two things: either the job requires a larger size of JVM
>>>>>>>>>>>> metaspace to load classes or there is a class loading leak.
>>>>>>>>>>>> >
>>>>>>>>>>>> > I have 2GB of metaspace configured:
>>>>>>>>>>>> taskmanager.memory.jvm-metaspace.size: 2048m
>>>>>>>>>>>> >
>>>>>>>>>>>> > But the task nodes still fail.
>>>>>>>>>>>> >
>>>>>>>>>>>> > When looking at the UI metrics, the metaspace starts low. Now I
>>>>>>>>>>>> see 85% usage. It seems to be a class loading leak at this point;
>>>>>>>>>>>> how can we debug this issue?
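One practical note on the original message just above (metaspace starting low and creeping up to 85%): the same figure can be read from inside the TaskManager JVM, or over JMX, via the memory pool MXBeans, which can help with alerting before the java.lang.OutOfMemoryError: Metaspace actually hits. A minimal sketch, assuming the HotSpot pool name "Metaspace"; it is not something from the thread itself:

    import java.lang.management.ManagementFactory;
    import java.lang.management.MemoryPoolMXBean;
    import java.lang.management.MemoryUsage;

    public class MetaspaceCheck {

        // Prints current metaspace usage, roughly what the Flink UI metric shows.
        public static void main(String[] args) {
            for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
                if ("Metaspace".equals(pool.getName())) {
                    MemoryUsage usage = pool.getUsage();
                    long max = usage.getMax(); // -1 if no MaxMetaspaceSize is set
                    System.out.printf("Metaspace used=%d bytes, max=%d bytes%n",
                            usage.getUsed(), max);
                }
            }
        }
    }

If that number keeps climbing across job restarts even after the drivers are moved to lib/, it points to another classloading leak rather than an undersized taskmanager.memory.jvm-metaspace.size.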