Re: How to debug Metaspace exception?

John Smith Tue, 26 Apr 2022 06:36:37 -0700

So I put classloader.parent-first-patterns.additional: "org.apache.ignite."
in the task config, and so far I don't think I'm getting
"java.lang.OutOfMemoryError: Metaspace" any more.

Or it's too early to tell.

Though now, the task managers are shutting down due to some other failures.

So maybe because tasks were failing and reloading often, the task manager
was running out of Metaspace. But now maybe it's just cleanly shutting down.
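
The MAT path quoted below (DriverManager holding on to the Apache Ignite
JdbcThin driver) looks like a common class loader leak pattern:
java.sql.DriverManager is loaded by the parent class loader and keeps a static
reference to every registered driver, so a driver registered from a job's
ChildFirstClassLoader can keep that class loader alive after the job restarts.
Besides the parent-first setting, another option might be to deregister the
driver when the job shuts down. A minimal sketch, assuming plain JDBC only; the
class and method names (JdbcDriverCleanup, deregisterDriversLoadedBy) are made
up for illustration and are not part of Flink or Ignite:

```java
import java.sql.Driver;
import java.sql.DriverManager;
import java.sql.SQLException;
import java.util.Collections;
import java.util.List;

// Hypothetical helper, not a Flink or Ignite API.
final class JdbcDriverCleanup {
    private JdbcDriverCleanup() {}

    /**
     * Deregisters every JDBC driver whose class was loaded by the given
     * class loader and returns how many drivers were removed.
     */
    static int deregisterDriversLoadedBy(ClassLoader loader) throws SQLException {
        int removed = 0;
        // Snapshot the drivers visible to the caller before deregistering any.
        List<Driver> drivers = Collections.list(DriverManager.getDrivers());
        for (Driver driver : drivers) {
            // Only touch drivers that belong to the given class loader.
            if (driver.getClass().getClassLoader() == loader) {
                DriverManager.deregisterDriver(driver);
                removed++;
            }
        }
        return removed;
    }
}
```

In a job this could be called from the close() method of the function that
opens the JDBC connection, passing getClass().getClassLoader(). Whether that
alone is enough for the Ignite driver is untested.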

On Wed, Apr 20, 2022 at 11:35 AM John Smith <[email protected]> wrote:

> Or can I put in the config to treat org.apache.ignite. classes as
> parent-first?
>
> On Tue, Apr 19, 2022 at 10:18 PM John Smith <[email protected]>
> wrote:
>
>> Ok, so I loaded the dump into Eclipse MAT and followed:
>> https://cwiki.apache.org/confluence/display/FLINK/Debugging+ClassLoader+leaks
>>
>> - On the Histogram, I got over 30 entries for: ChildFirstClassLoader
>> - Then I clicked on one of them "Merge Shortest Path..." and picked
>> "Exclude all phantom/weak/soft references"
>> - Which then gave me: SqlDriverManager > Apache Ignite JdbcThin Driver
>>
>> So I'm guessing this applies to anything JDBC-based. Should I copy the
>> drivers into the task manager lib folder and have my jobs declare the
>> dependencies as compile-only?
>>
>> On Tue, Apr 19, 2022 at 12:18 PM Yaroslav Tkachenko <[email protected]>
>> wrote:
>>
>>> Also
>>> https://shopify.engineering/optimizing-apache-flink-applications-tips
>>> might be helpful (has a section on profiling, as well as classloading).
>>>
>>> On Tue, Apr 19, 2022 at 4:35 AM Chesnay Schepler <[email protected]>
>>> wrote:
>>>
>>>> We have a very rough "guide" in the wiki (it's just the specific steps
>>>> I took to debug another leak):
>>>>
>>>> https://cwiki.apache.org/confluence/display/FLINK/Debugging+ClassLoader+leaks
>>>>
>>>> On 19/04/2022 12:01, huweihua wrote:
>>>>
>>>> Hi, John
>>>>
>>>> Sorry for the late reply. You can use MAT[1] to analyze the dump file.
>>>> Check whether there are too many loaded classes.
>>>>
>>>> [1] https://www.eclipse.org/mat/
>>>>
>>>> On Apr 18, 2022, at 9:55 PM, John Smith <[email protected]> wrote:
>>>>
>>>> Hi, can anyone help with this? I never looked at a dump file before.
>>>>
>>>> On Thu, Apr 14, 2022 at 11:59 AM John Smith <[email protected]>
>>>> wrote:
>>>>
>>>>> Hi, so I have a dump file. What do I look for?
>>>>>
>>>>> On Thu, Mar 31, 2022 at 3:28 PM John Smith <[email protected]>
>>>>> wrote:
>>>>>
>>>>>> Ok, so if there's a leak, if I manually stop the job and restart it
>>>>>> from the UI multiple times, I won't see the issue because the
>>>>>> classes are unloaded correctly?
>>>>>>
>>>>>>
>>>>>> On Thu, Mar 31, 2022 at 9:20 AM huweihua <[email protected]>
>>>>>> wrote:
>>>>>>
>>>>>>>
>>>>>>> The difference is that manually canceling the job stops the
>>>>>>> JobMaster, but automatic failover keeps the JobMaster running. From the
>>>>>>> TaskManager's point of view, though, it doesn't make much difference.
>>>>>>>
>>>>>>>
>>>>>>> On Mar 31, 2022, at 4:01 AM, John Smith <[email protected]> wrote:
>>>>>>>
>>>>>>> Also, if I manually cancel and restart the same job over and over, is
>>>>>>> it the same as if Flink was restarting a job due to failure?
>>>>>>>
>>>>>>> I.e., when I click "Cancel Job" on the UI, is the job completely
>>>>>>> unloaded, vs. when the job scheduler restarts a job for whatever
>>>>>>> reason?
>>>>>>>
>>>>>>> Like this, I'll stop and restart the job a few times, or maybe I can
>>>>>>> trick my job to fail and have the scheduler restart it. Ok, let me think
>>>>>>> about this...
>>>>>>>
>>>>>>> On Wed, Mar 30, 2022 at 10:24 AM 胡伟华 <[email protected]> wrote:
>>>>>>>
>>>>>>>> > So if I run the same jobs in my dev env will I still be able to see
>>>>>>>> > the similar dump?
>>>>>>>>
>>>>>>>> I think running the same job in dev should reproduce it; maybe
>>>>>>>> you can give it a try.
>>>>>>>>
>>>>>>>> > If not I would have to wait at a low volume time to do it on
>>>>>>>> > production. Also, if I recall, the dump is as big as the JVM memory,
>>>>>>>> > right? So if I have 10GB configured for the JVM, the dump will be a
>>>>>>>> > 10GB file?
>>>>>>>>
>>>>>>>> Yes. jmap will pause the JVM, and the length of the pause depends on
>>>>>>>> the size of the dump. You can use "jmap -dump:live" to dump only the
>>>>>>>> reachable objects, which keeps the pause brief.
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> On Mar 30, 2022, at 9:47 PM, John Smith <[email protected]> wrote:
>>>>>>>>
>>>>>>>> I have 3 task managers (see config below). There is a total of 10
>>>>>>>> jobs with 25 slots being used.
>>>>>>>> The jobs are 100% ETL, i.e. they load JSON, transform it and push it
>>>>>>>> to JDBC; only 1 job of the 10 is pushing to an Apache Ignite cluster.
>>>>>>>>
>>>>>>>> For jmap: I know that it will pause the task manager. So if I run
>>>>>>>> the same jobs in my dev env, will I still be able to see a similar
>>>>>>>> dump? I assume so. If not, I would have to wait for a low-volume time
>>>>>>>> to do it on production. Also, if I recall, the dump is as big as the
>>>>>>>> JVM memory, right? So if I have 10GB configured for the JVM, the dump
>>>>>>>> will be a 10GB file?
>>>>>>>>
>>>>>>>>
>>>>>>>> # Operating system has 16GB total.
>>>>>>>> env.ssh.opts: -l flink -oStrictHostKeyChecking=no
>>>>>>>>
>>>>>>>> cluster.evenly-spread-out-slots: true
>>>>>>>>
>>>>>>>> taskmanager.memory.flink.size: 10240m
>>>>>>>> taskmanager.memory.jvm-metaspace.size: 2048m
>>>>>>>> taskmanager.numberOfTaskSlots: 16
>>>>>>>> parallelism.default: 1
>>>>>>>>
>>>>>>>> high-availability: zookeeper
>>>>>>>> high-availability.storageDir: file:///mnt/flink/ha/flink_1_14/
>>>>>>>> high-availability.zookeeper.quorum: ...
>>>>>>>> high-availability.zookeeper.path.root: /flink_1_14
>>>>>>>> high-availability.cluster-id: /flink_1_14_cluster_0001
>>>>>>>>
>>>>>>>> web.upload.dir: /mnt/flink/uploads/flink_1_14
>>>>>>>>
>>>>>>>> state.backend: rocksdb
>>>>>>>> state.backend.incremental: true
>>>>>>>> state.checkpoints.dir: file:///mnt/flink/checkpoints/flink_1_14
>>>>>>>> state.savepoints.dir: file:///mnt/flink/savepoints/flink_1_14
>>>>>>>>
>>>>>>>> On Wed, Mar 30, 2022 at 2:16 AM 胡伟华 <[email protected]> wrote:
>>>>>>>>
>>>>>>>>> Hi, John
>>>>>>>>>
>>>>>>>>> Could you tell us your application scenario? Is it a Flink session
>>>>>>>>> cluster with a lot of jobs?
>>>>>>>>>
>>>>>>>>> Maybe you can try to dump the memory with jmap and use tools such
>>>>>>>>> as MAT to analyze whether there are abnormal classes and classloaders.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> > On Mar 30, 2022, at 6:09 AM, John Smith <[email protected]> wrote:
>>>>>>>>> >
>>>>>>>>> > Hi, running 1.14.4.
>>>>>>>>> >
>>>>>>>>> > My task manager still fails with java.lang.OutOfMemoryError:
>>>>>>>>> > Metaspace. The metaspace out-of-memory error has occurred. This can
>>>>>>>>> > mean two things: either the job requires a larger size of JVM
>>>>>>>>> > metaspace to load classes or there is a class loading leak.
>>>>>>>>> >
>>>>>>>>> > I have 2GB of metaspace configured:
>>>>>>>>> > taskmanager.memory.jvm-metaspace.size: 2048m
>>>>>>>>> >
>>>>>>>>> > But the task nodes still fail.
>>>>>>>>> >
>>>>>>>>> > When looking at the UI metrics, the metaspace starts low. Now I
>>>>>>>>> > see 85% usage. It seems to be a class loading leak at this point. How
>>>>>>>>> > can we debug this issue?
>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>
>>>>
