Re: How to debug Metaspace exception?

John Smith Wed, 20 Apr 2022 08:36:33 -0700

Or can I configure Flink to treat org.apache.ignite. classes as
parent-first?
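
For the DriverManager reference that MAT reported in the message quoted
below, one workaround that is sometimes used (in addition to moving the
driver out of the job jar) is to deregister, at job shutdown, any JDBC
drivers that were loaded by the job's own class loader. A minimal sketch in
plain Java, assuming it is called from the job's teardown code (for example
a sink's close() method). The names DriverCleanup and deregisterDriversOf
are made up for illustration and are not part of Flink or Ignite, and
whether this alone releases the leaked class loader here is untested:

```java
import java.sql.Driver;
import java.sql.DriverManager;
import java.sql.SQLException;
import java.util.Enumeration;

/** Hypothetical helper; not part of Flink or Ignite. */
public final class DriverCleanup {

    private DriverCleanup() {}

    /**
     * Deregisters every JDBC driver whose class was loaded by the given
     * class loader and returns how many drivers were removed.
     */
    public static int deregisterDriversOf(ClassLoader loader) {
        int removed = 0;
        // getDrivers() only lists drivers visible to the caller's class loader.
        Enumeration<Driver> drivers = DriverManager.getDrivers();
        while (drivers.hasMoreElements()) {
            Driver driver = drivers.nextElement();
            // Leave drivers from other class loaders (e.g. Flink's lib folder) alone.
            if (driver.getClass().getClassLoader() == loader) {
                try {
                    DriverManager.deregisterDriver(driver);
                    removed++;
                } catch (SQLException e) {
                    throw new IllegalStateException("Could not deregister " + driver, e);
                }
            }
        }
        return removed;
    }
}
```

It would be called with the class loader that loaded the job's classes,
e.g. DriverCleanup.deregisterDriversOf(MySink.class.getClassLoader()),
where MySink stands for one of the job's own classes.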


On Tue, Apr 19, 2022 at 10:18 PM John Smith <java.dev....@gmail.com> wrote:

> Ok, so I loaded the dump into Eclipse MAT and followed:
> https://cwiki.apache.org/confluence/display/FLINK/Debugging+ClassLoader+leaks
>
> - On the Histogram, I got over 30 entries for: ChildFirstClassLoader
> - Then I clicked on one of them, chose "Merge Shortest Path..." and picked
> "Exclude all phantom/weak/soft references"
> - Which then gave me: SqlDriverManager > Apache Ignite JdbcThin Driver
>
> So I'm guessing this applies to anything JDBC-based. Should I copy the
> drivers into the task manager's lib folder and mark the dependencies in my
> jobs as compile-only?
>
> On Tue, Apr 19, 2022 at 12:18 PM Yaroslav Tkachenko <yaros...@goldsky.io>
> wrote:
>
>> Also
>> https://shopify.engineering/optimizing-apache-flink-applications-tips
>> might be helpful (has a section on profiling, as well as classloading).
>>
>> On Tue, Apr 19, 2022 at 4:35 AM Chesnay Schepler <ches...@apache.org>
>> wrote:
>>
>>> We have a very rough "guide" in the wiki (it's just the specific steps I
>>> took to debug another leak):
>>>
>>> https://cwiki.apache.org/confluence/display/FLINK/Debugging+ClassLoader+leaks
>>>
>>> On 19/04/2022 12:01, huweihua wrote:
>>>
>>> Hi, John
>>>
>>> Sorry for the late reply. You can use MAT[1] to analyze the dump file.
>>> Check whether there are too many loaded classes.
>>>
>>> [1] https://www.eclipse.org/mat/
>>>
>>> On April 18, 2022, at 9:55 PM, John Smith <java.dev....@gmail.com> wrote:
>>>
>>> Hi, can anyone help with this? I have never looked at a dump file before.
>>>
>>> On Thu, Apr 14, 2022 at 11:59 AM John Smith <java.dev....@gmail.com>
>>> wrote:
>>>
>>>> Hi, so I have a dump file. What do I look for?
>>>>
>>>> On Thu, Mar 31, 2022 at 3:28 PM John Smith <java.dev....@gmail.com>
>>>> wrote:
>>>>
>>>>> Ok, so if there's a leak and I manually stop the job and restart it
>>>>> from the UI multiple times, I won't see the issue because the
>>>>> classes are unloaded correctly?
>>>>>
>>>>>
>>>>> On Thu, Mar 31, 2022 at 9:20 AM huweihua <huweihua....@gmail.com>
>>>>> wrote:
>>>>>
>>>>>>
>>>>>> The difference is that manually canceling the job stops the
>>>>>> JobMaster, whereas automatic failover keeps the JobMaster running. From
>>>>>> the TaskManager's point of view, though, it doesn't make much difference.
>>>>>>
>>>>>>
>>>>>> On March 31, 2022, at 4:01 AM, John Smith <java.dev....@gmail.com> wrote:
>>>>>>
>>>>>> Also, if I manually cancel and restart the same job over and over, is
>>>>>> it the same as if Flink were restarting the job due to a failure?
>>>>>>
>>>>>> I.e., when I click "Cancel Job" in the UI, is the job completely
>>>>>> unloaded, compared with when the job scheduler restarts a job for
>>>>>> whatever reason?
>>>>>>
>>>>>> That way I can stop and restart the job a few times, or maybe I can
>>>>>> trick my job into failing and have the scheduler restart it. Ok, let me
>>>>>> think about this...
>>>>>>
>>>>>> On Wed, Mar 30, 2022 at 10:24 AM 胡伟华 <huweihua....@gmail.com> wrote:
>>>>>>
>>>>>>> So if I run the same jobs in my dev env, will I still be able to see
>>>>>>> a similar dump?
>>>>>>>
>>>>>>> I think running the same job in dev should reproduce it; maybe
>>>>>>> you can give it a try.
>>>>>>>
>>>>>>> If not, I would have to wait for a low-volume time to do it on
>>>>>>> production. Also, if I recall, the dump is as big as the JVM memory,
>>>>>>> right? So if I have 10GB configured for the JVM, the dump will be a
>>>>>>> 10GB file?
>>>>>>>
>>>>>>> Yes, jmap will pause the JVM; the length of the pause depends on the
>>>>>>> size of the dump. You can use "jmap -dump:live" to dump only the
>>>>>>> reachable objects, which should keep the pause brief.
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On March 30, 2022, at 9:47 PM, John Smith <java.dev....@gmail.com> wrote:
>>>>>>>
>>>>>>> I have 3 task managers (see config below). There is a total of 10 jobs
>>>>>>> with 25 slots being used.
>>>>>>> The jobs are 100% ETL, i.e., they load JSON, transform it and push it
>>>>>>> to JDBC; only 1 job of the 10 is pushing to an Apache Ignite cluster.
>>>>>>>
>>>>>>> For jmap: I know that it will pause the task manager. So if I run
>>>>>>> the same jobs in my dev env, will I still be able to see a similar
>>>>>>> dump? I assume so. If not, I would have to wait for a low-volume time
>>>>>>> to do it on production. Also, if I recall, the dump is as big as the
>>>>>>> JVM memory, right? So if I have 10GB configured for the JVM, the dump
>>>>>>> will be a 10GB file?
>>>>>>>
>>>>>>>
>>>>>>> # Operating system has 16GB total.
>>>>>>> env.ssh.opts: -l flink -oStrictHostKeyChecking=no
>>>>>>>
>>>>>>> cluster.evenly-spread-out-slots: true
>>>>>>>
>>>>>>> taskmanager.memory.flink.size: 10240m
>>>>>>> taskmanager.memory.jvm-metaspace.size: 2048m
>>>>>>> taskmanager.numberOfTaskSlots: 16
>>>>>>> parallelism.default: 1
>>>>>>>
>>>>>>> high-availability: zookeeper
>>>>>>> high-availability.storageDir: file:///mnt/flink/ha/flink_1_14/
>>>>>>> high-availability.zookeeper.quorum: ...
>>>>>>> high-availability.zookeeper.path.root: /flink_1_14
>>>>>>> high-availability.cluster-id: /flink_1_14_cluster_0001
>>>>>>>
>>>>>>> web.upload.dir: /mnt/flink/uploads/flink_1_14
>>>>>>>
>>>>>>> state.backend: rocksdb
>>>>>>> state.backend.incremental: true
>>>>>>> state.checkpoints.dir: file:///mnt/flink/checkpoints/flink_1_14
>>>>>>> state.savepoints.dir: file:///mnt/flink/savepoints/flink_1_14
>>>>>>>
>>>>>>> On Wed, Mar 30, 2022 at 2:16 AM 胡伟华 <huweihua....@gmail.com> wrote:
>>>>>>>
>>>>>>>> Hi, John
>>>>>>>>
>>>>>>>> Could you tell us your application scenario? Is it a Flink session
>>>>>>>> cluster with a lot of jobs?
>>>>>>>>
>>>>>>>> Maybe you can try to dump the memory with jmap and use tools such
>>>>>>>> as MAT to analyze whether there are abnormal classes and classloaders.
>>>>>>>>
>>>>>>>>
>>>>>>>> > On March 30, 2022, at 6:09 AM, John Smith <java.dev....@gmail.com> wrote:
>>>>>>>> >
>>>>>>>> > Hi, I'm running 1.14.4.
>>>>>>>> >
>>>>>>>> > My task manager still fails with java.lang.OutOfMemoryError:
>>>>>>>> > Metaspace. The metaspace out-of-memory error has occurred. This can
>>>>>>>> > mean two things: either the job requires a larger size of JVM
>>>>>>>> > metaspace to load classes or there is a class loading leak.
>>>>>>>> >
>>>>>>>> > I have 2GB of metaspace configured:
>>>>>>>> > taskmanager.memory.jvm-metaspace.size: 2048m
>>>>>>>> >
>>>>>>>> > But the task nodes still fail.
>>>>>>>> >
>>>>>>>> > When looking at the UI metrics, the metaspace starts low. Now I
>>>>>>>> > see 85% usage. It seems to be a class loading leak at this point.
>>>>>>>> > How can we debug this issue?
>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>
>>>
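
For the original question at the bottom of the thread (Metaspace climbing
to 85% in the UI metrics), a rough way to tell a leak from plain growth is
to compare class-loading counters before and after a few job restarts: if
the loaded count keeps rising while the unloaded count stays flat, classes
are probably not being released. A minimal sketch using the standard
java.lang.management beans; it reports on the JVM it runs in, so it is
assumed to be called from code running inside the task manager. The class
name MetaspaceProbe is hypothetical, and the pool name "Metaspace" is the
one HotSpot uses:

```java
import java.lang.management.ClassLoadingMXBean;
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryPoolMXBean;
import java.lang.management.MemoryUsage;

/** Hypothetical helper; reports class-loading and Metaspace figures. */
public final class MetaspaceProbe {

    private MetaspaceProbe() {}

    /** Returns a one-line summary for the JVM this code is running in. */
    public static String snapshot() {
        ClassLoadingMXBean classes = ManagementFactory.getClassLoadingMXBean();
        StringBuilder sb = new StringBuilder();
        sb.append("loaded=").append(classes.getLoadedClassCount())
          .append(" totalLoaded=").append(classes.getTotalLoadedClassCount())
          .append(" unloaded=").append(classes.getUnloadedClassCount());
        for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
            // Pool names are JVM-specific; HotSpot calls this pool "Metaspace".
            if ("Metaspace".equals(pool.getName())) {
                MemoryUsage usage = pool.getUsage();
                // getMax() is -1 when no maximum is defined.
                sb.append(" metaspaceUsedBytes=").append(usage.getUsed())
                  .append(" metaspaceMaxBytes=").append(usage.getMax());
            }
        }
        return sb.toString();
    }
}
```

Logging MetaspaceProbe.snapshot() at job start and again after each
restart gives a series of numbers that can be compared with the UI metric.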
