Re: How to debug Metaspace exception?

Yaroslav Tkachenko Tue, 19 Apr 2022 09:18:08 -0700

Also https://shopify.engineering/optimizing-apache-flink-applications-tips
might be helpful (has a section on profiling, as well as classloading).
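
For a quick first check before taking a full dump, the metaspace and
class-loading counters discussed in this thread can also be read from inside
a JVM through the standard java.lang.management MXBeans. The following is a
minimal sketch, not taken from any message in this thread: the class name
MetaspaceReport and its methods are hypothetical, and the pool name
"Metaspace" is an assumption that holds on HotSpot JVMs. It reports only on
the JVM it runs in, so to see TaskManager numbers it would have to run inside
the TaskManager process.

```java
import java.lang.management.ClassLoadingMXBean;
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryPoolMXBean;
import java.lang.management.MemoryUsage;

// Hypothetical helper class; the name is illustrative only.
public class MetaspaceReport {

    // Returns the used bytes of the memory pool named "Metaspace"
    // (the HotSpot pool name, an assumption), or -1 if no such pool exists.
    static long metaspaceUsedBytes() {
        for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
            if ("Metaspace".equals(pool.getName())) {
                MemoryUsage usage = pool.getUsage();
                return usage == null ? -1L : usage.getUsed();
            }
        }
        return -1L;
    }

    // Builds a one-line summary of metaspace usage and class-loading counters.
    static String report() {
        ClassLoadingMXBean classes = ManagementFactory.getClassLoadingMXBean();
        return String.format(
                "metaspaceUsedBytes=%d loaded=%d totalLoaded=%d unloaded=%d",
                metaspaceUsedBytes(),
                classes.getLoadedClassCount(),       // classes currently loaded
                classes.getTotalLoadedClassCount(),  // loaded since JVM start
                classes.getUnloadedClassCount());    // unloaded since JVM start
    }

    public static void main(String[] args) {
        System.out.println(report());
    }
}
```

If "loaded" keeps growing across job restarts while "unloaded" stays flat,
that is consistent with a class loading leak rather than an undersized
metaspace.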


On Tue, Apr 19, 2022 at 4:35 AM Chesnay Schepler <ches...@apache.org> wrote:

> We have a very rough "guide" in the wiki (it's just the specific steps I
> took to debug another leak):
>
> https://cwiki.apache.org/confluence/display/FLINK/Debugging+ClassLoader+leaks
>
> On 19/04/2022 12:01, huweihua wrote:
>
> Hi, John
>
> Sorry for the late reply. You can use MAT[1] to analyze the dump file.
> Check whether too many classes are loaded.
>
> [1] https://www.eclipse.org/mat/
>
> On Apr 18, 2022, at 9:55 PM, John Smith <java.dev....@gmail.com> wrote:
>
> Hi, can anyone help with this? I never looked at a dump file before.
>
> On Thu, Apr 14, 2022 at 11:59 AM John Smith <java.dev....@gmail.com>
> wrote:
>
>> Hi, so I have a dump file. What do I look for?
>>
>> On Thu, Mar 31, 2022 at 3:28 PM John Smith <java.dev....@gmail.com>
>> wrote:
>>
>>> Ok so if there's a leak, if I manually stop the job and restart it from
>>> the UI multiple times, I won't see the issue because the classes
>>> are unloaded correctly?
>>>
>>>
>>> On Thu, Mar 31, 2022 at 9:20 AM huweihua <huweihua....@gmail.com> wrote:
>>>
>>>>
>>>> The difference is that manually canceling the job stops the JobMaster,
>>>> but automatic failover keeps the JobMaster running. From the
>>>> TaskManager's point of view, though, it doesn't make much difference.
>>>>
>>>>
>>>> On Mar 31, 2022, at 4:01 AM, John Smith <java.dev....@gmail.com> wrote:
>>>>
>>>> Also, if I manually cancel and restart the same job over and over, is it
>>>> the same as if Flink were restarting a job due to a failure?
>>>>
>>>> I.e., when I click "Cancel Job" on the UI, is the job completely unloaded,
>>>> vs. when the job scheduler restarts a job for whatever reason?
>>>>
>>>> Like this, I'll stop and restart the job a few times, or maybe I can
>>>> trick my job into failing and have the scheduler restart it. Ok, let me
>>>> think about this...
>>>>
>>>> On Wed, Mar 30, 2022 at 10:24 AM 胡伟华 <huweihua....@gmail.com> wrote:
>>>>
>>>>> So if I run the same jobs in my dev env will I still be able to see
>>>>> a similar dump?
>>>>>
>>>>> I think the issue should be reproducible by running the same job in
>>>>> dev; you can give it a try.
>>>>>
>>>>> If not, I would have to wait for a low-volume time to do it on
>>>>> production. Also, if I recall, the dump is as big as the JVM memory, right?
>>>>> So if I have 10GB configured for the JVM, the dump will be a 10GB file?
>>>>>
>>>>> Yes, jmap will pause the JVM, and the length of the pause depends on the
>>>>> size of the dump. You can use "jmap -dump:live" to dump only the
>>>>> reachable objects, which should make the pause shorter.
>>>>>
>>>>>
>>>>>
>>>>> On Mar 30, 2022, at 9:47 PM, John Smith <java.dev....@gmail.com> wrote:
>>>>>
>>>>> I have 3 task managers (see config below). There are a total of 10 jobs
>>>>> with 25 slots being used.
>>>>> The jobs are 100% ETL, i.e., they load JSON, transform it and push it to
>>>>> JDBC; only 1 job of the 10 is pushing to an Apache Ignite cluster.
>>>>>
>>>>> About jmap: I know that it will pause the task manager. So if I run the
>>>>> same jobs in my dev env, will I still be able to see a similar dump? I
>>>>> assume so. If not, I would have to wait for a low-volume time to do it on
>>>>> production. Also, if I recall, the dump is as big as the JVM memory, right?
>>>>> So if I have 10GB configured for the JVM, the dump will be a 10GB file?
>>>>>
>>>>>
>>>>> # Operating system has 16GB total.
>>>>> env.ssh.opts: -l flink -oStrictHostKeyChecking=no
>>>>>
>>>>> cluster.evenly-spread-out-slots: true
>>>>>
>>>>> taskmanager.memory.flink.size: 10240m
>>>>> taskmanager.memory.jvm-metaspace.size: 2048m
>>>>> taskmanager.numberOfTaskSlots: 16
>>>>> parallelism.default: 1
>>>>>
>>>>> high-availability: zookeeper
>>>>> high-availability.storageDir: file:///mnt/flink/ha/flink_1_14/
>>>>> high-availability.zookeeper.quorum: ...
>>>>> high-availability.zookeeper.path.root: /flink_1_14
>>>>> high-availability.cluster-id: /flink_1_14_cluster_0001
>>>>>
>>>>> web.upload.dir: /mnt/flink/uploads/flink_1_14
>>>>>
>>>>> state.backend: rocksdb
>>>>> state.backend.incremental: true
>>>>> state.checkpoints.dir: file:///mnt/flink/checkpoints/flink_1_14
>>>>> state.savepoints.dir: file:///mnt/flink/savepoints/flink_1_14
>>>>>
>>>>> On Wed, Mar 30, 2022 at 2:16 AM 胡伟华 <huweihua....@gmail.com> wrote:
>>>>>
>>>>>> Hi, John
>>>>>>
>>>>>> Could you tell us your application scenario? Is it a Flink session
>>>>>> cluster with a lot of jobs?
>>>>>>
>>>>>> Maybe you can try to dump the memory with jmap and use tools such as
>>>>>> MAT to analyze whether there are abnormal classes and classloaders
>>>>>>
>>>>>>
>>>>>> > On Mar 30, 2022, at 6:09 AM, John Smith <java.dev....@gmail.com> wrote:
>>>>>> >
>>>>>> > Hi, I'm running Flink 1.14.4.
>>>>>> >
>>>>>> > My task manager still fails with java.lang.OutOfMemoryError:
>>>>>> > Metaspace. The metaspace out-of-memory error has occurred. This can mean
>>>>>> > two things: either the job requires a larger size of JVM metaspace to load
>>>>>> > classes or there is a class loading leak.
>>>>>> >
>>>>>> > I have 2GB of metaspace configured:
>>>>>> > taskmanager.memory.jvm-metaspace.size: 2048m
>>>>>> >
>>>>>> > But the task nodes still fail.
>>>>>> >
>>>>>> > When looking at the UI metrics, the metaspace starts low. Now I see
>>>>>> > 85% usage. It seems to be a class loading leak at this point. How can we
>>>>>> > debug this issue?
>>>>>>
>>>>>>
>>>>>
>>>>
>
>
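
The "jmap -dump:live" step suggested in this thread also has an in-process
counterpart on HotSpot JVMs: the HotSpotDiagnosticMXBean can write a heap
dump of reachable objects only. The following is a minimal sketch under
stated assumptions, not something proposed by the participants: the class
name LiveHeapDump and the file name are hypothetical, the code dumps only
the JVM it runs in, and it pauses that JVM while the dump is written, just
as jmap does. The target file must not already exist, and recent JDKs
require the name to end in ".hprof".

```java
import com.sun.management.HotSpotDiagnosticMXBean;
import java.io.IOException;
import java.lang.management.ManagementFactory;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper class; the name is illustrative only.
public class LiveHeapDump {

    // Writes a heap dump of the current JVM to the given path.
    // The second argument 'true' restricts the dump to live (reachable)
    // objects, which mirrors the ":live" option of jmap.
    static Path dumpLive(Path target) throws IOException {
        HotSpotDiagnosticMXBean diagnostics =
                ManagementFactory.getPlatformMXBean(HotSpotDiagnosticMXBean.class);
        diagnostics.dumpHeap(target.toString(), true);
        return target;
    }

    public static void main(String[] args) throws IOException {
        // A fresh temporary directory guarantees the target does not exist yet.
        Path dir = Files.createTempDirectory("heap-dump");
        Path dump = dumpLive(dir.resolve("live-objects.hprof"));
        System.out.println("Wrote " + Files.size(dump) + " bytes to " + dump);
    }
}
```

The resulting .hprof file can be opened in MAT in the same way as a dump
produced by jmap.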
