We have a very rough "guide" in the wiki (it's just the specific steps I
took to debug another leak):
https://cwiki.apache.org/confluence/display/FLINK/Debugging+ClassLoader+leaks
On 19/04/2022 12:01, huweihua wrote:
Hi, John
Sorry for the late reply. You can use MAT [1] to analyze the dump file and
check whether there are too many loaded classes.
[1] https://www.eclipse.org/mat/
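As a rough sketch of what to look for in MAT (these steps are my own suggestion, not an official recipe): open the .hprof, switch to the OQL view, and list all classloader instances, e.g.

  SELECT * FROM INSTANCEOF java.lang.ClassLoader

If many instances of Flink's user-code classloader (named something like ChildFirstClassLoader, if I remember correctly) from already-cancelled jobs are still alive, right-click one of them and use "Path To GC Roots" > "exclude all phantom/weak/soft etc. references" to see what is still holding on to it.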
On April 18, 2022, at 9:55 PM, John Smith <java.dev....@gmail.com> wrote:
Hi, can anyone help with this? I never looked at a dump file before.
On Thu, Apr 14, 2022 at 11:59 AM John Smith <java.dev....@gmail.com>
wrote:
Hi, so I have a dump file. What do I look for?
On Thu, Mar 31, 2022 at 3:28 PM John Smith
<java.dev....@gmail.com> wrote:
Ok, so if there's a leak and I manually stop the job and
restart it from the UI multiple times, I won't see the issue
because the classes are unloaded correctly?
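One lightweight way to check this (just a suggestion, assuming you can attach to the TaskManager JVM; <tm-pid> is a placeholder for its process id) is to watch the loaded-class count and metaspace usage with jstat while cancelling and resubmitting the job a few times:

  jstat -class <tm-pid> 10000   # loaded/unloaded class counts, sampled every 10s
  jstat -gc <tm-pid> 10000      # GC columns, including MC/MU (metaspace capacity/used)

If the loaded-class count keeps climbing after each cancel/resubmit cycle and never drops back after a full GC, the old classloaders are not being collected; if it comes back down, the classes are being unloaded correctly.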
On Thu, Mar 31, 2022 at 9:20 AM huweihua
<huweihua....@gmail.com> wrote:
The difference is that manually canceling the job stops
the JobMaster, but automatic failover keeps the JobMaster
running. From the TaskManager's point of view, though, it
doesn't make much difference.
On March 31, 2022, at 4:01 AM, John Smith <java.dev....@gmail.com>
wrote:
Also, if I manually cancel and restart the same job over
and over, is it the same as if Flink was restarting a job
due to failure?
I.e.: when I click "Cancel Job" on the UI, is the job
completely unloaded, vs. when the job scheduler restarts a
job for whatever reason?
Like this I'll stop and restart the job a few times, or
maybe I can trick my job into failing and have the scheduler
restart it. Ok, let me think about this...
On Wed, Mar 30, 2022 at 10:24 AM 胡伟华
<huweihua....@gmail.com> wrote:
> So if I run the same jobs in my dev env, will I
> still be able to see a similar dump?

I think running the same job in dev should reproduce it;
maybe you can give it a try.

> If not, I would have to wait for a low-volume time
> to do it on production. Also, if I recall, the dump
> is as big as the JVM memory, right? So if I have 10GB
> configured for the JVM, the dump will be a 10GB file?
Yes, jmap will pause the JVM; the length of the pause
depends on the size of the dump. You can use "jmap
-dump:live" to dump only the reachable objects, which
keeps the pause brief.
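Concretely, a dump command would look something like this (the file path and <tm-pid> are placeholders):

  jmap -dump:live,format=b,file=/tmp/taskmanager.hprof <tm-pid>

The live option forces a full GC first and only writes reachable objects, so the file is usually smaller than the configured heap, but it's safest to plan for up to the heap size in free disk space.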
On March 30, 2022, at 9:47 PM, John Smith
<java.dev....@gmail.com> wrote:
I have 3 task managers (see config below). There is
a total of 10 jobs with 25 slots being used.
The jobs are 100% ETL, i.e. they load JSON,
transform it, and push it to JDBC; only 1 job of the
10 pushes to an Apache Ignite cluster.
For jmap: I know that it will pause the task
manager. So if I run the same jobs in my dev env,
will I still be able to see a similar dump? I
assume so. If not, I would have to wait for a low-volume
time to do it on production. Also, if I
recall, the dump is as big as the JVM memory, right?
So if I have 10GB configured for the JVM, the dump
will be a 10GB file?
# Operating system has 16GB total.
env.ssh.opts: -l flink -oStrictHostKeyChecking=no
cluster.evenly-spread-out-slots: true
taskmanager.memory.flink.size: 10240m
taskmanager.memory.jvm-metaspace.size: 2048m
taskmanager.numberOfTaskSlots: 16
parallelism.default: 1
high-availability: zookeeper
high-availability.storageDir: file:///mnt/flink/ha/flink_1_14/
high-availability.zookeeper.quorum: ...
high-availability.zookeeper.path.root: /flink_1_14
high-availability.cluster-id: /flink_1_14_cluster_0001
web.upload.dir: /mnt/flink/uploads/flink_1_14
state.backend: rocksdb
state.backend.incremental: true
state.checkpoints.dir: file:///mnt/flink/checkpoints/flink_1_14
state.savepoints.dir: file:///mnt/flink/savepoints/flink_1_14
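(A side note on the sizing above, my own back-of-the-envelope math rather than anything from the thread: with taskmanager.memory.flink.size = 10240m, jvm-metaspace = 2048m, and Flink's default JVM overhead, which is capped at 1g if I recall correctly, the total process size works out to roughly 10240m + 2048m + 1024m ≈ 13 GB. That fits in the 16 GB machine, but leaves limited headroom for the OS and page cache.)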
On Wed, Mar 30, 2022 at 2:16 AM 胡伟华
<huweihua....@gmail.com> wrote:
Hi, John
Could you tell us your application scenario? Is
it a Flink session cluster with a lot of jobs?
Maybe you can try dumping the memory with jmap
and use a tool such as MAT to analyze whether
there are abnormal classes and classloaders.
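If a full heap dump is too disruptive to start with, a cheaper first check (exact behaviour depends a bit on the JDK version) is the classloader statistics view, which prints the number of classes loaded per classloader:

  jmap -clstats <tm-pid>

A long tail of dead or duplicate user-code classloaders, each holding thousands of classes, usually points at a classloader/metaspace leak. (<tm-pid> is a placeholder for the TaskManager process id.)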
> On March 30, 2022, at 6:09 AM, John Smith <java.dev....@gmail.com> wrote:
>
> Hi, running 1.14.4.
>
> My task managers still fail with java.lang.OutOfMemoryError: Metaspace. The metaspace out-of-memory error has occurred. This can mean two things: either the job requires a larger size of JVM metaspace to load classes or there is a class loading leak.
>
> I have 2GB of metaspace configured: taskmanager.memory.jvm-metaspace.size: 2048m
>
> But the task nodes still fail.
>
> When looking at the UI metrics, the metaspace starts low. Now I see 85% usage. It seems to be a class loading leak at this point; how can we debug this issue?
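For what it's worth, one common cause of this exact pattern (metaspace climbing across automatic job restarts) is a class from the per-job user-code classloader being registered with something that outlives the job: a JDBC driver in java.sql.DriverManager, a ThreadLocal, a shutdown hook, and so on. Since some of these jobs write over JDBC, here is a minimal, hypothetical sketch (class and method names are mine, not from the thread) of a cleanup that could be called from a sink's close() so DriverManager does not pin the classloader:

import java.sql.Driver;
import java.sql.DriverManager;
import java.sql.SQLException;
import java.util.Enumeration;

public final class JdbcDriverCleanup {

    // Deregister any JDBC drivers that were loaded by the job's user-code
    // classloader, so DriverManager (loaded by the system classloader) does
    // not keep the whole classloader, and its metaspace, alive after cancel.
    public static void deregisterDriversLoadedBy(ClassLoader userCodeLoader) {
        Enumeration<Driver> drivers = DriverManager.getDrivers();
        while (drivers.hasMoreElements()) {
            Driver driver = drivers.nextElement();
            if (driver.getClass().getClassLoader() == userCodeLoader) {
                try {
                    DriverManager.deregisterDriver(driver);
                } catch (SQLException e) {
                    // Best effort: report and continue with the remaining drivers.
                    System.err.println("Could not deregister " + driver + ": " + e);
                }
            }
        }
    }

    private JdbcDriverCleanup() {}
}

Alternatively, shipping the JDBC driver jar in Flink's lib/ directory, so it is loaded by the parent classloader instead of the per-job one, sidesteps the problem; the MAT "Path To GC Roots" view should show whether DriverManager (or something else entirely) is what keeps the old classloaders alive.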