You're misinterpreting the docs.
The parent/child-first classloading controls where Flink looks for a
class /first/, specifically whether we first load from /lib or the user-jar.
It does not allow you to load something from the user-jar in the parent
classloader. That's just not how it works.
It must be in /lib.
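As a concrete illustration, a minimal sketch of the /lib route (paths and jar names below are only placeholders for whatever drivers your jobs actually use):

    # on every task manager host (hypothetical paths/versions)
    cp ignite-core-2.13.0.jar mssql-jdbc-10.2.0.jre11.jar /opt/flink/lib/
    # restart the task managers so the jars in lib/ are picked up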
On 27/04/2022 04:59, John Smith wrote:
Hi Chesnay, as per the docs...
https://nightlies.apache.org/flink/flink-docs-master/docs/ops/debugging/debugging_classloading/
You can either put the jars in the task manager lib folder or use
classloader.parent-first-patterns.additional
<https://nightlies.apache.org/flink/flink-docs-master/docs/deployment/config/#classloader-parent-first-patterns-additional>
I prefer the latter, since the dependency stays with the user-jar
and not on the task manager.
On Tue, Apr 26, 2022 at 9:52 PM John Smith <java.dev....@gmail.com> wrote:
Ok, so I should put the Apache Ignite and my Microsoft drivers in
the lib folders of my task managers?
And then in my job jar only include them as compile-time
dependencies?
On Tue, Apr 26, 2022 at 10:42 AM Chesnay Schepler
<ches...@apache.org> wrote:
JDBC drivers are well-known for leaking classloaders
unfortunately.
You have correctly identified your alternatives.
You must put the JDBC driver into /lib instead. Setting only
the parent-first pattern shouldn't affect anything.
That is only relevant if something is in both /lib and the
user-jar, telling Flink to prioritize what is in /lib.
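A rough flink-conf.yaml sketch of how the two pieces fit together (the pattern below is just the Ignite example from this thread, and it only has an effect once the driver jar also sits in /lib):

    # flink-conf.yaml
    classloader.parent-first-patterns.additional: org.apache.ignite.
    # the matching classes must also exist in /lib for this to change anything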
On 26/04/2022 15:35, John Smith wrote:
So I put classloader.parent-first-patterns.additional:
"org.apache.ignite." in the task config and so far I don't
think I'm getting "java.lang.OutOfMemoryError: Metaspace" any
more.
Or it's too early to tell.
Though now, the task managers are shutting down due to some
other failures.
So maybe the task manager was running out of Metaspace because
tasks were failing and reloading often. But now maybe it's
just cleanly shutting down.
On Wed, Apr 20, 2022 at 11:35 AM John Smith
<java.dev....@gmail.com> wrote:
Or can I put something in the config to treat org.apache.ignite.
classes as parent-first?
On Tue, Apr 19, 2022 at 10:18 PM John Smith
<java.dev....@gmail.com> wrote:
Ok, so I loaded the dump into Eclipse MAT and followed:
https://cwiki.apache.org/confluence/display/FLINK/Debugging+ClassLoader+leaks
- On the Histogram, I got over 30 entries for: ChildFirstClassLoader
- Then I clicked on one of them, chose "Merge Shortest Path..." and picked "Exclude all phantom/weak/soft references"
- Which then gave me: SqlDriverManager > Apache Ignite JdbcThin Driver
So I'm guessing it's anything JDBC-based. Should I copy those drivers into the task manager lib folder and make the dependencies compile-only in my jobs?
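For the "compile only" part, one possible Gradle sketch (artifact coordinates and versions are assumptions; Maven's provided scope would be the equivalent):

    dependencies {
        // drivers are served from the task managers' lib/ folder at runtime,
        // so the job jar only needs them on the compile classpath
        compileOnly 'org.apache.ignite:ignite-core:2.13.0'
        compileOnly 'com.microsoft.sqlserver:mssql-jdbc:10.2.0.jre11'
    }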
On Tue, Apr 19, 2022 at 12:18 PM Yaroslav Tkachenko
<yaros...@goldsky.io> wrote:
Also
https://shopify.engineering/optimizing-apache-flink-applications-tips
might be helpful (has a section on profiling, as
well as classloading).
On Tue, Apr 19, 2022 at 4:35 AM Chesnay Schepler
<ches...@apache.org> wrote:
We have a very rough "guide" in the wiki
(it's just the specific steps I took to debug
another leak):
https://cwiki.apache.org/confluence/display/FLINK/Debugging+ClassLoader+leaks
On 19/04/2022 12:01, huweihua wrote:
Hi, John
Sorry for the late reply. You can use MAT [1] to analyze the dump file. Check whether there are too many loaded classes.
[1] https://www.eclipse.org/mat/
On April 18, 2022, at 9:55 PM, John Smith <java.dev....@gmail.com> wrote:
Hi, can anyone help with this? I've never looked at a dump file before.
On Thu, Apr 14, 2022 at 11:59 AM John Smith
<java.dev....@gmail.com> wrote:
Hi, so I have a dump file. What do I
look for?
On Thu, Mar 31, 2022 at 3:28 PM John
Smith <java.dev....@gmail.com> wrote:
Ok, so if there's a leak and I manually stop the job and restart it from the UI multiple times, I won't see the issue because the classes are unloaded correctly?
On Thu, Mar 31, 2022 at 9:20 AM
huweihua <huweihua....@gmail.com>
wrote:
The difference is that manually canceling the job stops the JobMaster, but automatic failover keeps the JobMaster running. Looking at the TaskManager, though, it doesn't make much difference.
On March 31, 2022, at 4:01 AM, John Smith <java.dev....@gmail.com> wrote:
Also, if I manually cancel and restart the same job over and over, is it the same as Flink restarting a job due to failure?
I.e., when I click "Cancel Job" in the UI, is the job completely unloaded, versus when the job scheduler restarts a job for whatever reason?
This way I'll stop and restart the job a few times, or maybe I can trick my job into failing and have the scheduler restart it. Ok, let me think about this...
On Wed, Mar 30, 2022 at 10:24
AM 胡伟华
<huweihua....@gmail.com> wrote:
> So if I run the same jobs in my dev env will I still be able to see a similar dump?

I think running the same job in dev should be reproducible; maybe you can have a try.

> If not I would have to wait for a low-volume time to do it on production. Also, if I recall, the dump is as big as the JVM memory, right? So if I have 10GB configured for the JVM, the dump will be a 10GB file?

Yes, JMAP will pause the JVM; the pause time depends on the size of the dump. You can use "jmap -dump:live" to dump only the reachable objects, which takes only a brief pause.
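For reference, a hedged sketch of the jmap invocations (PID and output paths are placeholders):

    # full heap dump; pauses the JVM while the file is written
    jmap -dump:format=b,file=/tmp/taskmanager.hprof <taskmanager-pid>
    # live objects only; triggers a GC first, smaller dump and shorter pause
    jmap -dump:live,format=b,file=/tmp/taskmanager-live.hprof <taskmanager-pid>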
On March 30, 2022, at 9:47 PM, John Smith <java.dev....@gmail.com> wrote:
I have 3 task managers (see config below). There is a total of 10 jobs with 25 slots being used.
The jobs are 100% ETL, i.e. they load JSON, transform it and push it to JDBC; only 1 job of the 10 is pushing to an Apache Ignite cluster.
For JMAP: I know that it will pause the task manager. So if I run the same jobs in my dev env, will I still be able to see a similar dump? I assume so. If not, I would have to wait for a low-volume time to do it on production. Also, if I recall, the dump is as big as the JVM memory, right? So if I have 10GB configured for the JVM, the dump will be a 10GB file?
# Operating system has 16GB total.
env.ssh.opts: -l flink -oStrictHostKeyChecking=no
cluster.evenly-spread-out-slots: true
taskmanager.memory.flink.size: 10240m
taskmanager.memory.jvm-metaspace.size: 2048m
taskmanager.numberOfTaskSlots: 16
parallelism.default: 1
high-availability: zookeeper
high-availability.storageDir: file:///mnt/flink/ha/flink_1_14/
high-availability.zookeeper.quorum: ...
high-availability.zookeeper.path.root: /flink_1_14
high-availability.cluster-id: /flink_1_14_cluster_0001
web.upload.dir: /mnt/flink/uploads/flink_1_14
state.backend: rocksdb
state.backend.incremental: true
state.checkpoints.dir: file:///mnt/flink/checkpoints/flink_1_14
state.savepoints.dir: file:///mnt/flink/savepoints/flink_1_14
On Wed, Mar 30, 2022 at
2:16 AM 胡伟华
<huweihua....@gmail.com>
wrote:
Hi, John
Could you tell us your application scenario? Is it a Flink session cluster with a lot of jobs?
Maybe you can try to dump the memory with jmap and use tools such as MAT to analyze whether there are abnormal classes and classloaders.
> On March 30, 2022, at 6:09 AM, John Smith <java.dev....@gmail.com> wrote:
>
> Hi, running 1.14.4
>
> My task managers still fail with java.lang.OutOfMemoryError: Metaspace. The metaspace out-of-memory error has occurred. This can mean two things: either the job requires a larger size of JVM metaspace to load classes or there is a class loading leak.
>
> I have 2GB of metaspace configured: taskmanager.memory.jvm-metaspace.size: 2048m
>
> But the task nodes still fail.
>
> When looking at the UI metrics, the metaspace starts low. Now I see 85% usage. It seems to be a class loading leak at this point; how can we debug this issue?