Re: How to debug Metaspace exception?

Chesnay Schepler Thu, 28 Apr 2022 06:57:02 -0700

Pretty sure, even though I seemingly documented it incorrectly :)


On 28/04/2022 15:49, John Smith wrote:

You sure?

  * /JDBC/: JDBC drivers leak references outside the user code
    classloader. To ensure that these classes are only loaded once you
    should either add the driver jars to Flink’s |lib/| folder, or add
    the driver classes to the list of parent-first loaded class via
    |classloader.parent-first-patterns-additional|
    <https://nightlies.apache.org/flink/flink-docs-master/docs/deployment/config/#classloader-parent-first-patterns-additional>.

    It says either or

On Wed, Apr 27, 2022 at 3:44 AM Chesnay Schepler <ches...@apache.org> wrote:


    You're misinterpreting the docs.

    The parent/child-first classloading controls where Flink looks for
    a class /first/, specifically whether we first load from /lib or
    the user-jar.
    It does not allow you to load something from the user-jar in the
    parent classloader. That's just not how it works.

    It must be in /lib.
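
The lookup order described above can be sketched as a toy model. This is not Flink's classloader code: the class LookupOrderSketch, the method resolve and the two sets standing in for /lib and the user-jar are hypothetical names used only for illustration. It assumes a class counts as parent-first when its name starts with one of the configured patterns.

```java
import java.util.List;
import java.util.Set;

public class LookupOrderSketch {

    /**
     * Toy model: returns the location that would supply the class,
     * "lib", "user-jar", or "not found".
     */
    static String resolve(String className,
                          Set<String> lib,
                          Set<String> userJar,
                          List<String> parentFirstPatterns) {
        // Assumption: a class is parent-first if its name starts with any pattern.
        boolean parentFirst = parentFirstPatterns.stream().anyMatch(className::startsWith);

        // The pattern only changes the order in which the two locations are consulted.
        if (parentFirst) {
            if (lib.contains(className)) return "lib";
            if (userJar.contains(className)) return "user-jar";
        } else {
            if (userJar.contains(className)) return "user-jar";
            if (lib.contains(className)) return "lib";
        }
        return "not found";
    }

    public static void main(String[] args) {
        // Used purely as a string here; nothing is actually loaded.
        String driver = "org.apache.ignite.IgniteJdbcThinDriver";
        List<String> patterns = List.of("org.apache.ignite.");

        // Driver only in the user-jar: the pattern cannot move it to /lib.
        System.out.println(resolve(driver, Set.of(), Set.of(driver), patterns));
        // Driver in both places: the pattern makes /lib win.
        System.out.println(resolve(driver, Set.of(driver), Set.of(driver), patterns));
    }
}
```

In this model a driver that exists only in the user-jar is still supplied by the user-jar even when its package matches a parent-first pattern, which is why the pattern alone does not help.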

    On 27/04/2022 04:59, John Smith wrote:

    Hi Chesnay, as per the docs...
    https://nightlies.apache.org/flink/flink-docs-master/docs/ops/debugging/debugging_classloading/

    You can either put the jars in the task manager lib folder or use
    |classloader.parent-first-patterns-additional|
    <https://nightlies.apache.org/flink/flink-docs-master/docs/deployment/config/#classloader-parent-first-patterns-additional>

    I prefer the latter: that way the dependency stays with the
    user-jar and not on the task manager.

    On Tue, Apr 26, 2022 at 9:52 PM John Smith
    <java.dev....@gmail.com> wrote:

        Ok so I should put the Apache ignite and my Microsoft drivers
        in the lib folders of my task managers?

        And then in my job jar only include them as compile time
        dependencies?


        On Tue, Apr 26, 2022 at 10:42 AM Chesnay Schepler
        <ches...@apache.org> wrote:

            JDBC drivers are well-known for leaking classloaders
            unfortunately.

            You have correctly identified your alternatives.

            You must put the JDBC driver into /lib instead. Setting
            only the parent-first pattern shouldn't affect anything.
            That is only relevant if something is both in /lib and in
            the user-jar, telling Flink to prioritize what is in /lib.



            On 26/04/2022 15:35, John Smith wrote:

            So I put classloader.parent-first-patterns.additional:
            "org.apache.ignite." in the task config and so far I
            don't think I'm getting "java.lang.OutOfMemoryError:
            Metaspace" any more.

            Or it's too early to tell.

            Though now, the task managers are shutting down due to
            some other failures.

            So maybe because tasks were failing and reloading often
            the task manager was running out of Metaspace. But now
            maybe it's just cleanly shutting down.

            On Wed, Apr 20, 2022 at 11:35 AM John Smith
            <java.dev....@gmail.com> wrote:

                Or can I put in the config to treat
                org.apache.ignite. classes as parent-first?

                On Tue, Apr 19, 2022 at 10:18 PM John Smith
                <java.dev....@gmail.com> wrote:

                    Ok, so I loaded the dump into Eclipse MAT and
                    followed:
                    https://cwiki.apache.org/confluence/display/FLINK/Debugging+ClassLoader+leaks

                    - On the Histogram, I got over 30 entries for:
                    ChildFirstClassLoader
                    - Then I clicked on one of them "Merge Shortest
                    Path..." and picked "Exclude all
                    phantom/weak/soft references"
                    - Which then gave me: SqlDriverManager > Apache
                    Ignite JdbcThin Driver

                    So I'm guessing it's anything JDBC-based. Should I
                    copy the drivers into the task manager lib folder and
                    have my jobs declare the dependencies as compile-only?
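
A finding like the one above can be cross-checked from inside a JVM by listing which class loader defined each registered JDBC driver. This is a minimal diagnostic sketch, not something from the wiki guide: the class DriverLoaderReport and the method describeDrivers are hypothetical names, and it relies only on java.sql.DriverManager.getDrivers(), which returns the drivers visible to the calling code.

```java
import java.sql.Driver;
import java.sql.DriverManager;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class DriverLoaderReport {

    /** One line per visible JDBC driver: driver class and the class loader that defined it. */
    static List<String> describeDrivers() {
        List<String> lines = new ArrayList<>();
        for (Driver driver : Collections.list(DriverManager.getDrivers())) {
            // getClassLoader() returns null for classes defined by the bootstrap loader.
            ClassLoader loader = driver.getClass().getClassLoader();
            String loaderName = (loader == null) ? "bootstrap" : loader.getClass().getName();
            lines.add(driver.getClass().getName() + " <- " + loaderName);
        }
        return lines;
    }

    public static void main(String[] args) {
        // Prints nothing when no JDBC driver is on the classpath.
        describeDrivers().forEach(System.out::println);
    }
}
```

If a driver shows up with a ChildFirstClassLoader as its loader, it came from the user-jar, and the JVM-wide DriverManager is then holding a reference into that class loader.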

                    On Tue, Apr 19, 2022 at 12:18 PM Yaroslav
                    Tkachenko <yaros...@goldsky.io> wrote:

                        Also
                        https://shopify.engineering/optimizing-apache-flink-applications-tips
                        might be helpful (has a section on
                        profiling, as well as classloading).

                        On Tue, Apr 19, 2022 at 4:35 AM Chesnay
                        Schepler <ches...@apache.org> wrote:

                            We have a very rough "guide" in the wiki
                            (it's just the specific steps I took to
                            debug another leak):
                            https://cwiki.apache.org/confluence/display/FLINK/Debugging+ClassLoader+leaks

                            On 19/04/2022 12:01, huweihua wrote:

                            Hi, John

                            Sorry for the late reply. You can use
                            MAT[1] to analyze the dump file. Check
                            whether there are too many loaded classes.

                            [1] https://www.eclipse.org/mat/

                            On Apr 18, 2022, at 9:55 PM, John Smith
                            <java.dev....@gmail.com> wrote:

                            Hi, can anyone help with this? I never
                            looked at a dump file before.

                            On Thu, Apr 14, 2022 at 11:59 AM John
                            Smith <java.dev....@gmail.com> wrote:

                                Hi, so I have a dump file. What do
                                I look for?

                                On Thu, Mar 31, 2022 at 3:28 PM
                                John Smith
                                <java.dev....@gmail.com> wrote:

                                    Ok so if there's a leak, if I
                                    manually stop the job and
                                    restart it from the UI
                                    multiple times, I won't see
                                    the issue because the
                                    classes are unloaded correctly?


                                    On Thu, Mar 31, 2022 at 9:20
                                    AM huweihua
                                    <huweihua....@gmail.com> wrote:


                                        The difference is that
                                        manually canceling the job
                                        stops the JobMaster, but
                                        automatic failover keeps
                                        the JobMaster running. But
                                        from the TaskManager's
                                        point of view, it doesn't
                                        make much difference.

                                        On Mar 31, 2022, at 4:01 AM,
                                        John Smith
                                        <java.dev....@gmail.com>
                                        wrote:

                                        Also if I manually cancel
                                        and restart the same job
                                        over and over is it the
                                        same as if flink was
                                        restarting a job due to
                                        failure?

                                        I.e: When I click "Cancel
                                        Job" on the UI is the job
                                        completely unloaded vs
                                        when the job scheduler
                                        restarts a job for
                                        whatever reason?

                                        Like this I'll stop and
                                        restart the job a few
                                        times or maybe I can
                                        trick my job to fail and
                                        have the scheduler
                                        restart it. Ok let me
                                        think about this...

                                        On Wed, Mar 30, 2022 at
                                        10:24 AM 胡伟华
                                        <huweihua....@gmail.com>
                                        wrote:

                                            > So if I run the same
                                            > jobs in my dev env
                                            > will I still be able
                                            > to see the similar
                                            > dump?

                                            I think the issue
                                            should be
                                            reproducible when
                                            running the same job
                                            in dev; maybe you can
                                            give it a try.

                                            > If not I would have
                                            > to wait at a low
                                            > volume time to do it
                                            > on production. Also
                                            > if I recall the dump
                                            > is as big as the JVM
                                            > memory right so if I
                                            > have 10GB configed
                                            > for the JVM the dump
                                            > will be 10GB file?

                                            Yes, jmap will pause
                                            the JVM; the length of
                                            the pause depends on
                                            the size of the dump.
                                            You can use
                                            "jmap -dump:live" to
                                            dump only the
                                            reachable objects,
                                            which should keep the
                                            pause brief.
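
For reference, the "live objects only" option also exists in-process on HotSpot JVMs through HotSpotDiagnosticMXBean.dumpHeap(outputFile, live). The sketch below is an illustration of that flag, not a replacement for running jmap against a task manager: the class LiveHeapDump, the method dump and the file name are hypothetical, and it dumps the JVM it runs in.

```java
import com.sun.management.HotSpotDiagnosticMXBean;
import java.io.IOException;
import java.lang.management.ManagementFactory;
import java.nio.file.Files;
import java.nio.file.Path;

public class LiveHeapDump {

    /** Writes a heap dump of the current JVM; liveOnly=true keeps only reachable objects. */
    static Path dump(Path target, boolean liveOnly) throws IOException {
        HotSpotDiagnosticMXBean bean =
                ManagementFactory.getPlatformMXBean(HotSpotDiagnosticMXBean.class);
        // dumpHeap fails if the target file already exists, so use a fresh path.
        bean.dumpHeap(target.toString(), liveOnly);
        return target;
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("heapdump");
        // Recent JDKs expect the file name to end in ".hprof".
        Path file = dump(dir.resolve("sketch.hprof"), true);
        System.out.println(file + " (" + Files.size(file) + " bytes)");
    }
}
```

A live-only dump contains just the reachable objects, so its size follows what is actually in use rather than the configured JVM memory.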

                                            On Mar 30, 2022, at
                                            9:47 PM, John Smith
                                            <java.dev....@gmail.com>
                                            wrote:

                                            I have 3 task
                                            managers (see config
                                            below). There is a
                                            total of 10 jobs
                                            with 25 slots being
                                            used.
                                            The jobs are 100%
                                            ETL, i.e. they load
                                            JSON, transform it
                                            and push it to JDBC;
                                            only 1 job of the 10
                                            is pushing to an
                                            Apache Ignite cluster.

                                            For jmap: I know
                                            that it will pause
                                            the task manager. So
                                            if I run the same
                                            jobs in my dev env
                                            will I still be able
                                            to see the similar
                                            dump? I assume so.
                                            If not I would have
                                            to wait for a low
                                            volume time to do it
                                            on production. Also,
                                            if I recall, the dump
                                            is as big as the JVM
                                            memory, right? So if I
                                            have 10GB configured
                                            for the JVM the dump
                                            will be a 10GB file?


                                            # Operating system has 16GB total.
                                            env.ssh.opts: -l flink -oStrictHostKeyChecking=no

                                            cluster.evenly-spread-out-slots: true

                                            taskmanager.memory.flink.size: 10240m
                                            taskmanager.memory.jvm-metaspace.size: 2048m
                                            taskmanager.numberOfTaskSlots: 16
                                            parallelism.default: 1

                                            high-availability: zookeeper
                                            high-availability.storageDir: file:///mnt/flink/ha/flink_1_14/
                                            high-availability.zookeeper.quorum: ...
                                            high-availability.zookeeper.path.root: /flink_1_14
                                            high-availability.cluster-id: /flink_1_14_cluster_0001

                                            web.upload.dir: /mnt/flink/uploads/flink_1_14

                                            state.backend: rocksdb
                                            state.backend.incremental: true
                                            state.checkpoints.dir: file:///mnt/flink/checkpoints/flink_1_14
                                            state.savepoints.dir: file:///mnt/flink/savepoints/flink_1_14

                                            On Wed, Mar 30, 2022
                                            at 2:16 AM 胡伟华
                                            <huweihua....@gmail.com>
                                            wrote:

                                                Hi, John

                                                Could you tell
                                                us your
                                                application
                                                scenario? Is it
                                                a Flink session
                                                cluster with a
                                                lot of jobs?

                                                Maybe you can
                                                try to dump the
                                                memory with jmap
                                                and use tools
                                                such as MAT to
                                                analyze whether
                                                there are
                                                abnormal classes
                                                and classloaders


                                                > On Mar 30, 2022,
                                                at 6:09 AM, John
                                                Smith
                                                <java.dev....@gmail.com>
                                                wrote:
                                                >
                                                > Hi running 1.14.4
                                                >
                                                > My task
                                                manager still
                                                fails with
                                                java.lang.OutOfMemoryError:
                                                Metaspace. The
                                                metaspace
                                                out-of-memory
                                                error has
                                                occurred. This
                                                can mean two
                                                things: either
                                                the job requires
                                                a larger size of
                                                JVM metaspace to
                                                load classes or
                                                there is a class
                                                loading leak.
                                                >
                                                > I have 2GB of
                                                metaspace
                                                configured:
                                                taskmanager.memory.jvm-metaspace.size: 2048m
                                                >
                                                > But the task
                                                nodes still fail.
                                                >
                                                > When looking
                                                at the UI
                                                metrics, the
                                                metaspace starts
                                                low. Now I see
                                                85% usage. It
                                                seems to be a
                                                class loading
                                                leak at this
                                                point, how can
                                                we debug this issue?
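
The metaspace figure shown in the UI can also be read directly from the JVM's standard management beans, together with the loaded and unloaded class counts. This is a minimal sketch: the class MetaspaceProbe and its methods are hypothetical names, and it assumes a HotSpot JVM, where the memory pool is named "Metaspace".

```java
import java.lang.management.ClassLoadingMXBean;
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryPoolMXBean;
import java.lang.management.MemoryUsage;
import java.util.OptionalLong;

public class MetaspaceProbe {

    /** Bytes currently used by the pool named "Metaspace", if the JVM exposes one. */
    static OptionalLong metaspaceUsedBytes() {
        for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
            if ("Metaspace".equals(pool.getName())) {
                MemoryUsage usage = pool.getUsage(); // null if the pool is no longer valid
                if (usage != null) {
                    return OptionalLong.of(usage.getUsed());
                }
            }
        }
        return OptionalLong.empty();
    }

    /** Number of classes currently loaded in this JVM. */
    static int loadedClassCount() {
        ClassLoadingMXBean bean = ManagementFactory.getClassLoadingMXBean();
        return bean.getLoadedClassCount();
    }

    /** Total number of classes unloaded since the JVM started. */
    static long unloadedClassCount() {
        ClassLoadingMXBean bean = ManagementFactory.getClassLoadingMXBean();
        return bean.getUnloadedClassCount();
    }

    public static void main(String[] args) {
        System.out.println("metaspace used bytes: " + metaspaceUsedBytes());
        System.out.println("loaded classes:       " + loadedClassCount());
        System.out.println("unloaded classes:     " + unloadedClassCount());
    }
}
```

If the loaded class count keeps climbing across job restarts while the unloaded count barely moves, that points toward a class loading leak rather than a job that simply needs more metaspace.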
