Re: How to debug Metaspace exception?

Chesnay Schepler Sun, 01 May 2022 23:57:28 -0700

And you should make sure that it is set for both processes!


On 02/05/2022 08:43, Chesnay Schepler wrote:

The setting itself isn't taskmanager specific; it applies to both the jobmanager and the taskmanager process.


On 02/05/2022 05:29, John Smith wrote:

Also just to be sure this is a Task Manager setting right?

On Thu, Apr 28, 2022 at 11:13 AM John Smith <java.dev....@gmail.com> wrote:


    I assume you will take action on your side to track and fix the
    doc? :)

    On Thu, Apr 28, 2022 at 11:12 AM John Smith
    <java.dev....@gmail.com> wrote:

        Ok so to summarize...

        - Build my job jar and have the JDBC driver as a compile only
        dependency and copy the JDBC driver to flink lib folder.

        Or

        - Build my job jar and include JDBC driver in the shadow,
        plus copy the JDBC driver in the flink lib folder, plus make
        an entry in config for
        |classloader.parent-first-patterns-additional|
        
<https://nightlies.apache.org/flink/flink-docs-master/docs/deployment/config/#classloader-parent-first-patterns-additional>


        On Thu, Apr 28, 2022 at 10:17 AM Chesnay Schepler
        <ches...@apache.org> wrote:

            I think what I meant was "either add it to /lib, or [if
            it is already in /lib but also bundled in the jar] add it
            to the parent-first patterns."

            On 28/04/2022 15:56, Chesnay Schepler wrote:

            Pretty sure, even though I seemingly documented it
            incorrectly :)

            On 28/04/2022 15:49, John Smith wrote:

            You sure?

             *

                /JDBC/: JDBC drivers leak references outside the
                user code classloader. To ensure that these classes
                are only loaded once you should either add the
                driver jars to Flink’s |lib/| folder, or add the
                driver classes to the list of parent-first loaded
                class via
                |classloader.parent-first-patterns-additional|
                
<https://nightlies.apache.org/flink/flink-docs-master/docs/deployment/config/#classloader-parent-first-patterns-additional>.

                It says either or


            On Wed, Apr 27, 2022 at 3:44 AM Chesnay Schepler
            <ches...@apache.org> wrote:

                You're misinterpreting the docs.

                The parent/child-first classloading controls where
                Flink looks for a class /first/, specifically
                whether we first load from /lib or the user-jar.
                It does not allow you to load something from the
                user-jar in the parent classloader. That's just not
                how it works.

                It must be in /lib.
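The lookup order described above can be modelled in a few lines of Java. This is only a simplified sketch, not Flink's actual classloader: the class `ClassLoadingOrderSketch`, the enum `LoadedBy` and the method `resolve` are hypothetical names, the two sets stand in for the contents of /lib and the user-jar, and the prefix matching of patterns is an assumption of this model.

```java
import java.util.List;
import java.util.Set;

public class ClassLoadingOrderSketch {

    /** Which classloader ends up defining the class in this simplified model. */
    public enum LoadedBy { PARENT_LIB, USER_CODE, NOT_FOUND }

    /** Prefix match (assumed): "org.apache.ignite." matches every class in that package tree. */
    public static boolean isParentFirst(String className, List<String> parentFirstPatterns) {
        for (String pattern : parentFirstPatterns) {
            if (className.startsWith(pattern)) {
                return true;
            }
        }
        return false;
    }

    /**
     * The pattern only changes which location is consulted FIRST.
     * A class that exists only in the user-jar is still defined by the
     * user-code classloader, whatever the patterns say.
     */
    public static LoadedBy resolve(String className,
                                   Set<String> libClasses,
                                   Set<String> userJarClasses,
                                   List<String> parentFirstPatterns) {
        boolean inLib = libClasses.contains(className);
        boolean inUserJar = userJarClasses.contains(className);
        if (isParentFirst(className, parentFirstPatterns)) {
            if (inLib) return LoadedBy.PARENT_LIB;
            if (inUserJar) return LoadedBy.USER_CODE;
        } else {
            if (inUserJar) return LoadedBy.USER_CODE;
            if (inLib) return LoadedBy.PARENT_LIB;
        }
        return LoadedBy.NOT_FOUND;
    }

    public static void main(String[] args) {
        String driver = "org.apache.ignite.IgniteJdbcThinDriver"; // illustrative class name
        List<String> patterns = List.of("org.apache.ignite.");

        // Driver only in the user-jar: the pattern does not move it to the parent.
        System.out.println(resolve(driver, Set.of(), Set.of(driver), patterns));
        // Driver in both places: the pattern makes /lib win.
        System.out.println(resolve(driver, Set.of(driver), Set.of(driver), patterns));
    }
}
```

In this model the first call shows why setting only the pattern cannot help: with nothing in /lib, the driver is still loaded through the user-code classloader.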

                On 27/04/2022 04:59, John Smith wrote:

                Hi Chesnay as per the docs...
                
https://nightlies.apache.org/flink/flink-docs-master/docs/ops/debugging/debugging_classloading/

                You can either put the jars in task manager lib
                folder or use
                |classloader.parent-first-patterns-additional|
                
<https://nightlies.apache.org/flink/flink-docs-master/docs/deployment/config/#classloader-parent-first-patterns-additional>

                I prefer the latter like this: the
                dependency stays with the user-jar and not on the
                task manager.

                On Tue, Apr 26, 2022 at 9:52 PM John Smith
                <java.dev....@gmail.com> wrote:

                    Ok so I should put the Apache ignite and my
                    Microsoft drivers in the lib folders of my
                    task managers?

                    And then in my job jar only include them as
                    compile time dependencies?


                    On Tue, Apr 26, 2022 at 10:42 AM Chesnay
                    Schepler <ches...@apache.org> wrote:

                        JDBC drivers are well-known for leaking
                        classloaders unfortunately.

                        You have correctly identified your
                        alternatives.

                        You must put the jdbc driver into /lib
                        instead. Setting only the parent-first
                        pattern shouldn't affect anything.
                        That is only relevant if something is both in
                        /lib and the user-jar, telling Flink to
                        prioritize what is in /lib.
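One way to see why JDBC drivers tend to pin a classloader: `java.sql.DriverManager` keeps a JVM-wide static registry of drivers, so a driver class defined by a job's classloader stays reachable through that registry after the job is gone. The sketch below is a diagnostic aid under that assumption, not something taken from this thread; `JdbcDriverRegistrySketch` and `describeRegisteredDrivers` are hypothetical names. It lists each registered driver together with the classloader that defined it.

```java
import java.sql.Driver;
import java.sql.DriverManager;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class JdbcDriverRegistrySketch {

    /** One entry per registered driver: driver class name and its defining classloader. */
    public static List<String> describeRegisteredDrivers() {
        List<String> lines = new ArrayList<>();
        // Note: DriverManager only returns drivers that are visible to the
        // caller's classloader, so the result depends on where this code runs.
        for (Driver driver : Collections.list(DriverManager.getDrivers())) {
            ClassLoader loader = driver.getClass().getClassLoader();
            String loaderName = (loader == null) ? "bootstrap" : loader.toString();
            lines.add(driver.getClass().getName() + " <- " + loaderName);
        }
        return lines;
    }

    public static void main(String[] args) {
        describeRegisteredDrivers().forEach(System.out::println);
    }
}
```

If the driver sits in /lib, the classloader printed for it would be the parent one, which lives as long as the process and is therefore not a leak.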



                        On 26/04/2022 15:35, John Smith wrote:

                        So I
                        put classloader.parent-first-patterns.additional:
                        "org.apache.ignite." in the task config
                        and so far I don't think I'm getting
                        "java.lang.OutOfMemoryError: Metaspace"
                        any more.

                        Or it's too early to tell.

                        Though now, the task managers are
                        shutting down due to some other failures.

                        So maybe because tasks were failing and
                        reloading often the task manager was
                        running out of Metaspace. But now maybe
                        it's just cleanly shutting down.

                        On Wed, Apr 20, 2022 at 11:35 AM John
                        Smith <java.dev....@gmail.com> wrote:

                            Or I can put in the config to treat
                            org.apache.ignite. classes as first
                            class?

                            On Tue, Apr 19, 2022 at 10:18 PM John
                            Smith <java.dev....@gmail.com> wrote:

                                Ok, so I loaded the dump into
                                Eclipse Mat and followed:
                                
https://cwiki.apache.org/confluence/display/FLINK/Debugging+ClassLoader+leaks

                                - On the Histogram, I got over 30
                                entries for: ChildFirstClassLoader
                                - Then I clicked on one of them
                                "Merge Shortest Path..." and
                                picked "Exclude all
                                phantom/weak/soft references"
                                - Which then gave me:
                                SqlDriverManager > Apache Ignite
                                JdbcThin Driver

                                So I'm guessing this applies to anything
                                JDBC based. Should I copy the drivers
                                into the task manager lib folder and
                                declare them as compile-only
                                dependencies in my jobs?

                                On Tue, Apr 19, 2022 at 12:18 PM
                                Yaroslav Tkachenko
                                <yaros...@goldsky.io> wrote:

                                    Also
                                    
https://shopify.engineering/optimizing-apache-flink-applications-tips
                                    might be helpful (has a
                                    section on profiling, as well
                                    as classloading).

                                    On Tue, Apr 19, 2022 at 4:35
                                    AM Chesnay Schepler
                                    <ches...@apache.org> wrote:

                                        We have a very rough
                                        "guide" in the wiki (it's
                                        just the specific steps I
                                        took to debug another leak):
                                        
https://cwiki.apache.org/confluence/display/FLINK/Debugging+ClassLoader+leaks

                                        On 19/04/2022 12:01,
                                        huweihua wrote:

                                        Hi, John

                                        Sorry for the late reply. You can use
                                        MAT[1] to analyze the dump file. Check
                                        whether there are too many loaded
                                        classes.

                                        [1]
                                        https://www.eclipse.org/mat/
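Before opening a full dump, a rough first check on "too many loaded classes" is possible with the standard JMX beans. The following is a minimal sketch, assuming a HotSpot JVM where the memory pool is named "Metaspace"; `MetaspaceProbeSketch` and its methods are hypothetical names, and the code reports on the JVM it runs in, so it would have to run inside the process of interest or be adapted to a remote JMX connection.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryPoolMXBean;
import java.lang.management.MemoryUsage;

public class MetaspaceProbeSketch {

    /** Number of classes currently loaded in this JVM. */
    public static int loadedClassCount() {
        return ManagementFactory.getClassLoadingMXBean().getLoadedClassCount();
    }

    /** Number of classes unloaded since this JVM started. */
    public static long unloadedClassCount() {
        return ManagementFactory.getClassLoadingMXBean().getUnloadedClassCount();
    }

    /** Bytes used by the pool named "Metaspace", or -1 if no such pool is reported. */
    public static long metaspaceUsedBytes() {
        for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
            if ("Metaspace".equals(pool.getName())) {
                MemoryUsage usage = pool.getUsage();
                return (usage == null) ? -1L : usage.getUsed();
            }
        }
        return -1L;
    }

    public static void main(String[] args) {
        System.out.println("loaded classes:   " + loadedClassCount());
        System.out.println("unloaded classes: " + unloadedClassCount());
        System.out.println("metaspace used:   " + metaspaceUsedBytes() + " bytes");
    }
}
```

A loaded-class count that keeps growing across job restarts while the unloaded count stays flat would be consistent with a classloader leak; the dump analysis is still needed to find what holds the reference.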

                                        On Apr 18, 2022, at 9:55 PM, John Smith
                                        <java.dev....@gmail.com> wrote:

                                        Hi, can anyone help
                                        with this? I never
                                        looked at a dump file
                                        before.

                                        On Thu, Apr 14, 2022 at
                                        11:59 AM John Smith
                                        <java.dev....@gmail.com>
                                        wrote:

                                            Hi, so I have a
                                            dump file. What do
                                            I look for?

                                            On Thu, Mar 31,
                                            2022 at 3:28 PM
                                            John Smith
                                            <java.dev....@gmail.com>
                                            wrote:

                                                Ok so if there's a leak, and I manually
                                                stop the job and restart it from the UI
                                                multiple times, I won't see the issue
                                                because the classes are unloaded
                                                correctly?


                                                On Thu, Mar 31,
                                                2022 at 9:20 AM
                                                huweihua
                                                <huweihua....@gmail.com>
                                                wrote:


                                                    The difference is that manually
                                                    canceling the job stops the JobMaster,
                                                    but automatic failover keeps the
                                                    JobMaster running. From the
                                                    TaskManager's point of view, though,
                                                    it doesn't make much difference.

                                                    On Mar 31, 2022, at 4:01 AM, John Smith
                                                    <java.dev....@gmail.com> wrote:

                                                    Also, if I manually cancel and restart
                                                    the same job over and over, is it the
                                                    same as if Flink was restarting a job
                                                    due to failure?

                                                    I.e. when I click "Cancel Job" on the
                                                    UI, is the job completely unloaded,
                                                    as opposed to when the job scheduler
                                                    restarts a job for whatever reason?

                                                    Like this I'll stop and restart the
                                                    job a few times, or maybe I can trick
                                                    my job to fail and have the scheduler
                                                    restart it. Ok let me think about
                                                    this...

                                                    On Wed, Mar 30, 2022 at 10:24 AM 胡伟华
                                                    <huweihua....@gmail.com> wrote:

                                                        I think the issue should be
                                                        reproducible by running the same
                                                        job in dev; maybe you can give it
                                                        a try.

                                                        If not I would have to wait for a
                                                        low-volume time to do it on
                                                        production. Also, if I recall, the
                                                        dump is as big as the JVM memory,
                                                        right? So if I have 10GB configured
                                                        for the JVM, the dump will be a
                                                        10GB file?

                                                        Yes, jmap will pause the JVM; the
                                                        length of the pause depends on the
                                                        size of the dump. You can use
                                                        "jmap -dump:live" to dump only the
                                                        reachable objects; this will take
                                                        only a brief pause.
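For reference, a live-objects-only dump can also be triggered from inside a JVM through the HotSpot diagnostic bean, which is the in-process counterpart of the jmap option mentioned above. This is a sketch under assumptions: it needs a HotSpot-based JDK, it dumps the JVM it runs in (not a separate task manager process), and `LiveHeapDumpSketch` and `dumpLiveObjects` are hypothetical names.

```java
import com.sun.management.HotSpotDiagnosticMXBean;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.lang.management.ManagementFactory;
import java.nio.file.Files;
import java.nio.file.Path;

public class LiveHeapDumpSketch {

    /**
     * Writes a heap dump of the current JVM to 'target'.
     * The second argument 'true' restricts the dump to reachable (live) objects.
     * The target file must not exist yet and should end in ".hprof".
     */
    public static Path dumpLiveObjects(Path target) {
        HotSpotDiagnosticMXBean bean =
                ManagementFactory.getPlatformMXBean(HotSpotDiagnosticMXBean.class);
        try {
            bean.dumpHeap(target.toString(), true);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return target;
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("heapdump-sketch");
        Path file = dumpLiveObjects(dir.resolve("live.hprof"));
        System.out.println("Wrote " + Files.size(file) + " bytes to " + file);
    }
}
```

The resulting .hprof file can be opened in a heap analyzer in the same way as a dump produced by jmap.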

                                                        On Mar 30, 2022, at 9:47 PM, John Smith
                                                        <java.dev....@gmail.com> wrote:

                                                        I have 3 task managers (see config
                                                        below). There are a total of 10
                                                        jobs with 25 slots being used. The
                                                        jobs are 100% ETL, i.e. they load
                                                        JSON, transform it and push it to
                                                        JDBC; only 1 job of the 10 is
                                                        pushing to an Apache Ignite cluster.

                                                        For jmap: I know that it will pause
                                                        the task manager. So if I run the
                                                        same jobs in my dev env, will I
                                                        still be able to see a similar
                                                        dump? I assume so. If not I would
                                                        have to wait for a low-volume time
                                                        to do it on production. Also, if I
                                                        recall, the dump is as big as the
                                                        JVM memory, right? So if I have
                                                        10GB configured for the JVM, the
                                                        dump will be a 10GB file?


                                                        # Operating system has 16GB total.
                                                        env.ssh.opts: -l flink -oStrictHostKeyChecking=no

                                                        cluster.evenly-spread-out-slots: true

                                                        taskmanager.memory.flink.size: 10240m
                                                        taskmanager.memory.jvm-metaspace.size: 2048m
                                                        taskmanager.numberOfTaskSlots: 16
                                                        parallelism.default: 1

                                                        high-availability: zookeeper
                                                        high-availability.storageDir: file:///mnt/flink/ha/flink_1_14/
                                                        high-availability.zookeeper.quorum: ...
                                                        high-availability.zookeeper.path.root: /flink_1_14
                                                        high-availability.cluster-id: /flink_1_14_cluster_0001

                                                        web.upload.dir: /mnt/flink/uploads/flink_1_14

                                                        state.backend: rocksdb
                                                        state.backend.incremental: true
                                                        state.checkpoints.dir: file:///mnt/flink/checkpoints/flink_1_14
                                                        state.savepoints.dir: file:///mnt/flink/savepoints/flink_1_14

                                                        On Wed, Mar 30, 2022 at 2:16 AM 胡伟华
                                                        <huweihua....@gmail.com> wrote:

                                                            Hi, John

                                                            Could you tell us your
                                                            application scenario? Is it a
                                                            Flink session cluster with a
                                                            lot of jobs?

                                                            Maybe
                                                            you
                                                            can
                                                            try
                                                            to
                                                            dump
                                                            the
                                                            memory
                                                            with
                                                            jmap
                                                            and
                                                            use
                                                            tools
                                                            such
                                                            as
                                                            MAT
                                                            to
                                                            analyze
                                                            whether
                                                            there
                                                            are
                                                            abnormal
                                                            classes
                                                            and
                                                            classloaders
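
The heap-dump step suggested above can also be triggered from inside a JVM. The sketch below is only an illustration: it uses the JDK's `HotSpotDiagnosticMXBean` (HotSpot JVMs only) to write an `.hprof` file that MAT can open. The class name `HeapDumper` and the default file name are hypothetical, and the code dumps the JVM it runs in, so for a running TaskManager the usual route would still be `jmap -dump:live,format=b,file=<file> <pid>` against the TaskManager process.

```java
import com.sun.management.HotSpotDiagnosticMXBean;
import java.io.File;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.lang.management.ManagementFactory;

// Hypothetical helper, not part of Flink: writes a heap dump of the current JVM.
final class HeapDumper {

    // Dumps only live (reachable) objects to the given file and returns it.
    // dumpHeap fails if the file already exists; recent JDKs also require
    // the name to end in ".hprof".
    static File dumpLiveHeap(File target) {
        HotSpotDiagnosticMXBean bean =
                ManagementFactory.getPlatformMXBean(HotSpotDiagnosticMXBean.class);
        try {
            bean.dumpHeap(target.getAbsolutePath(), true);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return target;
    }

    public static void main(String[] args) {
        // The default file name is an assumption chosen for this example.
        File out = new File(args.length > 0 ? args[0] : "heap-dump.hprof");
        System.out.println("Heap dump written to " + dumpLiveHeap(out).getAbsolutePath());
    }
}
```

In MAT, grouping the dump by class loader is one way to look for user-code classloaders that stay alive after their job has finished.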


                                                            >
                                                            2022年3月30日
                                                            上午6:09，John
                                                            Smith
                                                            
<java.dev....@gmail.com>
                                                            写道：
                                                            >
                                                            > Hi, running 1.14.4
                                                            >
                                                            > My task manager still fails with
                                                            > java.lang.OutOfMemoryError: Metaspace. The metaspace
                                                            > out-of-memory error has occurred. This can mean two
                                                            > things: either the job requires a larger size of JVM
                                                            > metaspace to load classes or there is a class loading
                                                            > leak.
                                                            >
                                                            > I have 2GB of metaspace configured:
                                                            > taskmanager.memory.jvm-metaspace.size: 2048m
                                                            >
                                                            > But the task nodes still fail.
                                                            >
                                                            > When looking at the UI metrics, the metaspace starts
                                                            > low. Now I see 85% usage. It seems to be a class
                                                            > loading leak at this point, how can we debug this
                                                            > issue?
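
One way to narrow down the question above is to watch class-loading counters next to Metaspace usage. The sketch below is a minimal illustration using only the standard `java.lang.management` API; the class name `MetaspaceProbe` is hypothetical, and the pool name `"Metaspace"` is what HotSpot JVMs report. It describes the JVM it runs in, so to observe a TaskManager it would have to run inside that process or be replaced by reading the same MXBeans over JMX.

```java
import java.lang.management.ClassLoadingMXBean;
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryPoolMXBean;
import java.lang.management.MemoryUsage;
import java.util.Optional;

// Hypothetical helper, not part of Flink: reports class-loading and Metaspace figures.
final class MetaspaceProbe {

    private static final long MIB = 1024L * 1024L;

    // Returns usage of the memory pool named "Metaspace", if the JVM exposes one.
    static Optional<MemoryUsage> metaspaceUsage() {
        for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
            if ("Metaspace".equals(pool.getName())) {
                return Optional.of(pool.getUsage());
            }
        }
        return Optional.empty();
    }

    // Builds a short text report of class counts and Metaspace usage.
    static String report() {
        ClassLoadingMXBean cl = ManagementFactory.getClassLoadingMXBean();
        StringBuilder sb = new StringBuilder();
        sb.append(String.format("classes loaded now=%d, total loaded=%d, unloaded=%d%n",
                cl.getLoadedClassCount(),
                cl.getTotalLoadedClassCount(),
                cl.getUnloadedClassCount()));
        Optional<MemoryUsage> usage = metaspaceUsage();
        if (usage.isPresent()) {
            MemoryUsage u = usage.get();
            // getMax() returns -1 when no maximum is defined for the pool.
            String max = u.getMax() < 0 ? "unbounded" : (u.getMax() / MIB) + " MiB";
            sb.append(String.format("metaspace used=%d MiB, committed=%d MiB, max=%s%n",
                    u.getUsed() / MIB, u.getCommitted() / MIB, max));
        } else {
            sb.append(String.format("no memory pool named Metaspace found%n"));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.print(report());
    }
}
```

If the loaded-class count keeps climbing across job restarts while the unloaded count barely moves, that points toward a class loading leak rather than a metaspace size that is simply too small.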
