Re: How to debug Metaspace exception?

Chesnay Schepler Mon, 02 May 2022 06:36:21 -0700

yes.

But if you can ensure that the driver isn't bundled by any user-jar you can also skip the pattern configuration step.

The pattern looks correct formatting-wise; you could try whether com.microsoft.sqlserver.jdbc. is enough to solve the issue.
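To illustrate how such a pattern string can be read, here is a minimal Java sketch. It assumes each semicolon-separated entry is treated as a plain class-name prefix; the class and method names (ParentFirstPatternSketch, parsePatterns, isParentFirst) are made up for this example and are not Flink APIs.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

/** Illustrative only: models prefix matching of parent-first patterns. */
final class ParentFirstPatternSketch {

    /** Splits a semicolon-separated pattern string into trimmed, non-empty prefixes. */
    static List<String> parsePatterns(String configValue) {
        return Arrays.stream(configValue.split(";"))
                .map(String::trim)
                .filter(p -> !p.isEmpty())
                .collect(Collectors.toList());
    }

    /** Returns true if the class name starts with any of the given prefixes. */
    static boolean isParentFirst(String className, List<String> prefixes) {
        for (String prefix : prefixes) {
            if (className.startsWith(prefix)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        List<String> prefixes =
                parsePatterns("org.apache.ignite.;com.microsoft.sqlserver.jdbc.");
        // Driver classes fall under the configured prefixes; job classes do not.
        System.out.println(isParentFirst("com.microsoft.sqlserver.jdbc.SQLServerDriver", prefixes));
        System.out.println(isParentFirst("org.apache.ignite.IgniteJdbcThinDriver", prefixes));
        System.out.println(isParentFirst("com.example.MyJob", prefixes));
    }
}
```

Under that assumption, the trailing dot in each entry keeps the prefix from accidentally matching sibling packages such as a hypothetical com.microsoft.sqlserver.jdbcextras.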


On 02/05/2022 14:41, John Smith wrote:

Oh, so I should copy the jars to the lib folder and set classloader.parent-first-patterns.additional: "org.apache.ignite.;com.microsoft.sqlserver.jdbc." on both the task managers and job managers?

Also, is my pattern correct? "org.apache.ignite.;com.microsoft.sqlserver.jdbc."

Just to be sure, I'm running a standalone cluster using ZooKeeper. So I have 3 ZooKeeper nodes, 3 job managers and 3 task managers.

On Mon, May 2, 2022 at 2:57 AM Chesnay Schepler <ches...@apache.org> wrote:


    And you should make sure that it is set for both processes!

    On 02/05/2022 08:43, Chesnay Schepler wrote:

    The setting itself isn't taskmanager specific; it applies to both
    the job- and taskmanager process.

    On 02/05/2022 05:29, John Smith wrote:

    Also just to be sure this is a Task Manager setting right?

    On Thu, Apr 28, 2022 at 11:13 AM John Smith
    <java.dev....@gmail.com> wrote:

        I assume you will take action on your side to track and fix
        the doc? :)

        On Thu, Apr 28, 2022 at 11:12 AM John Smith
        <java.dev....@gmail.com> wrote:

            Ok so to summarize...

            - Build my job jar and have the JDBC driver as a compile
            only dependency and copy the JDBC driver to flink lib
            folder.

            Or

            - Build my job jar and include JDBC driver in the
            shadow, plus copy the JDBC driver in the flink lib
            folder, plus make an entry in the config for
            |classloader.parent-first-patterns.additional|
            <https://nightlies.apache.org/flink/flink-docs-master/docs/deployment/config/#classloader-parent-first-patterns-additional>


            On Thu, Apr 28, 2022 at 10:17 AM Chesnay Schepler
            <ches...@apache.org> wrote:

                I think what I meant was "either add it to /lib, or
                [if it is already in /lib but also bundled in the
                jar] add it to the parent-first patterns."

                On 28/04/2022 15:56, Chesnay Schepler wrote:

                Pretty sure, even though I seemingly documented it
                incorrectly :)

                On 28/04/2022 15:49, John Smith wrote:

                You sure?

                 *

                    /JDBC/: JDBC drivers leak references outside
                    the user code classloader. To ensure that
                    these classes are only loaded once you should
                    either add the driver jars to Flink’s
                    |lib/| folder, or add the driver classes to
                    the list of parent-first loaded class via
                    |classloader.parent-first-patterns-additional|
                    
<https://nightlies.apache.org/flink/flink-docs-master/docs/deployment/config/#classloader-parent-first-patterns-additional>.

                    It says either or


                On Wed, Apr 27, 2022 at 3:44 AM Chesnay Schepler
                <ches...@apache.org> wrote:

                    You're misinterpreting the docs.

                    The parent/child-first classloading controls
                    where Flink looks for a class /first/,
                    specifically whether we first load from /lib
                    or the user-jar.
                    It does not allow you to load something from
                    the user-jar in the parent classloader. That's
                    just not how it works.

                    It must be in /lib.
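As a rough model of the lookup order described above, the following Java sketch simulates where a class ends up being loaded from. It is a toy simulation, not Flink's classloader: the names ClassLoadingOrderSketch, Source and resolve are invented for illustration, and the model assumes child-first resolution as the default with prefix-based parent-first patterns.

```java
import java.util.Collections;
import java.util.List;
import java.util.Set;

/** Toy model of the lookup order; not Flink's actual classloader. */
final class ClassLoadingOrderSketch {

    enum Source { LIB, USER_JAR, NOT_FOUND }

    /**
     * Decides where a class would be loaded from, given what /lib and the
     * user-jar contain and which prefixes are configured as parent-first.
     */
    static Source resolve(String className,
                          Set<String> libClasses,
                          Set<String> userJarClasses,
                          List<String> parentFirstPrefixes) {
        boolean parentFirst = parentFirstPrefixes.stream().anyMatch(className::startsWith);
        boolean inLib = libClasses.contains(className);
        boolean inUserJar = userJarClasses.contains(className);

        if (parentFirst) {
            // Look in /lib first, then fall back to the user-jar.
            if (inLib) return Source.LIB;
            return inUserJar ? Source.USER_JAR : Source.NOT_FOUND;
        }
        // Child-first: the user-jar wins when both contain the class.
        if (inUserJar) return Source.USER_JAR;
        return inLib ? Source.LIB : Source.NOT_FOUND;
    }

    public static void main(String[] args) {
        String driver = "com.microsoft.sqlserver.jdbc.SQLServerDriver";
        Set<String> none = Collections.emptySet();
        Set<String> has = Collections.singleton(driver);
        List<String> pattern = Collections.singletonList("com.microsoft.sqlserver.jdbc.");
        List<String> noPattern = Collections.emptyList();

        // Pattern set, but the driver is only in the user-jar: still the user-jar.
        System.out.println(resolve(driver, none, has, pattern));
        // Driver in both places, pattern set: /lib is used.
        System.out.println(resolve(driver, has, has, pattern));
        // Driver in both places, no pattern: the user-jar copy is used.
        System.out.println(resolve(driver, has, has, noPattern));
    }
}
```

In this model the pattern only changes the order of the lookup; it never moves a class that exists solely in the user-jar into the parent, which is why the driver has to be in /lib.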

                    On 27/04/2022 04:59, John Smith wrote:

                    Hi Chesnay as per the docs...
                    
https://nightlies.apache.org/flink/flink-docs-master/docs/ops/debugging/debugging_classloading/

                    You can either put the jars in task manager
                    lib folder or use
                    |classloader.parent-first-patterns-additional|
                    
<https://nightlies.apache.org/flink/flink-docs-master/docs/deployment/config/#classloader-parent-first-patterns-additional>

                    I prefer the latter like this: the
                    dependency stays with the user-jar and not on
                    the task manager.

                    On Tue, Apr 26, 2022 at 9:52 PM John Smith
                    <java.dev....@gmail.com> wrote:

                        Ok so I should put the Apache ignite and
                        my Microsoft drivers in the lib folders
                        of my task managers?

                        And then in my job jar only include them
                        as compile time dependencies?


                        On Tue, Apr 26, 2022 at 10:42 AM Chesnay
                        Schepler <ches...@apache.org> wrote:

                            JDBC drivers are well-known for
                            leaking classloaders unfortunately.

                            You have correctly identified your
                            alternatives.

                            You must put the jdbc driver into
                            /lib instead. Setting only the
                            parent-first pattern shouldn't affect
                            anything.
                            That is only relevant if something is
                            both in /lib and the user-jar,
                            telling Flink to prioritize what is
                            in lib.



                            On 26/04/2022 15:35, John Smith wrote:

                            So I
                            put classloader.parent-first-patterns.additional:
                            "org.apache.ignite." in the task
                            config and so far I don't think I'm
                            getting "java.lang.OutOfMemoryError:
                            Metaspace" any more.

                            Or it's too early to tell.

                            Though now, the task managers are
                            shutting down due to some
                            other failures.

                            So maybe because tasks were failing
                            and reloading often the task manager
                            was running out of Metaspace. But now
                            maybe it's just cleanly shutting down.

                            On Wed, Apr 20, 2022 at 11:35 AM
                            John Smith <java.dev....@gmail.com>
                            wrote:

                                Or I can put in the config to
                                treat org.apache.ignite. classes
                                as first class?

                                On Tue, Apr 19, 2022 at 10:18 PM
                                John Smith
                                <java.dev....@gmail.com> wrote:

                                    Ok, so I loaded the dump
                                    into Eclipse MAT and
                                    followed:
                                    
https://cwiki.apache.org/confluence/display/FLINK/Debugging+ClassLoader+leaks

                                    - On the Histogram, I got
                                    over 30 entries for:
                                    ChildFirstClassLoader
                                    - Then I clicked on one of
                                    them "Merge Shortest
                                    Path..." and picked "Exclude
                                    all phantom/weak/soft
                                    references"
                                    - Which then gave me:
                                    SqlDriverManager > Apache
                                    Ignite JdbcThin Driver
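To make the reference path above more concrete: assuming "SqlDriverManager" refers to the JDK's java.sql.DriverManager, registered drivers are held by a JDK-level registry, so a driver class loaded by a job's classloader can keep that classloader reachable. The sketch below only lists the drivers visible to the caller together with the classloader that defined each one; RegisteredDriversSketch and describeDrivers are names invented for this example.

```java
import java.sql.Driver;
import java.sql.DriverManager;
import java.util.ArrayList;
import java.util.Enumeration;
import java.util.List;

/** Illustrative only: shows which classloader defined each registered JDBC driver. */
final class RegisteredDriversSketch {

    /** Returns one line per driver visible to the caller, naming its defining classloader. */
    static List<String> describeDrivers() {
        List<String> lines = new ArrayList<>();
        Enumeration<Driver> drivers = DriverManager.getDrivers();
        while (drivers.hasMoreElements()) {
            Driver driver = drivers.nextElement();
            // A null classloader means the class came from the bootstrap loader.
            lines.add(driver.getClass().getName()
                    + " loaded by " + driver.getClass().getClassLoader());
        }
        return lines;
    }

    public static void main(String[] args) {
        // The list may be empty when no JDBC driver is on the classpath.
        for (String line : describeDrivers()) {
            System.out.println(line);
        }
    }
}
```

DriverManager.getDrivers() only returns drivers the calling code is allowed to see, so the output depends on where the snippet runs.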

                                    So I'm guessing this applies to
                                    anything JDBC based: I should copy
                                    it into the task manager lib
                                    folder and have my jobs declare the
                                    dependencies as compile only?

                                    On Tue, Apr 19, 2022 at
                                    12:18 PM Yaroslav Tkachenko
                                    <yaros...@goldsky.io> wrote:

                                        Also
                                        
https://shopify.engineering/optimizing-apache-flink-applications-tips
                                        might be helpful (has a
                                        section on profiling, as
                                        well as classloading).

                                        On Tue, Apr 19, 2022 at
                                        4:35 AM Chesnay Schepler
                                        <ches...@apache.org> wrote:

                                            We have a very rough
                                            "guide" in the wiki
                                            (it's just the
                                            specific steps I
                                            took to debug
                                            another leak):
                                            
https://cwiki.apache.org/confluence/display/FLINK/Debugging+ClassLoader+leaks

                                            On 19/04/2022 12:01,
                                            huweihua wrote:

                                            Hi, John

                                            Sorry for the late
                                            reply. You can use
                                            MAT[1] to analyze
                                            the dump file.
                                            Check whether there are
                                            too many loaded
                                            classes.

                                            [1]
                                            https://www.eclipse.org/mat/

                                            On Apr 18, 2022, at
                                            9:55 PM, John Smith
                                            <java.dev....@gmail.com>
                                            wrote:

                                            Hi, can anyone
                                            help with this? I
                                            never looked at a
                                            dump file before.

                                            On Thu, Apr 14,
                                            2022 at 11:59 AM
                                            John Smith
                                            <java.dev....@gmail.com>
                                            wrote:

                                                Hi, so I have
                                                a dump file.
                                                What do I look
                                                for?

                                                On Thu, Mar
                                                31, 2022 at
                                                3:28 PM John
                                                Smith
                                                <java.dev....@gmail.com>
                                                wrote:

                                                    Ok, so if there's a leak and I manually stop the job and
                                                    restart it from the UI multiple times, I won't see the
                                                    issue because the classes are unloaded correctly?


                                                    On Thu, Mar 31, 2022 at 9:20 AM huweihua
                                                    <huweihua....@gmail.com> wrote:

                                                        The difference is that manually canceling the job stops
                                                        the JobMaster, but automatic failover keeps the JobMaster
                                                        running. From the TaskManager's side, though, it doesn't
                                                        make much difference.

                                                        On Mar 31, 2022, at 4:01 AM, John Smith
                                                        <java.dev....@gmail.com> wrote:

                                                        Also, if I manually cancel and restart the same job over
                                                        and over, is it the same as if Flink was restarting a job
                                                        due to failure?

                                                        I.e., when I click "Cancel Job" on the UI, is the job
                                                        completely unloaded, vs. when the job scheduler restarts
                                                        a job for whatever reason?

                                                        Like this I'll stop and restart the job a few times, or
                                                        maybe I can trick my job to fail and have the scheduler
                                                        restart it. Ok, let me think about this...

                                                        On Wed, Mar 30, 2022 at 10:24 AM 胡伟华
                                                        <huweihua....@gmail.com> wrote:

                                                            I think the issue should be reproducible by running the
                                                            same job in dev; maybe you can give it a try.

                                                             If not, I would have to wait for a low-volume time to
                                                            do it on production. Also, if I recall, the dump is as
                                                            big as the JVM memory, right? So if I have 10GB
                                                            configured for the JVM, the dump will be a 10GB file?

                                                            Yes, jmap will pause the JVM; the length of the pause
                                                            depends on the size of the dump. You can use
                                                            "jmap -dump:live" to dump only the reachable objects;
                                                            this will take only a brief pause.
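For reference, a heap dump restricted to live objects can also be requested from inside a JVM. The sketch below is an assumption-laden illustration, not what jmap does internally: it relies on the HotSpot-specific com.sun.management.HotSpotDiagnosticMXBean, dumps the JVM it runs in (rather than attaching to a task manager process), and uses invented names (LiveHeapDumpSketch, dumpToTempFile) and a temporary output path.

```java
import com.sun.management.HotSpotDiagnosticMXBean;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.lang.management.ManagementFactory;
import java.nio.file.Files;
import java.nio.file.Path;

/** Illustrative only: writes a live-objects heap dump of the current JVM. */
final class LiveHeapDumpSketch {

    /** Dumps the current JVM's heap to a fresh temporary .hprof file and returns its path. */
    static Path dumpToTempFile() {
        try {
            // The target file must not exist yet, so use a new temporary directory.
            Path dir = Files.createTempDirectory("heapdump-sketch");
            Path target = dir.resolve("live.hprof");
            HotSpotDiagnosticMXBean bean =
                    ManagementFactory.getPlatformMXBean(HotSpotDiagnosticMXBean.class);
            // 'true' restricts the dump to reachable (live) objects.
            bean.dumpHeap(target.toString(), true);
            return target;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        Path file = dumpToTempFile();
        System.out.println("Wrote " + file.toFile().length() + " bytes to " + file);
    }
}
```

The resulting .hprof file can be opened in MAT the same way as a dump produced with jmap.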

                                                            On Mar 30, 2022, at 9:47 PM, John Smith
                                                            <java.dev....@gmail.com> wrote:

                                                            I have 3 task managers (see config below). There is a
                                                            total of 10 jobs with 25 slots being used. The jobs are
                                                            100% ETL, i.e., they load JSON, transform it and push it
                                                            to JDBC; only 1 job of the 10 is pushing to an Apache
                                                            Ignite cluster.

                                                            For jmap: I know that it will pause the task manager. So if I run
                                                            the same jobs in my dev env, will I still be able to see a similar
                                                            dump? I assume so. If not, I would have to wait for a low-volume
                                                            time to do it on production. Also, if I recall, the dump is as big
                                                            as the JVM memory, right? So if I have 10GB configured for the JVM,
                                                            the dump will be a 10GB file?


                                                            # Operating system has 16GB total.
                                                            env.ssh.opts: -l flink -oStrictHostKeyChecking=no

                                                            cluster.evenly-spread-out-slots: true

                                                            taskmanager.memory.flink.size: 10240m
                                                            taskmanager.memory.jvm-metaspace.size: 2048m
                                                            taskmanager.numberOfTaskSlots: 16
                                                            parallelism.default: 1

                                                            high-availability: zookeeper
                                                            high-availability.storageDir: file:///mnt/flink/ha/flink_1_14/
                                                            high-availability.zookeeper.quorum: ...
                                                            high-availability.zookeeper.path.root: /flink_1_14
                                                            high-availability.cluster-id: /flink_1_14_cluster_0001

                                                            web.upload.dir: /mnt/flink/uploads/flink_1_14

                                                            state.backend: rocksdb
                                                            state.backend.incremental: true
                                                            state.checkpoints.dir: file:///mnt/flink/checkpoints/flink_1_14
                                                            state.savepoints.dir: file:///mnt/flink/savepoints/flink_1_14

                                                            On
                                                            Wed,
                                                            Mar
                                                            30,
                                                            2022
                                                            at
                                                            2:16
                                                            AM
                                                            胡伟华
                                                            
<huweihua....@gmail.com>
                                                            wrote:

                                                                Hi, John

                                                                Could you tell us your application scenario? Is it a Flink
                                                                session cluster with a lot of jobs?

                                                                Maybe you can try to dump the memory with jmap and use tools
                                                                such as MAT to analyze whether there are abnormal classes and
                                                                classloaders.
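
A minimal Java sketch of what such a dump involves, assuming a HotSpot JVM. It is not the jmap tool itself: jmap attaches to another process (for example `jmap -dump:live,format=b,file=heap.hprof <pid>`), whereas this sketch only dumps the JVM it runs in, through the standard `HotSpotDiagnosticMXBean`. The class name `HeapDumpSketch`, the method `dumpLiveObjects`, and the file name are hypothetical, chosen for illustration.

```java
import com.sun.management.HotSpotDiagnosticMXBean;
import java.io.IOException;
import java.lang.management.ManagementFactory;
import java.nio.file.Files;
import java.nio.file.Path;

public class HeapDumpSketch {

    /**
     * Writes a heap dump of the current JVM to the given file and returns
     * the size of that file in bytes. The target file must not exist yet.
     * Passing live=true dumps only objects that are still reachable.
     */
    static long dumpLiveObjects(Path target) throws IOException {
        HotSpotDiagnosticMXBean bean =
                ManagementFactory.getPlatformMXBean(HotSpotDiagnosticMXBean.class);
        bean.dumpHeap(target.toString(), true);
        return Files.size(target);
    }

    public static void main(String[] args) throws IOException {
        // Hypothetical location: a fresh temp directory, so the file is new.
        Path dir = Files.createTempDirectory("heapdump-sketch");
        Path target = dir.resolve("sketch.hprof");
        long bytes = dumpLiveObjects(target);
        System.out.println("Wrote " + bytes + " bytes to " + target);
    }
}
```

The resulting `.hprof` file can be opened in MAT. Printing the file size gives a quick way to compare the dump size against the configured memory in a dev environment before trying it on production.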


                                                                > On Mar 30, 2022, at 6:09 AM, John Smith <java.dev....@gmail.com> wrote:
                                                                >
                                                                > Hi running 1.14.4
                                                                >
                                                                > My task manager still fails with java.lang.OutOfMemoryError:
                                                                > Metaspace. The metaspace out-of-memory error has occurred.
                                                                > This can mean two things: either the job requires a larger
                                                                > size of JVM metaspace to load classes or there is a class
                                                                > loading leak.
                                                                >
                                                                > I have 2GB of metaspace configured:
                                                                > taskmanager.memory.jvm-metaspace.size: 2048m
                                                                >
                                                                > But the task nodes still fail.
                                                                >
                                                                > When looking at the UI metrics, the metaspace starts low. Now
                                                                > I see 85% usage. It seems to be a class loading leak at this
                                                                > point, how can we debug this issue?
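
One way to check for the kind of growth described above is to sample the JVM's own class-loading counters over time. The sketch below uses only the standard `java.lang.management` API and reports on the JVM it runs in; reading the same values from a running task manager would need remote JMX or code inside the job, which is not shown. The class name `MetaspaceProbe` and its methods are hypothetical, and the pool name "Metaspace" is an assumption that holds on HotSpot JVMs.

```java
import java.lang.management.ClassLoadingMXBean;
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryPoolMXBean;
import java.lang.management.MemoryUsage;

public class MetaspaceProbe {

    /** Used bytes of the memory pool named "Metaspace", or -1 if no such pool exists. */
    static long metaspaceUsedBytes() {
        for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
            if ("Metaspace".equals(pool.getName())) {
                MemoryUsage usage = pool.getUsage();
                return usage == null ? -1L : usage.getUsed();
            }
        }
        return -1L;
    }

    /** Number of classes currently loaded in this JVM. */
    static long liveClassCount() {
        return ManagementFactory.getClassLoadingMXBean().getLoadedClassCount();
    }

    public static void main(String[] args) {
        ClassLoadingMXBean classes = ManagementFactory.getClassLoadingMXBean();
        // A loaded count that keeps rising while the unloaded count stays flat
        // across job restarts is one hint that classes are not being released.
        System.out.println("metaspace used bytes: " + metaspaceUsedBytes());
        System.out.println("currently loaded:     " + liveClassCount());
        System.out.println("total loaded so far:  " + classes.getTotalLoadedClassCount());
        System.out.println("unloaded so far:      " + classes.getUnloadedClassCount());
    }
}
```

Sampling these values before and after each job submission or cancellation shows whether the usage drops back; if it never does, a heap dump can then narrow down which classloaders are being retained.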
