JDBC drivers are unfortunately well known for leaking classloaders.

You have correctly identified your alternatives.

You must put the JDBC driver into /lib instead. Setting only the parent-first pattern shouldn't affect anything; that is only relevant if something is both in /lib and in the user jar, telling Flink to prioritize what is in /lib.
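
As a rough sketch of that (the jar name and paths are assumptions; adjust
them to your Ignite version and Flink installation):

    # copy the driver jar (assuming the thin JDBC driver ships in
    # ignite-core) into Flink's lib directory on every node, then
    # restart the task managers so the parent classloader picks it up
    cp ignite-core-<version>.jar /opt/flink/lib/

With the driver loaded once from /lib it is no longer re-loaded by each
job's ChildFirstClassLoader, so job restarts stop piling up copies of its
classes in metaspace.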



On 26/04/2022 15:35, John Smith wrote:
So I put classloader.parent-first-patterns.additional: "org.apache.ignite." in the task manager config, and so far I don't think I'm getting "java.lang.OutOfMemoryError: Metaspace" any more.

Or it's too early to tell.

Though now, the task managers are shutting down due to some other failures.

So maybe, because tasks were failing and reloading often, the task manager was running out of Metaspace. But now maybe it's just shutting down cleanly.

On Wed, Apr 20, 2022 at 11:35 AM John Smith <java.dev....@gmail.com> wrote:

    Or can I put something in the config to treat org.apache.ignite.
    classes as parent-first?

    On Tue, Apr 19, 2022 at 10:18 PM John Smith
    <java.dev....@gmail.com> wrote:

        Ok, so I loaded the dump into Eclipse MAT and followed:
        
https://cwiki.apache.org/confluence/display/FLINK/Debugging+ClassLoader+leaks

        - On the Histogram, I got over 30 entries for:
        ChildFirstClassLoader
        - Then I clicked on one of them "Merge Shortest Path..." and
        picked "Exclude all phantom/weak/soft references"
        - Which then gave me: SqlDriverManager > Apache Ignite
        JdbcThin Driver

        So I'm guessing it's anything JDBC based. Should I copy the
        driver into the task manager's lib folder and make the
        dependency compile-only in my jobs?
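
        Roughly what I mean, as a sketch (the jar name and paths here
        are just examples, not my real ones):

            # after switching the ignite dependency to provided /
            # compileOnly scope, check the rebuilt fat jar no longer
            # bundles the driver classes
            jar tf my-etl-job-all.jar | grep -i 'org/apache/ignite' \
                || echo "driver not bundled"

            # and make sure the driver jar sits in the task manager's
            # lib folder instead
            ls /opt/flink/lib/ | grep -i ignite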

        On Tue, Apr 19, 2022 at 12:18 PM Yaroslav Tkachenko
        <yaros...@goldsky.io> wrote:

            Also
            
https://shopify.engineering/optimizing-apache-flink-applications-tips
            might be helpful (has a section on profiling, as well as
            classloading).

            On Tue, Apr 19, 2022 at 4:35 AM Chesnay Schepler
            <ches...@apache.org> wrote:

                We have a very rough "guide" in the wiki (it's just
                the specific steps I took to debug another leak):
                
https://cwiki.apache.org/confluence/display/FLINK/Debugging+ClassLoader+leaks

                On 19/04/2022 12:01, huweihua wrote:
                Hi, John

                 Sorry for the late reply. You can use MAT [1] to
                 analyze the dump file and check whether too many
                 classes have been loaded.

                [1] https://www.eclipse.org/mat/

                 On Apr 18, 2022, at 9:55 PM, John Smith
                 <java.dev....@gmail.com> wrote:

                 Hi, can anyone help with this? I've never looked at
                 a dump file before.

                On Thu, Apr 14, 2022 at 11:59 AM John Smith
                <java.dev....@gmail.com> wrote:

                    Hi, so I have a dump file. What do I look for?

                    On Thu, Mar 31, 2022 at 3:28 PM John Smith
                    <java.dev....@gmail.com> wrote:

                        Ok so if there's a leak, if I manually stop
                        the job and restart it from the UI multiple
                        times, I won't see the issue because the
                        classes are unloaded correctly?


                        On Thu, Mar 31, 2022 at 9:20 AM huweihua
                        <huweihua....@gmail.com> wrote:


                            The difference is that manually
                            canceling the job stops the JobMaster,
                            but automatic failover keeps the
                            JobMaster running. Looking at the
                            TaskManager, though, it doesn't make
                            much difference.


                            On Mar 31, 2022, at 4:01 AM, John Smith
                            <java.dev....@gmail.com> wrote:

                            Also, if I manually cancel and restart
                            the same job over and over, is it the
                            same as if Flink was restarting a job
                            due to failure?

                            I.e. when I click "Cancel Job" in the
                            UI, is the job completely unloaded, vs.
                            when the scheduler restarts a job for
                            whatever reason?

                            Like this I'll stop and restart the job
                            a few times, or maybe I can trick my job
                            into failing and have the scheduler
                            restart it. Ok let me think about this...

                            On Wed, Mar 30, 2022 at 10:24 AM 胡伟华
                            <huweihua....@gmail.com> wrote:

                                 > So if I run the same jobs in my
                                 > dev env will I still be able to
                                 > see a similar dump?
                                 I think running the same job in dev
                                 should reproduce it; maybe you can
                                 have a try.

                                 > If not I would have to wait for a
                                 > low volume time to do it on
                                 > production. Also if I recall the
                                 > dump is as big as the JVM memory,
                                 > right? So if I have 10GB
                                 > configured for the JVM the dump
                                 > will be a 10GB file?
                                 Yes, jmap will pause the JVM; the
                                 length of the pause depends on the
                                 size of the dump. You can use "jmap
                                 -dump:live" to dump only the
                                 reachable objects, which keeps the
                                 pause brief.
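
                                 For example, something like this
                                 (the pid and file path are
                                 placeholders):

                                 # dump only live (reachable) objects
                                 jmap -dump:live,format=b,file=/tmp/tm.hprof <pid>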



                                 On Mar 30, 2022, at 9:47 PM, John
                                 Smith <java.dev....@gmail.com> wrote:

                                 I have 3 task managers (see config
                                 below). There is a total of 10 jobs
                                 with 25 slots being used.
                                 The jobs are 100% ETL, i.e. they
                                 load JSON, transform it and push
                                 it to JDBC; only 1 job of the 10
                                 is pushing to an Apache Ignite
                                 cluster.

                                 As for jmap: I know that it will
                                 pause the task manager. So if I
                                 run the same jobs in my dev env
                                 will I still be able to see a
                                 similar dump? I assume so. If not
                                 I would have to wait for a low
                                 volume time to do it on
                                 production. Also, if I recall, the
                                 dump is as big as the JVM memory,
                                 right? So if I have 10GB configured
                                 for the JVM the dump will be a
                                 10GB file?


                                # Operating system has 16GB total.
                                env.ssh.opts: -l flink
                                -oStrictHostKeyChecking=no

                                cluster.evenly-spread-out-slots: true

                                taskmanager.memory.flink.size: 10240m
                                taskmanager.memory.jvm-metaspace.size:
                                2048m
                                taskmanager.numberOfTaskSlots: 16
                                parallelism.default: 1

                                high-availability: zookeeper
                                high-availability.storageDir:
                                file:///mnt/flink/ha/flink_1_14/
                                high-availability.zookeeper.quorum:
                                ...
                                high-availability.zookeeper.path.root:
                                /flink_1_14
                                high-availability.cluster-id:
                                /flink_1_14_cluster_0001

                                web.upload.dir:
                                /mnt/flink/uploads/flink_1_14

                                state.backend: rocksdb
                                state.backend.incremental: true
                                state.checkpoints.dir:
                                file:///mnt/flink/checkpoints/flink_1_14
                                state.savepoints.dir:
                                file:///mnt/flink/savepoints/flink_1_14

                                On Wed, Mar 30, 2022 at 2:16 AM
                                胡伟华 <huweihua....@gmail.com> wrote:

                                    Hi, John

                                     Could you tell us your
                                     application scenario? Is it a
                                     Flink session cluster with a
                                     lot of jobs?

                                     Maybe you can try to dump the
                                     memory with jmap and use tools
                                     such as MAT to analyze whether
                                     there are abnormal classes and
                                     classloaders.
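
                                     For example (the pid and the
                                     dump path are placeholders):

                                     # find the TaskManager pid
                                     jps -l
                                     # write a heap dump to load into MAT
                                     jmap -dump:format=b,file=/tmp/tm.hprof <pid>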


                                    > On Mar 30, 2022, at 6:09 AM,
                                    > John Smith
                                    > <java.dev....@gmail.com>
                                    > wrote:
                                    >
                                    > Hi, running 1.14.4.
                                    >
                                    > My task managers still fail
                                    > with
                                    > java.lang.OutOfMemoryError:
                                    > Metaspace. The metaspace
                                    > out-of-memory error has
                                    > occurred. This can mean two
                                    > things: either the job
                                    > requires a larger size of JVM
                                    > metaspace to load classes or
                                    > there is a class loading leak.
                                    >
                                    > I have 2GB of metaspace
                                    > configured:
                                    > taskmanager.memory.jvm-metaspace.size:
                                    > 2048m
                                    >
                                    > But the task nodes still fail.
                                    >
                                    > When looking at the UI
                                    > metrics, the metaspace starts
                                    > low. Now I see 85% usage. It
                                    > seems to be a class loading
                                    > leak at this point; how can we
                                    > debug this issue?




