Re: How to debug Metaspace exception?

John Smith Mon, 02 May 2022 12:19:01 -0700

Ok, I don't think I'm running user code on the job manager. Basically, I'm
running a standalone cluster.


3 zookeepers
3 job managers
3 task managers.

I submit my jobs via the UI.

But just in case, I'll copy the config over to the job managers.
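
For reference, a minimal sketch of the entry that would go on every job
manager and task manager. It assumes the Ignite and SQL Server driver jars
are already in each node's lib folder, and it uses the pattern discussed in
this thread. The processes read flink-conf.yaml at startup, so they would
need a restart to pick it up.

```yaml
# flink-conf.yaml: same entry on all 3 job managers and all 3 task managers.
# It only matters for classes that are present in lib/; it tells Flink to
# prefer the copy in lib/ over one bundled in the user jar.
classloader.parent-first-patterns.additional: "org.apache.ignite.;com.microsoft.sqlserver.jdbc."
```

If no job jar bundles the drivers, this entry can be skipped and the jars in
lib are enough.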



On Mon, May 2, 2022 at 11:00 AM Chesnay Schepler <ches...@apache.org> wrote:

> There are cases where user-code is run on the JobManager.
> I'm not sure, though, whether that applies to the JDBC sources.
>
> On 02/05/2022 15:45, John Smith wrote:
>
> Why do the JDBC jars need to be on the job manager node though?
>
> On Mon, May 2, 2022 at 9:36 AM Chesnay Schepler <ches...@apache.org>
> wrote:
>
>> yes.
>> But if you can ensure that the driver isn't bundled by any user-jar you
>> can also skip the pattern configuration step.
>>
>> The pattern looks correct formatting-wise; you could try whether
>> com.microsoft.sqlserver.jdbc. is enough to solve the issue.
>>
>> On 02/05/2022 14:41, John Smith wrote:
>>
>> Oh, so I should copy the jars to the lib folder and
>> set classloader.parent-first-patterns.additional:
>> "org.apache.ignite.;com.microsoft.sqlserver.jdbc." to both the task
>> managers and job managers?
>>
>> Also is my pattern correct?
>> "org.apache.ignite.;com.microsoft.sqlserver.jdbc."
>>
>> Just to be sure I'm running a standalone cluster using zookeeper. So I
>> have 3 zookeepers, 3 job managers and 3 task managers.
>>
>>
>> On Mon, May 2, 2022 at 2:57 AM Chesnay Schepler <ches...@apache.org>
>> wrote:
>>
>>> And you should make sure that it is set for both processes!
>>>
>>> On 02/05/2022 08:43, Chesnay Schepler wrote:
>>>
>>> The setting itself isn't taskmanager specific; it applies to both the
>>> job- and taskmanager process.
>>>
>>> On 02/05/2022 05:29, John Smith wrote:
>>>
>>> Also just to be sure this is a Task Manager setting right?
>>>
>>> On Thu, Apr 28, 2022 at 11:13 AM John Smith <java.dev....@gmail.com>
>>> wrote:
>>>
>>>> I assume you will take action on your side to track and fix the doc? :)
>>>>
>>>> On Thu, Apr 28, 2022 at 11:12 AM John Smith <java.dev....@gmail.com>
>>>> wrote:
>>>>
>>>>> Ok so to summarize...
>>>>>
>>>>> - Build my job jar and have the JDBC driver as a compile only
>>>>> dependency and copy the JDBC driver to flink lib folder.
>>>>>
>>>>> Or
>>>>>
>>>>> - Build my job jar and include the JDBC driver in the shadow jar, plus
>>>>> copy the JDBC driver to the Flink lib folder, plus make an entry in
>>>>> config for classloader.parent-first-patterns.additional
>>>>> <https://nightlies.apache.org/flink/flink-docs-master/docs/deployment/config/#classloader-parent-first-patterns-additional>
>>>>>
>>>>>
>>>>> On Thu, Apr 28, 2022 at 10:17 AM Chesnay Schepler <ches...@apache.org>
>>>>> wrote:
>>>>>
>>>>>> I think what I meant was "either add it to /lib, or [if it is already
>>>>>> in /lib but also bundled in the jar] add it to the parent-first 
>>>>>> patterns."
>>>>>>
>>>>>> On 28/04/2022 15:56, Chesnay Schepler wrote:
>>>>>>
>>>>>> Pretty sure, even though I seemingly documented it incorrectly :)
>>>>>>
>>>>>> On 28/04/2022 15:49, John Smith wrote:
>>>>>>
>>>>>> You sure?
>>>>>>
>>>>>>    - *JDBC*: JDBC drivers leak references outside the user code
>>>>>>    classloader. To ensure that these classes are only loaded once you
>>>>>>    should either add the driver jars to Flink’s lib/ folder, or add the
>>>>>>    driver classes to the list of parent-first loaded class via
>>>>>>    classloader.parent-first-patterns-additional
>>>>>>    <https://nightlies.apache.org/flink/flink-docs-master/docs/deployment/config/#classloader-parent-first-patterns-additional>.
>>>>>>
>>>>>>    It says either or
>>>>>>
>>>>>>
>>>>>> On Wed, Apr 27, 2022 at 3:44 AM Chesnay Schepler <ches...@apache.org>
>>>>>> wrote:
>>>>>>
>>>>>>> You're misinterpreting the docs.
>>>>>>>
>>>>>>> The parent/child-first classloading controls where Flink looks for a
>>>>>>> class *first*, specifically whether we first load from /lib or the
>>>>>>> user-jar.
>>>>>>> It does not allow you to load something from the user-jar in the
>>>>>>> parent classloader. That's just not how it works.
>>>>>>>
>>>>>>> It must be in /lib.
>>>>>>>
>>>>>>> On 27/04/2022 04:59, John Smith wrote:
>>>>>>>
>>>>>>> Hi Chesnay as per the docs...
>>>>>>> https://nightlies.apache.org/flink/flink-docs-master/docs/ops/debugging/debugging_classloading/
>>>>>>>
>>>>>>> You can either put the jars in task manager lib folder or use
>>>>>>> classloader.parent-first-patterns-additional
>>>>>>> <https://nightlies.apache.org/flink/flink-docs-master/docs/deployment/config/#classloader-parent-first-patterns-additional>
>>>>>>>
>>>>>>> I prefer the latter; that way the dependency stays with the
>>>>>>> user-jar and not on the task manager.
>>>>>>>
>>>>>>> On Tue, Apr 26, 2022 at 9:52 PM John Smith <java.dev....@gmail.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Ok, so I should put the Apache Ignite and my Microsoft drivers in
>>>>>>>> the lib folders of my task managers?
>>>>>>>>
>>>>>>>> And then in my job jar only include them as compile time
>>>>>>>> dependencies?
>>>>>>>>
>>>>>>>>
>>>>>>>> On Tue, Apr 26, 2022 at 10:42 AM Chesnay Schepler <
>>>>>>>> ches...@apache.org> wrote:
>>>>>>>>
>>>>>>>>> JDBC drivers are well-known for leaking classloaders unfortunately.
>>>>>>>>>
>>>>>>>>> You have correctly identified your alternatives.
>>>>>>>>>
>>>>>>>>> You must put the jdbc driver into /lib instead. Setting only the
>>>>>>>>> parent-first pattern shouldn't affect anything.
>>>>>>>>> That is only relevant if something is both in /lib and the
>>>>>>>>> user-jar, telling Flink to prioritize what is in lib.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On 26/04/2022 15:35, John Smith wrote:
>>>>>>>>>
>>>>>>>>> So I put classloader.parent-first-patterns.additional:
>>>>>>>>> "org.apache.ignite." in the task config and so far I don't think I'm
>>>>>>>>> getting "java.lang.OutOfMemoryError: Metaspace" any more.
>>>>>>>>>
>>>>>>>>> Or it's too early to tell.
>>>>>>>>>
>>>>>>>>> Though now, the task managers are shutting down due to some
>>>>>>>>> other failures.
>>>>>>>>>
>>>>>>>>> So maybe because tasks were failing and reloading often the task
>>>>>>>>> manager was running out of Metaspace. But now maybe it's just
>>>>>>>>> cleanly shutting down.
>>>>>>>>>
>>>>>>>>> On Wed, Apr 20, 2022 at 11:35 AM John Smith <
>>>>>>>>> java.dev....@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>>> Or I can put in the config to treat org.apache.ignite. classes as
>>>>>>>>>> parent-first?
>>>>>>>>>>
>>>>>>>>>> On Tue, Apr 19, 2022 at 10:18 PM John Smith <
>>>>>>>>>> java.dev....@gmail.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> Ok, so I loaded the dump into Eclipse MAT and followed:
>>>>>>>>>>> https://cwiki.apache.org/confluence/display/FLINK/Debugging+ClassLoader+leaks
>>>>>>>>>>>
>>>>>>>>>>> - On the Histogram, I got over 30 entries for:
>>>>>>>>>>> ChildFirstClassLoader
>>>>>>>>>>> - Then I clicked on one of them "Merge Shortest Path..." and
>>>>>>>>>>> picked "Exclude all phantom/weak/soft references"
>>>>>>>>>>> - Which then gave me: SqlDriverManager > Apache Ignite JdbcThin
>>>>>>>>>>> Driver
>>>>>>>>>>>
>>>>>>>>>>> So I'm guessing this applies to anything JDBC-based. Should I copy
>>>>>>>>>>> the drivers into the task manager lib folder and make them
>>>>>>>>>>> compile-only dependencies in my jobs?
>>>>>>>>>>>
>>>>>>>>>>> On Tue, Apr 19, 2022 at 12:18 PM Yaroslav Tkachenko <
>>>>>>>>>>> yaros...@goldsky.io> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Also
>>>>>>>>>>>> https://shopify.engineering/optimizing-apache-flink-applications-tips
>>>>>>>>>>>> might be helpful (has a section on profiling, as well as 
>>>>>>>>>>>> classloading).
>>>>>>>>>>>>
>>>>>>>>>>>> On Tue, Apr 19, 2022 at 4:35 AM Chesnay Schepler <
>>>>>>>>>>>> ches...@apache.org> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> We have a very rough "guide" in the wiki (it's just the
>>>>>>>>>>>>> specific steps I took to debug another leak):
>>>>>>>>>>>>>
>>>>>>>>>>>>> https://cwiki.apache.org/confluence/display/FLINK/Debugging+ClassLoader+leaks
>>>>>>>>>>>>>
>>>>>>>>>>>>> On 19/04/2022 12:01, huweihua wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>> Hi, John
>>>>>>>>>>>>>
>>>>>>>>>>>>> Sorry for the late reply. You can use MAT[1] to analyze the
>>>>>>>>>>>>> dump file. Check whether there are too many loaded classes.
>>>>>>>>>>>>>
>>>>>>>>>>>>> [1] https://www.eclipse.org/mat/
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Apr 18, 2022, at 9:55 PM, John Smith <java.dev....@gmail.com> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>> Hi, can anyone help with this? I've never looked at a dump file
>>>>>>>>>>>>> before.
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Thu, Apr 14, 2022 at 11:59 AM John Smith <
>>>>>>>>>>>>> java.dev....@gmail.com> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Hi, so I have a dump file. What do I look for?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Thu, Mar 31, 2022 at 3:28 PM John Smith <
>>>>>>>>>>>>>> java.dev....@gmail.com> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Ok, so if there's a leak and I manually stop the job and
>>>>>>>>>>>>>>> restart it from the UI multiple times, I won't see the issue
>>>>>>>>>>>>>>> because the classes are unloaded correctly?
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Thu, Mar 31, 2022 at 9:20 AM huweihua <
>>>>>>>>>>>>>>> huweihua....@gmail.com> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> The difference is that manually canceling the job stops the
>>>>>>>>>>>>>>>> JobMaster, while automatic failover keeps the JobMaster running.
>>>>>>>>>>>>>>>> From the TaskManager's side, though, it doesn't make much difference.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On Mar 31, 2022, at 4:01 AM, John Smith <java.dev....@gmail.com> wrote:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Also, if I manually cancel and restart the same job over and
>>>>>>>>>>>>>>>> over, is it the same as if Flink was restarting a job due to
>>>>>>>>>>>>>>>> failure?
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> I.e., when I click "Cancel Job" on the UI, is the job
>>>>>>>>>>>>>>>> completely unloaded, compared with when the job scheduler
>>>>>>>>>>>>>>>> restarts a job for whatever reason?
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> That way I'll stop and restart the job a few times, or
>>>>>>>>>>>>>>>> maybe I can trick my job into failing and have the scheduler
>>>>>>>>>>>>>>>> restart it. Ok, let me think about this...
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On Wed, Mar 30, 2022 at 10:24 AM 胡伟华 <
>>>>>>>>>>>>>>>> huweihua....@gmail.com> wrote:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> So if I run the same jobs in my dev env will I still be
>>>>>>>>>>>>>>>>> able to see the similar dump?
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> I think running the same job in dev should be
>>>>>>>>>>>>>>>>> reproducible, maybe you can have a try.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>  If not I would have to wait for a low-volume time to do it
>>>>>>>>>>>>>>>>> on production. Also, if I recall, the dump is as big as the JVM
>>>>>>>>>>>>>>>>> memory, right? So if I have 10GB configured for the JVM the
>>>>>>>>>>>>>>>>> dump will be a 10GB file?
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Yes, jmap will pause the JVM; the length of the pause depends on
>>>>>>>>>>>>>>>>> the size of the dump. You can use "jmap -dump:live" to dump only
>>>>>>>>>>>>>>>>> the reachable objects, which takes only a brief pause.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> On Mar 30, 2022, at 9:47 PM, John Smith <java.dev....@gmail.com> wrote:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> I have 3 task managers (see config below). There is a total
>>>>>>>>>>>>>>>>> of 10 jobs with 25 slots being used.
>>>>>>>>>>>>>>>>> The jobs are 100% ETL, i.e. they load JSON, transform it
>>>>>>>>>>>>>>>>> and push it to JDBC. Only 1 job of the 10 is pushing to an
>>>>>>>>>>>>>>>>> Apache Ignite cluster.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> For jmap: I know that it will pause the task manager. So
>>>>>>>>>>>>>>>>> if I run the same jobs in my dev env, will I still be able to
>>>>>>>>>>>>>>>>> see a similar dump? I assume so. If not I would have to wait
>>>>>>>>>>>>>>>>> for a low-volume time to do it on production. Also, if I recall,
>>>>>>>>>>>>>>>>> the dump is as big as the JVM memory, right? So if I have 10GB
>>>>>>>>>>>>>>>>> configured for the JVM the dump will be a 10GB file?
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> # Operating system has 16GB total.
>>>>>>>>>>>>>>>>> env.ssh.opts: -l flink -oStrictHostKeyChecking=no
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> cluster.evenly-spread-out-slots: true
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> taskmanager.memory.flink.size: 10240m
>>>>>>>>>>>>>>>>> taskmanager.memory.jvm-metaspace.size: 2048m
>>>>>>>>>>>>>>>>> taskmanager.numberOfTaskSlots: 16
>>>>>>>>>>>>>>>>> parallelism.default: 1
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> high-availability: zookeeper
>>>>>>>>>>>>>>>>> high-availability.storageDir: file:///mnt/flink/ha/flink_1_14/
>>>>>>>>>>>>>>>>> high-availability.zookeeper.quorum: ...
>>>>>>>>>>>>>>>>> high-availability.zookeeper.path.root: /flink_1_14
>>>>>>>>>>>>>>>>> high-availability.cluster-id: /flink_1_14_cluster_0001
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> web.upload.dir: /mnt/flink/uploads/flink_1_14
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> state.backend: rocksdb
>>>>>>>>>>>>>>>>> state.backend.incremental: true
>>>>>>>>>>>>>>>>> state.checkpoints.dir: file:///mnt/flink/checkpoints/flink_1_14
>>>>>>>>>>>>>>>>> state.savepoints.dir: file:///mnt/flink/savepoints/flink_1_14
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> On Wed, Mar 30, 2022 at 2:16 AM 胡伟华 <
>>>>>>>>>>>>>>>>> huweihua....@gmail.com> wrote:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Hi, John
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Could you tell us your application scenario? Is it a Flink
>>>>>>>>>>>>>>>>>> session cluster with a lot of jobs?
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Maybe you can try to dump the memory with jmap and use
>>>>>>>>>>>>>>>>>> tools such as MAT to analyze whether there are abnormal
>>>>>>>>>>>>>>>>>> classes and classloaders.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> > On Mar 30, 2022, at 6:09 AM, John Smith <java.dev....@gmail.com> wrote:
>>>>>>>>>>>>>>>>>> >
>>>>>>>>>>>>>>>>>> > Hi, running 1.14.4.
>>>>>>>>>>>>>>>>>> >
>>>>>>>>>>>>>>>>>> > My task manager still fails with
>>>>>>>>>>>>>>>>>> > java.lang.OutOfMemoryError: Metaspace. The metaspace
>>>>>>>>>>>>>>>>>> > out-of-memory error has occurred. This can mean two things:
>>>>>>>>>>>>>>>>>> > either the job requires a larger size of JVM metaspace to
>>>>>>>>>>>>>>>>>> > load classes or there is a class loading leak.
>>>>>>>>>>>>>>>>>> >
>>>>>>>>>>>>>>>>>> > I have 2GB of metaspace configured:
>>>>>>>>>>>>>>>>>> > taskmanager.memory.jvm-metaspace.size: 2048m
>>>>>>>>>>>>>>>>>> >
>>>>>>>>>>>>>>>>>> > But the task nodes still fail.
>>>>>>>>>>>>>>>>>> >
>>>>>>>>>>>>>>>>>> > When looking at the UI metrics, the metaspace starts
>>>>>>>>>>>>>>>>>> > low. Now I see 85% usage. It seems to be a class loading
>>>>>>>>>>>>>>>>>> > leak at this point. How can we debug this issue?
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>
>>>>>>>
>>>>>>
>>>>>>
>>>
>>>
>>
>
