Pretty sure, even though I seemingly documented it incorrectly :)
On 28/04/2022 15:49, John Smith wrote:
You sure?

* /JDBC/: JDBC drivers leak references outside the user code classloader. To ensure that these classes are only loaded once you should either add the driver jars to Flink’s |lib/| folder, or add the driver classes to the list of parent-first loaded class via |classloader.parent-first-patterns-additional| <https://nightlies.apache.org/flink/flink-docs-master/docs/deployment/config/#classloader-parent-first-patterns-additional>.

It says either or.

On Wed, Apr 27, 2022 at 3:44 AM Chesnay Schepler <[email protected]> wrote:

You're misinterpreting the docs. The parent/child-first classloading controls where Flink looks for a class /first/, specifically whether we first load from /lib or the user-jar. It does not allow you to load something from the user-jar in the parent classloader. That's just not how it works. It must be in /lib.

On 27/04/2022 04:59, John Smith wrote:

Hi Chesnay as per the docs... https://nightlies.apache.org/flink/flink-docs-master/docs/ops/debugging/debugging_classloading/

You can either put the jars in task manager lib folder or use |classloader.parent-first-patterns-additional| <https://nightlies.apache.org/flink/flink-docs-master/docs/deployment/config/#classloader-parent-first-patterns-additional>

I prefer the latter like this: the dependency stays with the user-jar and not on the task manager.

On Tue, Apr 26, 2022 at 9:52 PM John Smith <[email protected]> wrote:

Ok so I should put the Apache ignite and my Microsoft drivers in the lib folders of my task managers? And then in my job jar only include them as compile time dependencies?

On Tue, Apr 26, 2022 at 10:42 AM Chesnay Schepler <[email protected]> wrote:

JDBC drivers are well-known for leaking classloaders unfortunately. You have correctly identified your alternatives. You must put the jdbc driver into /lib instead. Setting only the parent-first pattern shouldn't affect anything.
That is only relevant if something is both in /lib and in the user-jar, telling Flink to prioritize what is in lib.

On 26/04/2022 15:35, John Smith wrote:

So I put classloader.parent-first-patterns.additional: "org.apache.ignite." in the task config and so far I don't think I'm getting "java.lang.OutOfMemoryError: Metaspace" any more. Or it's too early to tell. Though now, the task managers are shutting down due to some other failures. So maybe because tasks were failing and reloading often the task manager was running out of Metaspace. But now maybe it's just cleanly shutting down.

On Wed, Apr 20, 2022 at 11:35 AM John Smith <[email protected]> wrote:

Or I can put in the config to treat org.apache.ignite. classes as parent-first?

On Tue, Apr 19, 2022 at 10:18 PM John Smith <[email protected]> wrote:

Ok, so I loaded the dump into Eclipse MAT and followed: https://cwiki.apache.org/confluence/display/FLINK/Debugging+ClassLoader+leaks

- On the Histogram, I got over 30 entries for: ChildFirstClassLoader
- Then I clicked on one of them "Merge Shortest Path..." and picked "Exclude all phantom/weak/soft references"
- Which then gave me: SqlDriverManager > Apache Ignite JdbcThin Driver

So I'm guessing this applies to anything JDBC based. Should I copy the drivers into the task manager lib folder and make them compile-only dependencies in my jobs?

On Tue, Apr 19, 2022 at 12:18 PM Yaroslav Tkachenko <[email protected]> wrote:

Also https://shopify.engineering/optimizing-apache-flink-applications-tips might be helpful (has a section on profiling, as well as classloading).

On Tue, Apr 19, 2022 at 4:35 AM Chesnay Schepler <[email protected]> wrote:

We have a very rough "guide" in the wiki (it's just the specific steps I took to debug another leak): https://cwiki.apache.org/confluence/display/FLINK/Debugging+ClassLoader+leaks

On 19/04/2022 12:01, huweihua wrote:

Hi, John

Sorry for the late reply. You can use MAT[1] to analyze the dump file. Check whether there are too many loaded classes.
[1] https://www.eclipse.org/mat/

On Apr 18, 2022, at 9:55 PM, John Smith <[email protected]> wrote:

Hi, can anyone help with this? I never looked at a dump file before.

On Thu, Apr 14, 2022 at 11:59 AM John Smith <[email protected]> wrote:

Hi, so I have a dump file. What do I look for?

On Thu, Mar 31, 2022 at 3:28 PM John Smith <[email protected]> wrote:

Ok so if there's a leak, if I manually stop the job and restart it from the UI multiple times, I won't see the issue because the classes are unloaded correctly?

On Thu, Mar 31, 2022 at 9:20 AM huweihua <[email protected]> wrote:

The difference is that manually canceling the job stops the JobMaster, but automatic failover keeps the JobMaster running. On the TaskManager side, though, it doesn't make much difference.

On Mar 31, 2022, at 4:01 AM, John Smith <[email protected]> wrote:

Also if I manually cancel and restart the same job over and over is it the same as if Flink was restarting a job due to failure?

I.e.: When I click "Cancel Job" on the UI is the job completely unloaded vs when the job scheduler restarts a job for whatever reason?

Like this I'll stop and restart the job a few times or maybe I can trick my job to fail and have the scheduler restart it. Ok let me think about this...

On Wed, Mar 30, 2022 at 10:24 AM 胡伟华 <[email protected]> wrote:

> So if I run the same jobs in my dev env will I still be able to see the similar dump?

I think running the same job in dev should reproduce it; maybe you can give it a try.

> If not I would have to wait at a low volume time to do it on production. Also if I recall the dump is as big as the JVM memory right so if I have 10GB configured for the JVM the dump will be 10GB file?

Yes, jmap will pause the JVM, and the length of the pause depends on the size of the dump. You can use "jmap -dump:live" to dump only the reachable objects; this will only take a brief pause.

On Mar 30, 2022, at 9:47 PM, John Smith <[email protected]> wrote:

I have 3 task managers (see config below). There is a total of 10 jobs with 25 slots being used.
The jobs are 100% ETL, i.e. they load JSON, transform it and push it to JDBC; only 1 job of the 10 is pushing to an Apache Ignite cluster.

For jmap: I know that it will pause the task manager. So if I run the same jobs in my dev env will I still be able to see the similar dump? I assume so. If not I would have to wait at a low volume time to do it on production. Also if I recall the dump is as big as the JVM memory right so if I have 10GB configured for the JVM the dump will be 10GB file?

# Operating system has 16GB total.
env.ssh.opts: -l flink -oStrictHostKeyChecking=no
cluster.evenly-spread-out-slots: true
taskmanager.memory.flink.size: 10240m
taskmanager.memory.jvm-metaspace.size: 2048m
taskmanager.numberOfTaskSlots: 16
parallelism.default: 1
high-availability: zookeeper
high-availability.storageDir: file:///mnt/flink/ha/flink_1_14/
high-availability.zookeeper.quorum: ...
high-availability.zookeeper.path.root: /flink_1_14
high-availability.cluster-id: /flink_1_14_cluster_0001
web.upload.dir: /mnt/flink/uploads/flink_1_14
state.backend: rocksdb
state.backend.incremental: true
state.checkpoints.dir: file:///mnt/flink/checkpoints/flink_1_14
state.savepoints.dir: file:///mnt/flink/savepoints/flink_1_14

On Wed, Mar 30, 2022 at 2:16 AM 胡伟华 <[email protected]> wrote:

Hi, John

Could you tell us your application scenario? Is it a Flink session cluster with a lot of jobs?

Maybe you can try to dump the memory with jmap and use tools such as MAT to analyze whether there are abnormal classes and classloaders.

> On Mar 30, 2022, at 6:09 AM, John Smith <[email protected]> wrote:
>
> Hi running 1.14.4
>
> My task manager still fails with java.lang.OutOfMemoryError: Metaspace. The metaspace out-of-memory error has occurred. This can mean two things: either the job requires a larger size of JVM metaspace to load classes or there is a class loading leak.
>
> I have 2GB of metaspace configured: taskmanager.memory.jvm-metaspace.size: 2048m
>
> But the task nodes still fail.
>
> When looking at the UI metrics, the metaspace starts low. Now I see 85% usage. It seems to be a class loading leak at this point. How can we debug this issue?
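
Note for anyone following the steps in this thread: here is a minimal shell sketch of the heap-dump step suggested above. Only "jmap -dump:live" comes from the thread. The format=b and file= parts are the usual jmap dump sub-options, and the helper name heap_dump_cmd, the PID and the output path are hypothetical placeholders. The helper prints the command rather than running it, so it can be checked before a task manager is paused.

```shell
#!/bin/sh
# Sketch only: heap_dump_cmd is a hypothetical helper, not part of Flink or the JDK.
# It builds the jmap invocation for a dump of live (reachable) objects only,
# which is the variant recommended in the thread to keep the JVM pause short.

# heap_dump_cmd PID OUTFILE
#   PID     - process id of the task manager JVM (assumed to be known already)
#   OUTFILE - where the .hprof file should be written
heap_dump_cmd() {
  printf 'jmap -dump:live,format=b,file=%s %s\n' "$2" "$1"
}

# Example with placeholder values; prints the command instead of executing it.
heap_dump_cmd 12345 /tmp/taskmanager.hprof
```

Running the printed command on the task manager host should produce an .hprof file that can be opened in MAT. As noted above, the JVM is paused while the dump is written, so a low-traffic window or a dev environment is the safer place to try it.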
