OK, so to summarize:

- Build my job jar with the JDBC driver as a compile-only dependency, and copy the JDBC driver to the Flink lib folder.
Or:

- Build my job jar with the JDBC driver included in the shadow jar, also copy the JDBC driver to the Flink lib folder, and add an entry in the config for classloader.parent-first-patterns.additional
  <https://nightlies.apache.org/flink/flink-docs-master/docs/deployment/config/#classloader-parent-first-patterns-additional>

On Thu, Apr 28, 2022 at 10:17 AM Chesnay Schepler <ches...@apache.org> wrote:

> I think what I meant was "either add it to /lib, or [if it is already in /lib but also bundled in the jar] add it to the parent-first patterns."
>
> On 28/04/2022 15:56, Chesnay Schepler wrote:
>
> Pretty sure, even though I seemingly documented it incorrectly :)
>
> On 28/04/2022 15:49, John Smith wrote:
>
> You sure?
>
> - *JDBC*: JDBC drivers leak references outside the user code classloader. To ensure that these classes are only loaded once you should either add the driver jars to Flink’s lib/ folder, or add the driver classes to the list of parent-first loaded class via classloader.parent-first-patterns-additional
>   <https://nightlies.apache.org/flink/flink-docs-master/docs/deployment/config/#classloader-parent-first-patterns-additional>.
>
> It says either/or.
>
> On Wed, Apr 27, 2022 at 3:44 AM Chesnay Schepler <ches...@apache.org> wrote:
>
>> You're misinterpreting the docs.
>>
>> The parent/child-first classloading controls where Flink looks for a class *first*, specifically whether we first load from /lib or the user-jar.
>> It does not allow you to load something from the user-jar in the parent classloader. That's just not how it works.
>>
>> It must be in /lib.
>>
>> On 27/04/2022 04:59, John Smith wrote:
>>
>> Hi Chesnay as per the docs...
>> https://nightlies.apache.org/flink/flink-docs-master/docs/ops/debugging/debugging_classloading/
>>
>> You can either put the jars in the task manager lib folder or use classloader.parent-first-patterns.additional
>> <https://nightlies.apache.org/flink/flink-docs-master/docs/deployment/config/#classloader-parent-first-patterns-additional>
>>
>> I prefer the latter: that way the dependency stays with the user-jar and not on the task manager.
>>
>> On Tue, Apr 26, 2022 at 9:52 PM John Smith <java.dev....@gmail.com> wrote:
>>
>>> Ok so I should put the Apache Ignite and my Microsoft drivers in the lib folders of my task managers?
>>>
>>> And then in my job jar only include them as compile-time dependencies?
>>>
>>> On Tue, Apr 26, 2022 at 10:42 AM Chesnay Schepler <ches...@apache.org> wrote:
>>>
>>>> JDBC drivers are well-known for leaking classloaders unfortunately.
>>>>
>>>> You have correctly identified your alternatives.
>>>>
>>>> You must put the jdbc driver into /lib instead. Setting only the parent-first pattern shouldn't affect anything.
>>>> That is only relevant if something is both in /lib and in the user-jar, telling Flink to prioritize what is in lib.
>>>>
>>>> On 26/04/2022 15:35, John Smith wrote:
>>>>
>>>> So I put classloader.parent-first-patterns.additional: "org.apache.ignite." in the task config and so far I don't think I'm getting "java.lang.OutOfMemoryError: Metaspace" any more.
>>>>
>>>> Or it's too early to tell.
>>>>
>>>> Though now, the task managers are shutting down due to some other failures.
>>>>
>>>> So maybe because tasks were failing and reloading often, the task manager was running out of Metaspace. But now maybe it's just cleanly shutting down.
>>>>
>>>> On Wed, Apr 20, 2022 at 11:35 AM John Smith <java.dev....@gmail.com> wrote:
>>>>
>>>>> Or I can put in the config to treat org.apache.ignite. classes as first class?
>>>>>
>>>>> On Tue, Apr 19, 2022 at 10:18 PM John Smith <java.dev....@gmail.com> wrote:
>>>>>
>>>>>> Ok, so I loaded the dump into Eclipse MAT and followed:
>>>>>> https://cwiki.apache.org/confluence/display/FLINK/Debugging+ClassLoader+leaks
>>>>>>
>>>>>> - On the Histogram, I got over 30 entries for: ChildFirstClassLoader
>>>>>> - Then I clicked on one of them "Merge Shortest Path..." and picked "Exclude all phantom/weak/soft references"
>>>>>> - Which then gave me: SqlDriverManager > Apache Ignite JdbcThin Driver
>>>>>>
>>>>>> So I'm guessing that anything JDBC-based I should copy into the task manager libs folder, and in my jobs make the dependencies compile-only?
>>>>>>
>>>>>> On Tue, Apr 19, 2022 at 12:18 PM Yaroslav Tkachenko <yaros...@goldsky.io> wrote:
>>>>>>
>>>>>>> Also https://shopify.engineering/optimizing-apache-flink-applications-tips might be helpful (has a section on profiling, as well as classloading).
>>>>>>>
>>>>>>> On Tue, Apr 19, 2022 at 4:35 AM Chesnay Schepler <ches...@apache.org> wrote:
>>>>>>>
>>>>>>>> We have a very rough "guide" in the wiki (it's just the specific steps I took to debug another leak):
>>>>>>>>
>>>>>>>> https://cwiki.apache.org/confluence/display/FLINK/Debugging+ClassLoader+leaks
>>>>>>>>
>>>>>>>> On 19/04/2022 12:01, huweihua wrote:
>>>>>>>>
>>>>>>>> Hi, John
>>>>>>>>
>>>>>>>> Sorry for the late reply. You can use MAT[1] to analyze the dump file. Check whether there are too many loaded classes.
>>>>>>>>
>>>>>>>> [1] https://www.eclipse.org/mat/
>>>>>>>>
>>>>>>>> On April 18, 2022, at 9:55 PM, John Smith <java.dev....@gmail.com> wrote:
>>>>>>>>
>>>>>>>> Hi, can anyone help with this? I never looked at a dump file before.
>>>>>>>>
>>>>>>>> On Thu, Apr 14, 2022 at 11:59 AM John Smith <java.dev....@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> Hi, so I have a dump file. What do I look for?
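A side note on the MAT result quoted above (a path from the driver manager to the Ignite JDBC thin driver): it fits the pattern where java.sql.DriverManager lives in a parent classloader and keeps a reference to a driver that was loaded by the job's classloader. Below is a minimal Java sketch, assuming it is called from inside the job's user code, that lists the JDBC drivers visible to the caller and the classloader that loaded each one. The class name RegisteredDrivers is made up for illustration and is not part of Flink.

```java
import java.sql.Driver;
import java.sql.DriverManager;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class RegisteredDrivers {

    // One entry per JDBC driver the caller can see, in the form
    // "<driver class> <- <classloader class>". DriverManager.getDrivers()
    // only returns drivers that are accessible to the calling code.
    static List<String> describe() {
        List<String> lines = new ArrayList<>();
        for (Driver driver : Collections.list(DriverManager.getDrivers())) {
            ClassLoader loader = driver.getClass().getClassLoader();
            // A null classloader means the class came from the bootstrap loader.
            String loaderName = (loader == null) ? "bootstrap" : loader.getClass().getName();
            lines.add(driver.getClass().getName() + " <- " + loaderName);
        }
        return lines;
    }

    public static void main(String[] args) {
        describe().forEach(System.out::println);
    }
}
```

If a driver shows up here with a ChildFirstClassLoader as its loader, it would presumably have been loaded from the job jar rather than from /lib, which is the case the advice in this thread is meant to avoid.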
>>>>>>>>>
>>>>>>>>> On Thu, Mar 31, 2022 at 3:28 PM John Smith <java.dev....@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>>> Ok so if there's a leak, if I manually stop the job and restart it from the UI multiple times, I won't see the issue because the classes are unloaded correctly?
>>>>>>>>>>
>>>>>>>>>> On Thu, Mar 31, 2022 at 9:20 AM huweihua <huweihua....@gmail.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> The difference is that manually canceling the job stops the JobMaster, but automatic failover keeps the JobMaster running. But looking at the TaskManager, it doesn't make much difference.
>>>>>>>>>>>
>>>>>>>>>>> On March 31, 2022, at 4:01 AM, John Smith <java.dev....@gmail.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>> Also, if I manually cancel and restart the same job over and over, is it the same as if Flink was restarting a job due to failure?
>>>>>>>>>>>
>>>>>>>>>>> I.e.: when I click "Cancel Job" on the UI, is the job completely unloaded, vs. when the job scheduler restarts a job for whatever reason?
>>>>>>>>>>>
>>>>>>>>>>> Like this I'll stop and restart the job a few times, or maybe I can trick my job to fail and have the scheduler restart it. Ok let me think about this...
>>>>>>>>>>>
>>>>>>>>>>> On Wed, Mar 30, 2022 at 10:24 AM 胡伟华 <huweihua....@gmail.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> So if I run the same jobs in my dev env will I still be able to see the similar dump?
>>>>>>>>>>>>
>>>>>>>>>>>> I think running the same job in dev should be reproducible, maybe you can have a try.
>>>>>>>>>>>>
>>>>>>>>>>>> If not I would have to wait at a low volume time to do it on production. Also if I recall the dump is as big as the JVM memory, right, so if I have 10GB configured for the JVM the dump will be a 10GB file?
>>>>>>>>>>>>
>>>>>>>>>>>> Yes, jmap will pause the JVM; the length of the pause depends on the size of the dump. You can use "jmap -dump:live" to dump only the reachable objects, which will take a brief pause.
>>>>>>>>>>>>
>>>>>>>>>>>> On March 30, 2022, at 9:47 PM, John Smith <java.dev....@gmail.com> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>> I have 3 task managers (see config below). There is a total of 10 jobs with 25 slots being used.
>>>>>>>>>>>> The jobs are 100% ETL, i.e. they load JSON, transform it and push it to JDBC; only 1 job of the 10 is pushing to an Apache Ignite cluster.
>>>>>>>>>>>>
>>>>>>>>>>>> For jmap: I know that it will pause the task manager. So if I run the same jobs in my dev env will I still be able to see the similar dump? I assume so. If not I would have to wait at a low volume time to do it on production. Also if I recall the dump is as big as the JVM memory, right, so if I have 10GB configured for the JVM the dump will be a 10GB file?
>>>>>>>>>>>>
>>>>>>>>>>>> # Operating system has 16GB total.
>>>>>>>>>>>> env.ssh.opts: -l flink -oStrictHostKeyChecking=no
>>>>>>>>>>>>
>>>>>>>>>>>> cluster.evenly-spread-out-slots: true
>>>>>>>>>>>>
>>>>>>>>>>>> taskmanager.memory.flink.size: 10240m
>>>>>>>>>>>> taskmanager.memory.jvm-metaspace.size: 2048m
>>>>>>>>>>>> taskmanager.numberOfTaskSlots: 16
>>>>>>>>>>>> parallelism.default: 1
>>>>>>>>>>>>
>>>>>>>>>>>> high-availability: zookeeper
>>>>>>>>>>>> high-availability.storageDir: file:///mnt/flink/ha/flink_1_14/
>>>>>>>>>>>> high-availability.zookeeper.quorum: ...
>>>>>>>>>>>> high-availability.zookeeper.path.root: /flink_1_14
>>>>>>>>>>>> high-availability.cluster-id: /flink_1_14_cluster_0001
>>>>>>>>>>>>
>>>>>>>>>>>> web.upload.dir: /mnt/flink/uploads/flink_1_14
>>>>>>>>>>>>
>>>>>>>>>>>> state.backend: rocksdb
>>>>>>>>>>>> state.backend.incremental: true
>>>>>>>>>>>> state.checkpoints.dir: file:///mnt/flink/checkpoints/flink_1_14
>>>>>>>>>>>> state.savepoints.dir: file:///mnt/flink/savepoints/flink_1_14
>>>>>>>>>>>>
>>>>>>>>>>>> On Wed, Mar 30, 2022 at 2:16 AM 胡伟华 <huweihua....@gmail.com> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Hi, John
>>>>>>>>>>>>>
>>>>>>>>>>>>> Could you tell us your application scenario? Is it a Flink session cluster with a lot of jobs?
>>>>>>>>>>>>>
>>>>>>>>>>>>> Maybe you can try to dump the memory with jmap and use tools such as MAT to analyze whether there are abnormal classes and classloaders.
>>>>>>>>>>>>>
>>>>>>>>>>>>> > On March 30, 2022, at 6:09 AM, John Smith <java.dev....@gmail.com> wrote:
>>>>>>>>>>>>> >
>>>>>>>>>>>>> > Hi, running 1.14.4
>>>>>>>>>>>>> >
>>>>>>>>>>>>> > My task manager still fails with java.lang.OutOfMemoryError: Metaspace. The metaspace out-of-memory error has occurred. This can mean two things: either the job requires a larger size of JVM metaspace to load classes or there is a class loading leak.
>>>>>>>>>>>>> >
>>>>>>>>>>>>> > I have 2GB of metaspace configured: taskmanager.memory.jvm-metaspace.size: 2048m
>>>>>>>>>>>>> >
>>>>>>>>>>>>> > But the task nodes still fail.
>>>>>>>>>>>>> >
>>>>>>>>>>>>> > When looking at the UI metrics, the metaspace starts low. Now I see 85% usage. It seems to be a class loading leak at this point, how can we debug this issue?
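As a complement to the UI metrics mentioned in the original report, here is a minimal Java sketch that reads Metaspace usage and class load/unload counters from the standard java.lang.management beans. The class name MetaspaceProbe is made up for illustration, and the pool name "Metaspace" is the HotSpot name, so it is an assumption for other JVMs. Nothing here is Flink-specific.

```java
import java.lang.management.ClassLoadingMXBean;
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryPoolMXBean;
import java.lang.management.MemoryUsage;

public class MetaspaceProbe {

    // Bytes used by the memory pool named "Metaspace" (HotSpot naming,
    // assumed here). Returns -1 if no such pool is found.
    static long metaspaceUsedBytes() {
        for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
            if ("Metaspace".equals(pool.getName())) {
                MemoryUsage usage = pool.getUsage();
                return (usage == null) ? -1L : usage.getUsed();
            }
        }
        return -1L;
    }

    // Currently loaded, total loaded and unloaded class counts for this JVM.
    static String classCounts() {
        ClassLoadingMXBean bean = ManagementFactory.getClassLoadingMXBean();
        return "loaded=" + bean.getLoadedClassCount()
                + " totalLoaded=" + bean.getTotalLoadedClassCount()
                + " unloaded=" + bean.getUnloadedClassCount();
    }

    public static void main(String[] args) {
        System.out.println("metaspaceUsedBytes=" + metaspaceUsedBytes());
        System.out.println(classCounts());
    }
}
```

If the loaded count keeps climbing across job restarts while the unloaded count barely moves, that would be consistent with a class loading leak rather than a Metaspace size that is simply too small, though a heap dump is still needed to find what is holding the classloaders.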