Oh, so I should copy the jars to the lib folder and set classloader.parent-first-patterns.additional: "org.apache.ignite.;com.microsoft.sqlserver.jdbc." to both the task managers and job managers?
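A minimal sketch of what such an entry could look like in flink-conf.yaml, assuming the two package prefixes from the question are the right ones for the Ignite and SQL Server JDBC drivers. Going by the replies quoted further down, the pattern only matters when a driver is both in lib/ and bundled in the job jar, and the same entry would go into the config of every JobManager and TaskManager.

```yaml
# flink-conf.yaml on every JobManager and TaskManager node (sketch only)
# Assumes the driver jars have also been copied into Flink's lib/ folder.
# Entries are package prefixes separated by semicolons.
classloader.parent-first-patterns.additional: "org.apache.ignite.;com.microsoft.sqlserver.jdbc."
```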
Also, is my pattern correct? "org.apache.ignite.;com.microsoft.sqlserver.jdbc."

Just to be sure: I'm running a standalone cluster using ZooKeeper. So I have 3 ZooKeepers, 3 job managers and 3 task managers.

On Mon, May 2, 2022 at 2:57 AM Chesnay Schepler <ches...@apache.org> wrote:

> And you should make sure that it is set for both processes!
>
> On 02/05/2022 08:43, Chesnay Schepler wrote:
>
> The setting itself isn't taskmanager specific; it applies to both the job- and taskmanager process.
>
> On 02/05/2022 05:29, John Smith wrote:
>
> Also just to be sure this is a Task Manager setting right?
>
> On Thu, Apr 28, 2022 at 11:13 AM John Smith <java.dev....@gmail.com> wrote:
>
>> I assume you will take action on your side to track and fix the doc? :)
>>
>> On Thu, Apr 28, 2022 at 11:12 AM John Smith <java.dev....@gmail.com> wrote:
>>
>>> Ok so to summarize...
>>>
>>> - Build my job jar and have the JDBC driver as a compile only dependency and copy the JDBC driver to flink lib folder.
>>>
>>> Or
>>>
>>> - Build my job jar and include JDBC driver in the shadow, plus copy the JDBC driver in the flink lib folder, plus make an entry in config for classloader.parent-first-patterns-additional
>>> <https://nightlies.apache.org/flink/flink-docs-master/docs/deployment/config/#classloader-parent-first-patterns-additional>
>>>
>>> On Thu, Apr 28, 2022 at 10:17 AM Chesnay Schepler <ches...@apache.org> wrote:
>>>
>>>> I think what I meant was "either add it to /lib, or [if it is already in /lib but also bundled in the jar] add it to the parent-first patterns."
>>>>
>>>> On 28/04/2022 15:56, Chesnay Schepler wrote:
>>>>
>>>> Pretty sure, even though I seemingly documented it incorrectly :)
>>>>
>>>> On 28/04/2022 15:49, John Smith wrote:
>>>>
>>>> You sure?
>>>>
>>>> - *JDBC*: JDBC drivers leak references outside the user code classloader.
>>>> To ensure that these classes are only loaded once you should either add the driver jars to Flink’s lib/ folder, or add the driver classes to the list of parent-first loaded class via classloader.parent-first-patterns-additional
>>>> <https://nightlies.apache.org/flink/flink-docs-master/docs/deployment/config/#classloader-parent-first-patterns-additional>.
>>>>
>>>> It says either or
>>>>
>>>> On Wed, Apr 27, 2022 at 3:44 AM Chesnay Schepler <ches...@apache.org> wrote:
>>>>
>>>>> You're misinterpreting the docs.
>>>>>
>>>>> The parent/child-first classloading controls where Flink looks for a class *first*, specifically whether we first load from /lib or the user-jar.
>>>>> It does not allow you to load something from the user-jar in the parent classloader. That's just not how it works.
>>>>>
>>>>> It must be in /lib.
>>>>>
>>>>> On 27/04/2022 04:59, John Smith wrote:
>>>>>
>>>>> Hi Chesnay as per the docs...
>>>>> https://nightlies.apache.org/flink/flink-docs-master/docs/ops/debugging/debugging_classloading/
>>>>>
>>>>> You can either put the jars in task manager lib folder or use classloader.parent-first-patterns-additional
>>>>> <https://nightlies.apache.org/flink/flink-docs-master/docs/deployment/config/#classloader-parent-first-patterns-additional>
>>>>>
>>>>> I prefer the latter like this: the dependency stays with the user-jar and not on the task manager.
>>>>>
>>>>> On Tue, Apr 26, 2022 at 9:52 PM John Smith <java.dev....@gmail.com> wrote:
>>>>>
>>>>>> Ok so I should put the Apache ignite and my Microsoft drivers in the lib folders of my task managers?
>>>>>>
>>>>>> And then in my job jar only include them as compile time dependencies?
>>>>>>
>>>>>> On Tue, Apr 26, 2022 at 10:42 AM Chesnay Schepler <ches...@apache.org> wrote:
>>>>>>
>>>>>>> JDBC drivers are well-known for leaking classloaders unfortunately.
>>>>>>>
>>>>>>> You have correctly identified your alternatives.
>>>>>>>
>>>>>>> You must put the jdbc driver into /lib instead. Setting only the parent-first pattern shouldn't affect anything.
>>>>>>> That is only relevant if something is both in /lib and the user-jar, telling Flink to prioritize what is in lib.
>>>>>>>
>>>>>>> On 26/04/2022 15:35, John Smith wrote:
>>>>>>>
>>>>>>> So I put classloader.parent-first-patterns.additional: "org.apache.ignite." in the task config and so far I don't think I'm getting "java.lang.OutOfMemoryError: Metaspace" any more.
>>>>>>>
>>>>>>> Or it's too early to tell.
>>>>>>>
>>>>>>> Though now, the task managers are shutting down due to some other failures.
>>>>>>>
>>>>>>> So maybe because tasks were failing and reloading often the task manager was running out of Metaspace. But now maybe it's just cleanly shutting down.
>>>>>>>
>>>>>>> On Wed, Apr 20, 2022 at 11:35 AM John Smith <java.dev....@gmail.com> wrote:
>>>>>>>
>>>>>>>> Or I can put in the config to treat org.apache.ignite. classes as first class?
>>>>>>>>
>>>>>>>> On Tue, Apr 19, 2022 at 10:18 PM John Smith <java.dev....@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> Ok, so I loaded the dump into Eclipse MAT and followed:
>>>>>>>>> https://cwiki.apache.org/confluence/display/FLINK/Debugging+ClassLoader+leaks
>>>>>>>>>
>>>>>>>>> - On the Histogram, I got over 30 entries for: ChildFirstClassLoader
>>>>>>>>> - Then I clicked on one of them "Merge Shortest Path..." and picked "Exclude all phantom/weak/soft references"
>>>>>>>>> - Which then gave me: SqlDriverManager > Apache Ignite JdbcThin Driver
>>>>>>>>>
>>>>>>>>> So I'm guessing anything JDBC based. I should copy into the task manager libs folder and my jobs make the dependencies as compile only?
>>>>>>>>>
>>>>>>>>> On Tue, Apr 19, 2022 at 12:18 PM Yaroslav Tkachenko <yaros...@goldsky.io> wrote:
>>>>>>>>>
>>>>>>>>>> Also https://shopify.engineering/optimizing-apache-flink-applications-tips might be helpful (has a section on profiling, as well as classloading).
>>>>>>>>>>
>>>>>>>>>> On Tue, Apr 19, 2022 at 4:35 AM Chesnay Schepler <ches...@apache.org> wrote:
>>>>>>>>>>
>>>>>>>>>>> We have a very rough "guide" in the wiki (it's just the specific steps I took to debug another leak):
>>>>>>>>>>>
>>>>>>>>>>> https://cwiki.apache.org/confluence/display/FLINK/Debugging+ClassLoader+leaks
>>>>>>>>>>>
>>>>>>>>>>> On 19/04/2022 12:01, huweihua wrote:
>>>>>>>>>>>
>>>>>>>>>>> Hi, John
>>>>>>>>>>>
>>>>>>>>>>> Sorry for the late reply. You can use MAT[1] to analyze the dump file. Check whether there are too many loaded classes.
>>>>>>>>>>>
>>>>>>>>>>> [1] https://www.eclipse.org/mat/
>>>>>>>>>>>
>>>>>>>>>>> On Apr 18, 2022, at 9:55 PM, John Smith <java.dev....@gmail.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>> Hi, can anyone help with this? I never looked at a dump file before.
>>>>>>>>>>>
>>>>>>>>>>> On Thu, Apr 14, 2022 at 11:59 AM John Smith <java.dev....@gmail.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Hi, so I have a dump file. What do I look for?
>>>>>>>>>>>>
>>>>>>>>>>>> On Thu, Mar 31, 2022 at 3:28 PM John Smith <java.dev....@gmail.com> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Ok so if there's a leak, if I manually stop the job and restart it from the UI multiple times, I won't see the issue because the classes are unloaded correctly?
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Thu, Mar 31, 2022 at 9:20 AM huweihua <huweihua....@gmail.com> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> The difference is that manually canceling the job stops the JobMaster, but automatic failover keeps the JobMaster running. But looking at the TaskManager, it doesn't make much difference.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Mar 31, 2022, at 4:01 AM, John Smith <java.dev....@gmail.com> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Also if I manually cancel and restart the same job over and over is it the same as if flink was restarting a job due to failure?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I.e.: When I click "Cancel Job" on the UI is the job completely unloaded vs when the job scheduler restarts a job for whatever reason?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Like this I'll stop and restart the job a few times or maybe I can trick my job to fail and have the scheduler restart it. Ok let me think about this...
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Wed, Mar 30, 2022 at 10:24 AM 胡伟华 <huweihua....@gmail.com> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> So if I run the same jobs in my dev env will I still be able to see the similar dump?
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I think running the same job in dev should be reproducible, maybe you can have a try.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> If not I would have to wait at a low volume time to do it on production. Also if I recall the dump is as big as the JVM memory right so if I have 10GB configed for the JVM the dump will be 10GB file?
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Yes, JMAP will pause the JVM, the time of pause depends on the size to dump.
>>>>>>>>>>>>>>> You can use "jmap -dump:live" to dump only the reachable objects, this will take a brief pause
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Mar 30, 2022, at 9:47 PM, John Smith <java.dev....@gmail.com> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I have 3 task managers (see config below). There is total of 10 jobs with 25 slots being used.
>>>>>>>>>>>>>>> The jobs are 100% ETL, i.e. they load Json, transform it and push it to JDBC, only 1 job of the 10 is pushing to Apache Ignite cluster.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> FOR JMAP. I know that it will pause the task manager. So if I run the same jobs in my dev env will I still be able to see the similar dump? I assume so. If not I would have to wait at a low volume time to do it on production. Also if I recall the dump is as big as the JVM memory right so if I have 10GB configed for the JVM the dump will be 10GB file?
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> # Operating system has 16GB total.
>>>>>>>>>>>>>>> env.ssh.opts: -l flink -oStrictHostKeyChecking=no
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> cluster.evenly-spread-out-slots: true
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> taskmanager.memory.flink.size: 10240m
>>>>>>>>>>>>>>> taskmanager.memory.jvm-metaspace.size: 2048m
>>>>>>>>>>>>>>> taskmanager.numberOfTaskSlots: 16
>>>>>>>>>>>>>>> parallelism.default: 1
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> high-availability: zookeeper
>>>>>>>>>>>>>>> high-availability.storageDir: file:///mnt/flink/ha/flink_1_14/
>>>>>>>>>>>>>>> high-availability.zookeeper.quorum: ...
>>>>>>>>>>>>>>> high-availability.zookeeper.path.root: /flink_1_14
>>>>>>>>>>>>>>> high-availability.cluster-id: /flink_1_14_cluster_0001
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> web.upload.dir: /mnt/flink/uploads/flink_1_14
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> state.backend: rocksdb
>>>>>>>>>>>>>>> state.backend.incremental: true
>>>>>>>>>>>>>>> state.checkpoints.dir: file:///mnt/flink/checkpoints/flink_1_14
>>>>>>>>>>>>>>> state.savepoints.dir: file:///mnt/flink/savepoints/flink_1_14
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Wed, Mar 30, 2022 at 2:16 AM 胡伟华 <huweihua....@gmail.com> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Hi, John
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Could you tell us your application scenario? Is it a flink session cluster with a lot of jobs?
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Maybe you can try to dump the memory with jmap and use tools such as MAT to analyze whether there are abnormal classes and classloaders
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> > On Mar 30, 2022, at 6:09 AM, John Smith <java.dev....@gmail.com> wrote:
>>>>>>>>>>>>>>>> >
>>>>>>>>>>>>>>>> > Hi running 1.14.4
>>>>>>>>>>>>>>>> >
>>>>>>>>>>>>>>>> > My task manager still fails with java.lang.OutOfMemoryError: Metaspace. The metaspace out-of-memory error has occurred. This can mean two things: either the job requires a larger size of JVM metaspace to load classes or there is a class loading leak.
>>>>>>>>>>>>>>>> >
>>>>>>>>>>>>>>>> > I have 2GB of metaspace configured: taskmanager.memory.jvm-metaspace.size: 2048m
>>>>>>>>>>>>>>>> >
>>>>>>>>>>>>>>>> > But the task nodes still fail.
>>>>>>>>>>>>>>>> >
>>>>>>>>>>>>>>>> > When looking at the UI metrics, the metaspace starts low. Now I see 85% usage.
>>>>>>>>>>>>>>>> > It seems to be a class loading leak at this point, how can we debug this issue?