Why do the JDBC jars need to be on the job manager node though?

On Mon, May 2, 2022 at 9:36 AM Chesnay Schepler <ches...@apache.org> wrote:
> yes.
> But if you can ensure that the driver isn't bundled by any user-jar you
> can also skip the pattern configuration step.
>
> The pattern looks correct formatting-wise; you could try whether
> com.microsoft.sqlserver.jdbc. is enough to solve the issue.
>
> On 02/05/2022 14:41, John Smith wrote:
>
> Oh, so I should copy the jars to the lib folder and
> set classloader.parent-first-patterns.additional:
> "org.apache.ignite.;com.microsoft.sqlserver.jdbc." on both the task
> managers and job managers?
>
> Also is my pattern correct?
> "org.apache.ignite.;com.microsoft.sqlserver.jdbc."
>
> Just to be sure, I'm running a standalone cluster using ZooKeeper. So I
> have 3 ZooKeepers, 3 job managers and 3 task managers.
>
> On Mon, May 2, 2022 at 2:57 AM Chesnay Schepler <ches...@apache.org>
> wrote:
>
>> And you should make sure that it is set for both processes!
>>
>> On 02/05/2022 08:43, Chesnay Schepler wrote:
>>
>> The setting itself isn't taskmanager specific; it applies to both the
>> job- and taskmanager process.
>>
>> On 02/05/2022 05:29, John Smith wrote:
>>
>> Also just to be sure, this is a Task Manager setting, right?
>>
>> On Thu, Apr 28, 2022 at 11:13 AM John Smith <java.dev....@gmail.com>
>> wrote:
>>
>>> I assume you will take action on your side to track and fix the doc? :)
>>>
>>> On Thu, Apr 28, 2022 at 11:12 AM John Smith <java.dev....@gmail.com>
>>> wrote:
>>>
>>>> Ok so to summarize...
>>>>
>>>> - Build my job jar and have the JDBC driver as a compile only
>>>> dependency and copy the JDBC driver to flink lib folder.
>>>>
>>>> Or
>>>>
>>>> - Build my job jar and include JDBC driver in the shadow, plus copy the
>>>> JDBC driver in the flink lib folder, plus make an entry in config for
>>>> classloader.parent-first-patterns-additional
>>>> <https://nightlies.apache.org/flink/flink-docs-master/docs/deployment/config/#classloader-parent-first-patterns-additional>
>>>>
>>>> On Thu, Apr 28, 2022 at 10:17 AM Chesnay Schepler <ches...@apache.org>
>>>> wrote:
>>>>
>>>>> I think what I meant was "either add it to /lib, or [if it is already
>>>>> in /lib but also bundled in the jar] add it to the parent-first patterns."
>>>>>
>>>>> On 28/04/2022 15:56, Chesnay Schepler wrote:
>>>>>
>>>>> Pretty sure, even though I seemingly documented it incorrectly :)
>>>>>
>>>>> On 28/04/2022 15:49, John Smith wrote:
>>>>>
>>>>> You sure?
>>>>>
>>>>> - *JDBC*: JDBC drivers leak references outside the user code
>>>>> classloader. To ensure that these classes are only loaded once you
>>>>> should either add the driver jars to Flink’s lib/ folder, or add the
>>>>> driver classes to the list of parent-first loaded class via
>>>>> classloader.parent-first-patterns-additional
>>>>> <https://nightlies.apache.org/flink/flink-docs-master/docs/deployment/config/#classloader-parent-first-patterns-additional>.
>>>>>
>>>>> It says either or
>>>>>
>>>>> On Wed, Apr 27, 2022 at 3:44 AM Chesnay Schepler <ches...@apache.org>
>>>>> wrote:
>>>>>
>>>>>> You're misinterpreting the docs.
>>>>>>
>>>>>> The parent/child-first classloading controls where Flink looks for a
>>>>>> class *first*, specifically whether we first load from /lib or the
>>>>>> user-jar.
>>>>>> It does not allow you to load something from the user-jar in the
>>>>>> parent classloader. That's just not how it works.
>>>>>>
>>>>>> It must be in /lib.
>>>>>>
>>>>>> On 27/04/2022 04:59, John Smith wrote:
>>>>>>
>>>>>> Hi Chesnay as per the docs...
>>>>>> https://nightlies.apache.org/flink/flink-docs-master/docs/ops/debugging/debugging_classloading/
>>>>>>
>>>>>> You can either put the jars in task manager lib folder or use
>>>>>> classloader.parent-first-patterns-additional
>>>>>> <https://nightlies.apache.org/flink/flink-docs-master/docs/deployment/config/#classloader-parent-first-patterns-additional>
>>>>>>
>>>>>> I prefer the latter; that way the dependency stays with the user-jar
>>>>>> and not on the task manager.
>>>>>>
>>>>>> On Tue, Apr 26, 2022 at 9:52 PM John Smith <java.dev....@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>> Ok so I should put the Apache Ignite and my Microsoft drivers in the
>>>>>>> lib folders of my task managers?
>>>>>>>
>>>>>>> And then in my job jar only include them as compile time
>>>>>>> dependencies?
>>>>>>>
>>>>>>> On Tue, Apr 26, 2022 at 10:42 AM Chesnay Schepler
>>>>>>> <ches...@apache.org> wrote:
>>>>>>>
>>>>>>>> JDBC drivers are well-known for leaking classloaders unfortunately.
>>>>>>>>
>>>>>>>> You have correctly identified your alternatives.
>>>>>>>>
>>>>>>>> You must put the jdbc driver into /lib instead. Setting only the
>>>>>>>> parent-first pattern shouldn't affect anything.
>>>>>>>> That is only relevant if something is both in /lib and the
>>>>>>>> user-jar, telling Flink to prioritize what is in lib.
>>>>>>>>
>>>>>>>> On 26/04/2022 15:35, John Smith wrote:
>>>>>>>>
>>>>>>>> So I put classloader.parent-first-patterns.additional:
>>>>>>>> "org.apache.ignite." in the task config and so far I don't think I'm
>>>>>>>> getting "java.lang.OutOfMemoryError: Metaspace" any more.
>>>>>>>>
>>>>>>>> Or it's too early to tell.
>>>>>>>>
>>>>>>>> Though now, the task managers are shutting down due to some
>>>>>>>> other failures.
>>>>>>>>
>>>>>>>> So maybe because tasks were failing and reloading often the task
>>>>>>>> manager was running out of Metaspace. But now maybe it's just
>>>>>>>> cleanly shutting down.
>>>>>>>>
>>>>>>>> On Wed, Apr 20, 2022 at 11:35 AM John Smith <java.dev....@gmail.com>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> Or I can put in the config to treat org.apache.ignite. classes as
>>>>>>>>> first class?
>>>>>>>>>
>>>>>>>>> On Tue, Apr 19, 2022 at 10:18 PM John Smith
>>>>>>>>> <java.dev....@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>>> Ok, so I loaded the dump into Eclipse MAT and followed:
>>>>>>>>>> https://cwiki.apache.org/confluence/display/FLINK/Debugging+ClassLoader+leaks
>>>>>>>>>>
>>>>>>>>>> - On the Histogram, I got over 30 entries for:
>>>>>>>>>> ChildFirstClassLoader
>>>>>>>>>> - Then I clicked on one of them "Merge Shortest Path..." and
>>>>>>>>>> picked "Exclude all phantom/weak/soft references"
>>>>>>>>>> - Which then gave me: SqlDriverManager > Apache Ignite JdbcThin
>>>>>>>>>> Driver
>>>>>>>>>>
>>>>>>>>>> So I'm guessing this applies to anything JDBC based. Should I copy
>>>>>>>>>> the drivers into the task manager libs folder and make them
>>>>>>>>>> compile-only dependencies in my jobs?
>>>>>>>>>>
>>>>>>>>>> On Tue, Apr 19, 2022 at 12:18 PM Yaroslav Tkachenko
>>>>>>>>>> <yaros...@goldsky.io> wrote:
>>>>>>>>>>
>>>>>>>>>>> Also
>>>>>>>>>>> https://shopify.engineering/optimizing-apache-flink-applications-tips
>>>>>>>>>>> might be helpful (has a section on profiling, as well as
>>>>>>>>>>> classloading).
>>>>>>>>>>>
>>>>>>>>>>> On Tue, Apr 19, 2022 at 4:35 AM Chesnay Schepler
>>>>>>>>>>> <ches...@apache.org> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> We have a very rough "guide" in the wiki (it's just the
>>>>>>>>>>>> specific steps I took to debug another leak):
>>>>>>>>>>>>
>>>>>>>>>>>> https://cwiki.apache.org/confluence/display/FLINK/Debugging+ClassLoader+leaks
>>>>>>>>>>>>
>>>>>>>>>>>> On 19/04/2022 12:01, huweihua wrote:
>>>>>>>>>>>>
>>>>>>>>>>>> Hi, John
>>>>>>>>>>>>
>>>>>>>>>>>> Sorry for the late reply. You can use MAT[1] to analyze the
>>>>>>>>>>>> dump file. Check whether there are too many loaded classes.
>>>>>>>>>>>>
>>>>>>>>>>>> [1] https://www.eclipse.org/mat/
>>>>>>>>>>>>
>>>>>>>>>>>> On Apr 18, 2022, at 9:55 PM, John Smith <java.dev....@gmail.com> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>> Hi, can anyone help with this? I have never looked at a dump file
>>>>>>>>>>>> before.
>>>>>>>>>>>>
>>>>>>>>>>>> On Thu, Apr 14, 2022 at 11:59 AM John Smith
>>>>>>>>>>>> <java.dev....@gmail.com> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Hi, so I have a dump file. What do I look for?
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Thu, Mar 31, 2022 at 3:28 PM John Smith
>>>>>>>>>>>>> <java.dev....@gmail.com> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Ok so if there's a leak, if I manually stop the job and
>>>>>>>>>>>>>> restart it from the UI multiple times, I won't see the issue
>>>>>>>>>>>>>> because the classes are unloaded correctly?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Thu, Mar 31, 2022 at 9:20 AM huweihua
>>>>>>>>>>>>>> <huweihua....@gmail.com> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> The difference is that manually canceling the job stops the
>>>>>>>>>>>>>>> JobMaster, but automatic failover keeps the JobMaster running.
>>>>>>>>>>>>>>> But from the TaskManager's side, it doesn't make much difference.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Mar 31, 2022, at 4:01 AM, John Smith <java.dev....@gmail.com> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Also if I manually cancel and restart the same job over and
>>>>>>>>>>>>>>> over, is it the same as if Flink was restarting a job due to
>>>>>>>>>>>>>>> failure?
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I.e., when I click "Cancel Job" in the UI, is the job
>>>>>>>>>>>>>>> completely unloaded, versus when the job scheduler restarts a job
>>>>>>>>>>>>>>> for whatever reason?
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> That way I can stop and restart the job a few times, or maybe
>>>>>>>>>>>>>>> I can trick my job into failing and have the scheduler restart it.
>>>>>>>>>>>>>>> Ok, let me think about this...
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Wed, Mar 30, 2022 at 10:24 AM 胡伟华 <huweihua....@gmail.com>
>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> So if I run the same jobs in my dev env will I still be
>>>>>>>>>>>>>>>> able to see a similar dump?
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> I think it should be reproducible by running the same job in dev;
>>>>>>>>>>>>>>>> maybe you can give it a try.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> If not I would have to wait for a low volume time to do it
>>>>>>>>>>>>>>>> on production. Also, if I recall, the dump is as big as the JVM
>>>>>>>>>>>>>>>> memory, right? So if I have 10GB configured for the JVM the dump
>>>>>>>>>>>>>>>> will be a 10GB file?
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Yes, jmap will pause the JVM; the length of the pause depends on
>>>>>>>>>>>>>>>> the size of the dump. You can use "jmap -dump:live" to dump only
>>>>>>>>>>>>>>>> the reachable objects, which will keep the pause brief.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On Mar 30, 2022, at 9:47 PM, John Smith <java.dev....@gmail.com> wrote:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> I have 3 task managers (see config below). There are 10 jobs
>>>>>>>>>>>>>>>> in total, using 25 slots.
>>>>>>>>>>>>>>>> The jobs are 100% ETL, i.e., they load JSON, transform it and
>>>>>>>>>>>>>>>> push it to JDBC; only 1 job of the 10 is pushing to an Apache
>>>>>>>>>>>>>>>> Ignite cluster.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> For jmap: I know that it will pause the task manager. So if
>>>>>>>>>>>>>>>> I run the same jobs in my dev env will I still be able to see
>>>>>>>>>>>>>>>> a similar dump? I assume so. If not I would have to wait for a
>>>>>>>>>>>>>>>> low volume time to do it on production. Also, if I recall, the
>>>>>>>>>>>>>>>> dump is as big as the JVM memory, right? So if I have 10GB
>>>>>>>>>>>>>>>> configured for the JVM the dump will be a 10GB file?
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> # Operating system has 16GB total.
>>>>>>>>>>>>>>>> env.ssh.opts: -l flink -oStrictHostKeyChecking=no
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> cluster.evenly-spread-out-slots: true
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> taskmanager.memory.flink.size: 10240m
>>>>>>>>>>>>>>>> taskmanager.memory.jvm-metaspace.size: 2048m
>>>>>>>>>>>>>>>> taskmanager.numberOfTaskSlots: 16
>>>>>>>>>>>>>>>> parallelism.default: 1
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> high-availability: zookeeper
>>>>>>>>>>>>>>>> high-availability.storageDir: file:///mnt/flink/ha/flink_1_14/
>>>>>>>>>>>>>>>> high-availability.zookeeper.quorum: ...
>>>>>>>>>>>>>>>> high-availability.zookeeper.path.root: /flink_1_14
>>>>>>>>>>>>>>>> high-availability.cluster-id: /flink_1_14_cluster_0001
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> web.upload.dir: /mnt/flink/uploads/flink_1_14
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> state.backend: rocksdb
>>>>>>>>>>>>>>>> state.backend.incremental: true
>>>>>>>>>>>>>>>> state.checkpoints.dir: file:///mnt/flink/checkpoints/flink_1_14
>>>>>>>>>>>>>>>> state.savepoints.dir: file:///mnt/flink/savepoints/flink_1_14
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On Wed, Mar 30, 2022 at 2:16 AM 胡伟华 <huweihua....@gmail.com>
>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Hi, John
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Could you tell us your application scenario? Is it a Flink
>>>>>>>>>>>>>>>>> session cluster with a lot of jobs?
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Maybe you can try to dump the memory with jmap and use
>>>>>>>>>>>>>>>>> tools such as MAT to analyze whether there are abnormal
>>>>>>>>>>>>>>>>> classes and classloaders.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> > On Mar 30, 2022, at 6:09 AM, John Smith <java.dev....@gmail.com> wrote:
>>>>>>>>>>>>>>>>> >
>>>>>>>>>>>>>>>>> > Hi, running 1.14.4
>>>>>>>>>>>>>>>>> >
>>>>>>>>>>>>>>>>> > My task manager still fails with
>>>>>>>>>>>>>>>>> > java.lang.OutOfMemoryError: Metaspace. The metaspace
>>>>>>>>>>>>>>>>> > out-of-memory error has occurred. This can mean two things:
>>>>>>>>>>>>>>>>> > either the job requires a larger size of JVM metaspace to
>>>>>>>>>>>>>>>>> > load classes or there is a class loading leak.
>>>>>>>>>>>>>>>>> >
>>>>>>>>>>>>>>>>> > I have 2GB of metaspace configured:
>>>>>>>>>>>>>>>>> > taskmanager.memory.jvm-metaspace.size: 2048m
>>>>>>>>>>>>>>>>> >
>>>>>>>>>>>>>>>>> > But the task nodes still fail.
>>>>>>>>>>>>>>>>> >
>>>>>>>>>>>>>>>>> > When looking at the UI metrics, the metaspace starts
>>>>>>>>>>>>>>>>> > low. Now I see 85% usage. It seems to be a class loading leak
>>>>>>>>>>>>>>>>> > at this point. How can we debug this issue?
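
For reference, the outcome of this thread can be sketched as a flink-conf.yaml fragment. This is only a sketch under the thread's own assumptions: the Apache Ignite and SQL Server JDBC driver jars have already been copied into Flink's lib/ folder on every job manager and task manager, and the pattern value is the one proposed above. As noted in the replies, the entry should only be needed when a user-jar also bundles the drivers.

```yaml
# flink-conf.yaml (set for both the job manager and task manager processes)
# Assumes the driver jars are already in Flink's lib/ folder.
# Semicolon-separated class-name prefixes that Flink resolves from lib/
# (parent-first) before looking in the user-jar.
classloader.parent-first-patterns.additional: "org.apache.ignite.;com.microsoft.sqlserver.jdbc."
```

If the job jar declares the drivers as compile-only dependencies, so they are not bundled, this entry can be left out and copying the jars to lib/ is sufficient.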