Also, just to be sure: this is a Task Manager setting, right?

On Thu, Apr 28, 2022 at 11:13 AM John Smith <java.dev....@gmail.com> wrote:
> I assume you will take action on your side to track and fix the doc? :)
>
> On Thu, Apr 28, 2022 at 11:12 AM John Smith <java.dev....@gmail.com> wrote:
>
>> Ok, so to summarize...
>>
>> - Build my job jar with the JDBC driver as a compile-only dependency, and copy the JDBC driver to the Flink lib folder.
>>
>> Or
>>
>> - Build my job jar and include the JDBC driver in the shadow jar, plus copy the JDBC driver to the Flink lib folder, plus add a config entry for classloader.parent-first-patterns.additional
>> <https://nightlies.apache.org/flink/flink-docs-master/docs/deployment/config/#classloader-parent-first-patterns-additional>
>>
>> On Thu, Apr 28, 2022 at 10:17 AM Chesnay Schepler <ches...@apache.org> wrote:
>>
>>> I think what I meant was "either add it to /lib, or [if it is already in /lib but also bundled in the jar] add it to the parent-first patterns."
>>>
>>> On 28/04/2022 15:56, Chesnay Schepler wrote:
>>>
>>> Pretty sure, even though I seemingly documented it incorrectly :)
>>>
>>> On 28/04/2022 15:49, John Smith wrote:
>>>
>>> You sure?
>>>
>>> - *JDBC*: JDBC drivers leak references outside the user code classloader. To ensure that these classes are only loaded once you should either add the driver jars to Flink’s lib/ folder, or add the driver classes to the list of parent-first loaded class via classloader.parent-first-patterns-additional
>>> <https://nightlies.apache.org/flink/flink-docs-master/docs/deployment/config/#classloader-parent-first-patterns-additional>.
>>>
>>> It says either or
>>>
>>> On Wed, Apr 27, 2022 at 3:44 AM Chesnay Schepler <ches...@apache.org> wrote:
>>>
>>>> You're misinterpreting the docs.
>>>>
>>>> The parent/child-first classloading controls where Flink looks for a class *first*, specifically whether we first load from /lib or the user-jar.
>>>> It does not allow you to load something from the user-jar in the parent classloader.
>>>> That's just not how it works.
>>>>
>>>> It must be in /lib.
>>>>
>>>> On 27/04/2022 04:59, John Smith wrote:
>>>>
>>>> Hi Chesnay, as per the docs...
>>>> https://nightlies.apache.org/flink/flink-docs-master/docs/ops/debugging/debugging_classloading/
>>>>
>>>> You can either put the jars in the task manager lib folder or use classloader.parent-first-patterns.additional
>>>> <https://nightlies.apache.org/flink/flink-docs-master/docs/deployment/config/#classloader-parent-first-patterns-additional>
>>>>
>>>> I prefer the latter: that way the dependency stays with the user-jar and not on the task manager.
>>>>
>>>> On Tue, Apr 26, 2022 at 9:52 PM John Smith <java.dev....@gmail.com> wrote:
>>>>
>>>>> Ok, so I should put the Apache Ignite and my Microsoft drivers in the lib folders of my task managers?
>>>>>
>>>>> And then in my job jar only include them as compile-time dependencies?
>>>>>
>>>>> On Tue, Apr 26, 2022 at 10:42 AM Chesnay Schepler <ches...@apache.org> wrote:
>>>>>
>>>>>> JDBC drivers are unfortunately well known for leaking classloaders.
>>>>>>
>>>>>> You have correctly identified your alternatives.
>>>>>>
>>>>>> You must put the jdbc driver into /lib instead. Setting only the parent-first pattern shouldn't affect anything.
>>>>>> That is only relevant if something is in both /lib and the user-jar, telling Flink to prioritize what is in lib.
>>>>>>
>>>>>> On 26/04/2022 15:35, John Smith wrote:
>>>>>>
>>>>>> So I put classloader.parent-first-patterns.additional: "org.apache.ignite." in the task config and so far I don't think I'm getting "java.lang.OutOfMemoryError: Metaspace" any more.
>>>>>>
>>>>>> Or it's too early to tell.
>>>>>>
>>>>>> Though now, the task managers are shutting down due to some other failures.
>>>>>>
>>>>>> So maybe because tasks were failing and reloading often, the task manager was running out of Metaspace.
>>>>>> But now maybe it's just cleanly shutting down.
>>>>>>
>>>>>> On Wed, Apr 20, 2022 at 11:35 AM John Smith <java.dev....@gmail.com> wrote:
>>>>>>
>>>>>>> Or I can put in the config to treat org.apache.ignite. classes as first class?
>>>>>>>
>>>>>>> On Tue, Apr 19, 2022 at 10:18 PM John Smith <java.dev....@gmail.com> wrote:
>>>>>>>
>>>>>>>> Ok, so I loaded the dump into Eclipse MAT and followed:
>>>>>>>> https://cwiki.apache.org/confluence/display/FLINK/Debugging+ClassLoader+leaks
>>>>>>>>
>>>>>>>> - On the Histogram, I got over 30 entries for: ChildFirstClassLoader
>>>>>>>> - Then I clicked on one of them "Merge Shortest Path..." and picked "Exclude all phantom/weak/soft references"
>>>>>>>> - Which then gave me: SqlDriverManager > Apache Ignite JdbcThin Driver
>>>>>>>>
>>>>>>>> So I'm guessing that anything JDBC based I should copy into the task manager libs folder, and in my jobs make the dependencies compile only?
>>>>>>>>
>>>>>>>> On Tue, Apr 19, 2022 at 12:18 PM Yaroslav Tkachenko <yaros...@goldsky.io> wrote:
>>>>>>>>
>>>>>>>>> Also https://shopify.engineering/optimizing-apache-flink-applications-tips might be helpful (has a section on profiling, as well as classloading).
>>>>>>>>>
>>>>>>>>> On Tue, Apr 19, 2022 at 4:35 AM Chesnay Schepler <ches...@apache.org> wrote:
>>>>>>>>>
>>>>>>>>>> We have a very rough "guide" in the wiki (it's just the specific steps I took to debug another leak):
>>>>>>>>>>
>>>>>>>>>> https://cwiki.apache.org/confluence/display/FLINK/Debugging+ClassLoader+leaks
>>>>>>>>>>
>>>>>>>>>> On 19/04/2022 12:01, huweihua wrote:
>>>>>>>>>>
>>>>>>>>>> Hi, John
>>>>>>>>>>
>>>>>>>>>> Sorry for the late reply. You can use MAT[1] to analyze the dump file. Check whether there are too many loaded classes.
>>>>>>>>>>
>>>>>>>>>> [1] https://www.eclipse.org/mat/
>>>>>>>>>>
>>>>>>>>>> On Apr 18, 2022, at 9:55 PM, John Smith <java.dev....@gmail.com> wrote:
>>>>>>>>>>
>>>>>>>>>> Hi, can anyone help with this? I never looked at a dump file before.
>>>>>>>>>>
>>>>>>>>>> On Thu, Apr 14, 2022 at 11:59 AM John Smith <java.dev....@gmail.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> Hi, so I have a dump file. What do I look for?
>>>>>>>>>>>
>>>>>>>>>>> On Thu, Mar 31, 2022 at 3:28 PM John Smith <java.dev....@gmail.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Ok, so if there's a leak and I manually stop the job and restart it from the UI multiple times, I won't see the issue because the classes are unloaded correctly?
>>>>>>>>>>>>
>>>>>>>>>>>> On Thu, Mar 31, 2022 at 9:20 AM huweihua <huweihua....@gmail.com> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> The difference is that manually canceling the job stops the JobMaster, but automatic failover keeps the JobMaster running. But looking at the TaskManager, it doesn't make much difference.
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Mar 31, 2022, at 4:01 AM, John Smith <java.dev....@gmail.com> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>> Also, if I manually cancel and restart the same job over and over, is it the same as if Flink was restarting a job due to failure?
>>>>>>>>>>>>>
>>>>>>>>>>>>> I.e.: when I click "Cancel Job" on the UI, is the job completely unloaded, vs. when the job scheduler restarts a job for whatever reason?
>>>>>>>>>>>>>
>>>>>>>>>>>>> Like this I'll stop and restart the job a few times, or maybe I can trick my job to fail and have the scheduler restart it. Ok, let me think about this...
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Wed, Mar 30, 2022 at 10:24 AM 胡伟华 <huweihua....@gmail.com> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> So if I run the same jobs in my dev env will I still be able to see the similar dump?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I think running the same job in dev should be reproducible, maybe you can have a try.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> If not I would have to wait at a low volume time to do it on production. Also if I recall the dump is as big as the JVM memory right so if I have 10GB configed for the JVM the dump will be 10GB file?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Yes, jmap will pause the JVM; the length of the pause depends on the size of the dump. You can use "jmap -dump:live" to dump only the reachable objects, which takes only a brief pause.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Mar 30, 2022, at 9:47 PM, John Smith <java.dev....@gmail.com> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I have 3 task managers (see config below). There is a total of 10 jobs with 25 slots being used.
>>>>>>>>>>>>>> The jobs are 100% ETL, i.e. they load JSON, transform it and push it to JDBC; only 1 job of the 10 is pushing to an Apache Ignite cluster.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> For jmap: I know that it will pause the task manager. So if I run the same jobs in my dev env will I still be able to see a similar dump? I assume so. If not I would have to wait for a low-volume time to do it on production. Also, if I recall, the dump is as big as the JVM memory, right? So if I have 10GB configured for the JVM the dump will be a 10GB file?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> # Operating system has 16GB total.
>>>>>>>>>>>>>> env.ssh.opts: -l flink -oStrictHostKeyChecking=no
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> cluster.evenly-spread-out-slots: true
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> taskmanager.memory.flink.size: 10240m
>>>>>>>>>>>>>> taskmanager.memory.jvm-metaspace.size: 2048m
>>>>>>>>>>>>>> taskmanager.numberOfTaskSlots: 16
>>>>>>>>>>>>>> parallelism.default: 1
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> high-availability: zookeeper
>>>>>>>>>>>>>> high-availability.storageDir: file:///mnt/flink/ha/flink_1_14/
>>>>>>>>>>>>>> high-availability.zookeeper.quorum: ...
>>>>>>>>>>>>>> high-availability.zookeeper.path.root: /flink_1_14
>>>>>>>>>>>>>> high-availability.cluster-id: /flink_1_14_cluster_0001
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> web.upload.dir: /mnt/flink/uploads/flink_1_14
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> state.backend: rocksdb
>>>>>>>>>>>>>> state.backend.incremental: true
>>>>>>>>>>>>>> state.checkpoints.dir: file:///mnt/flink/checkpoints/flink_1_14
>>>>>>>>>>>>>> state.savepoints.dir: file:///mnt/flink/savepoints/flink_1_14
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Wed, Mar 30, 2022 at 2:16 AM 胡伟华 <huweihua....@gmail.com> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Hi, John
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Could you tell us your application scenario? Is it a Flink session cluster with a lot of jobs?
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Maybe you can try to dump the memory with jmap and use tools such as MAT to analyze whether there are abnormal classes and classloaders.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> > On Mar 30, 2022, at 6:09 AM, John Smith <java.dev....@gmail.com> wrote:
>>>>>>>>>>>>>>> >
>>>>>>>>>>>>>>> > Hi, running 1.14.4
>>>>>>>>>>>>>>> >
>>>>>>>>>>>>>>> > My task manager still fails with java.lang.OutOfMemoryError: Metaspace. The metaspace out-of-memory error has occurred.
>>>>>>>>>>>>>>> > This can mean two things: either the job requires a larger size of JVM metaspace to load classes or there is a class loading leak.
>>>>>>>>>>>>>>> >
>>>>>>>>>>>>>>> > I have 2GB of metaspace configured: taskmanager.memory.jvm-metaspace.size: 2048m
>>>>>>>>>>>>>>> >
>>>>>>>>>>>>>>> > But the task nodes still fail.
>>>>>>>>>>>>>>> >
>>>>>>>>>>>>>>> > When looking at the UI metrics, the metaspace starts low. Now I see 85% usage. It seems to be a class loading leak at this point. How can we debug this issue?
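
For reference, the setup the replies in this thread converge on can be sketched as a flink-conf.yaml fragment. This is only an illustration under stated assumptions: it assumes the Ignite JDBC driver jar has already been copied into Flink's lib/ folder, and it reuses the "org.apache.ignite." prefix from the message above. The entry should only matter when the driver is also bundled in the job jar; on its own it does not move classes from the user-jar into the parent classloader. The thread does not confirm which processes need the setting.

```yaml
# flink-conf.yaml (sketch, not a verified production config)
# Assumption: the JDBC driver jar is already present in Flink's lib/ folder.
# Only relevant if the same driver is ALSO packaged in the job jar:
# classes matching this prefix are then resolved from lib/ first.
classloader.parent-first-patterns.additional: org.apache.ignite.
```

If the driver is declared as a compile-only dependency and therefore not packaged in the job jar, this entry should not be needed, per the replies above.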