You sure?
- *JDBC*: JDBC drivers leak references outside the user code classloader. To
ensure that these classes are only loaded once you should either add the
driver jars to Flink's lib/ folder, or add the driver classes to the list of
parent-first loaded classes via classloader.parent-first-patterns-additional
<https://nightlies.apache.org/flink/flink-docs-master/docs/deployment/config/#classloader-parent-first-patterns-additional>.

It says either/or.

On Wed, Apr 27, 2022 at 3:44 AM Chesnay Schepler <ches...@apache.org> wrote:

> You're misinterpreting the docs.
>
> The parent/child-first classloading controls where Flink looks for a class
> *first*, specifically whether we first load from /lib or the user-jar.
> It does not allow you to load something from the user-jar in the parent
> classloader. That's just not how it works.
>
> It must be in /lib.
>
> On 27/04/2022 04:59, John Smith wrote:
>
> Hi Chesnay, as per the docs...
> https://nightlies.apache.org/flink/flink-docs-master/docs/ops/debugging/debugging_classloading/
>
> You can either put the jars in the task manager lib folder or use
> classloader.parent-first-patterns-additional
> <https://nightlies.apache.org/flink/flink-docs-master/docs/deployment/config/#classloader-parent-first-patterns-additional>
>
> I prefer the latter: this way the dependency stays with the user-jar and
> not on the task manager.
>
> On Tue, Apr 26, 2022 at 9:52 PM John Smith <java.dev....@gmail.com> wrote:
>
>> Ok, so I should put the Apache Ignite and my Microsoft drivers in the lib
>> folders of my task managers?
>>
>> And then in my job jar only include them as compile-time dependencies?
>>
>> On Tue, Apr 26, 2022 at 10:42 AM Chesnay Schepler <ches...@apache.org>
>> wrote:
>>
>>> JDBC drivers are well known for leaking classloaders, unfortunately.
>>>
>>> You have correctly identified your alternatives.
>>>
>>> You must put the JDBC driver into /lib instead. Setting only the
>>> parent-first pattern shouldn't affect anything. That is only relevant if
>>> something is in both /lib and the user-jar, telling Flink to prioritize
>>> what is in lib.
>>>
>>> On 26/04/2022 15:35, John Smith wrote:
>>>
>>> So I put classloader.parent-first-patterns.additional:
>>> "org.apache.ignite." in the task manager config and so far I don't think
>>> I'm getting "java.lang.OutOfMemoryError: Metaspace" any more.
>>>
>>> Or it's too early to tell.
>>>
>>> Though now, the task managers are shutting down due to some other
>>> failures.
>>>
>>> So maybe because tasks were failing and reloading often, the task manager
>>> was running out of Metaspace. But now maybe it's just cleanly shutting
>>> down.
>>>
>>> On Wed, Apr 20, 2022 at 11:35 AM John Smith <java.dev....@gmail.com>
>>> wrote:
>>>
>>>> Or I can put in the config to treat org.apache.ignite. classes as
>>>> parent-first?
>>>>
>>>> On Tue, Apr 19, 2022 at 10:18 PM John Smith <java.dev....@gmail.com>
>>>> wrote:
>>>>
>>>>> Ok, so I loaded the dump into Eclipse MAT and followed:
>>>>> https://cwiki.apache.org/confluence/display/FLINK/Debugging+ClassLoader+leaks
>>>>>
>>>>> - On the Histogram, I got over 30 entries for: ChildFirstClassLoader
>>>>> - Then I clicked on one of them, chose "Merge Shortest Paths to GC
>>>>>   Roots..." and picked "Exclude all phantom/weak/soft references"
>>>>> - Which then gave me: SqlDriverManager > Apache Ignite JdbcThin Driver
>>>>>
>>>>> So I'm guessing anything JDBC-based. I should copy the drivers into the
>>>>> task manager lib folder and have my jobs declare the dependencies as
>>>>> compile-only?
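To spell out why that MAT path points at a leak: java.sql.DriverManager is loaded by the JVM's application classloader (Flink's parent classloader), while the Ignite and Microsoft Driver instances it registers were loaded from the user-jar, so DriverManager's static registry keeps each job's ChildFirstClassLoader, and everything it loaded, reachable across restarts. Below is a rough, illustrative sketch of that relationship plus a deregistration workaround some jobs apply from a RichFunction's close(); the class and method names are made up for the example, and this is not what the thread settles on as the primary fix:

    import java.sql.Driver;
    import java.sql.DriverManager;
    import java.sql.SQLException;
    import java.util.Enumeration;

    public class JdbcDriverLeakWorkaround {

        // DriverManager keeps a static, JVM-wide list of registered drivers.
        // The Driver classes themselves were loaded by the job's
        // ChildFirstClassLoader, so every entry in that list pins the job's
        // classloader (and all classes it loaded) in metaspace.

        // Best-effort cleanup, e.g. called from a RichFunction#close() as
        // deregisterDriversLoadedBy(getClass().getClassLoader()):
        // deregister any driver that this job's classloader loaded.
        public static void deregisterDriversLoadedBy(ClassLoader jobClassLoader) {
            Enumeration<Driver> drivers = DriverManager.getDrivers();
            while (drivers.hasMoreElements()) {
                Driver driver = drivers.nextElement();
                if (driver.getClass().getClassLoader() == jobClassLoader) {
                    try {
                        DriverManager.deregisterDriver(driver);
                    } catch (SQLException e) {
                        // Nothing useful to do here; cleanup is best effort.
                    }
                }
            }
        }
    }

As Chesnay says above, moving the driver jar into /lib (and keeping it as a compile-only dependency in the job jar) is the more reliable fix; the sketch only covers drivers that the job itself registered.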
>>>>> On Tue, Apr 19, 2022 at 12:18 PM Yaroslav Tkachenko <yaros...@goldsky.io>
>>>>> wrote:
>>>>>
>>>>>> Also https://shopify.engineering/optimizing-apache-flink-applications-tips
>>>>>> might be helpful (it has a section on profiling, as well as classloading).
>>>>>>
>>>>>> On Tue, Apr 19, 2022 at 4:35 AM Chesnay Schepler <ches...@apache.org>
>>>>>> wrote:
>>>>>>
>>>>>>> We have a very rough "guide" in the wiki (it's just the specific steps
>>>>>>> I took to debug another leak):
>>>>>>> https://cwiki.apache.org/confluence/display/FLINK/Debugging+ClassLoader+leaks
>>>>>>>
>>>>>>> On 19/04/2022 12:01, huweihua wrote:
>>>>>>>
>>>>>>> Hi, John
>>>>>>>
>>>>>>> Sorry for the late reply. You can use MAT [1] to analyze the dump file
>>>>>>> and check whether too many classes have been loaded.
>>>>>>>
>>>>>>> [1] https://www.eclipse.org/mat/
>>>>>>>
>>>>>>> On April 18, 2022 at 9:55 PM, John Smith <java.dev....@gmail.com> wrote:
>>>>>>>
>>>>>>> Hi, can anyone help with this? I have never looked at a dump file
>>>>>>> before.
>>>>>>>
>>>>>>> On Thu, Apr 14, 2022 at 11:59 AM John Smith <java.dev....@gmail.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Hi, so I have a dump file. What do I look for?
>>>>>>>>
>>>>>>>> On Thu, Mar 31, 2022 at 3:28 PM John Smith <java.dev....@gmail.com>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> Ok, so if there's a leak and I manually stop the job and restart it
>>>>>>>>> from the UI multiple times, I won't see the issue because the classes
>>>>>>>>> are unloaded correctly?
>>>>>>>>>
>>>>>>>>> On Thu, Mar 31, 2022 at 9:20 AM huweihua <huweihua....@gmail.com>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>> The difference is that manually canceling the job stops the
>>>>>>>>>> JobMaster, but automatic failover keeps the JobMaster running.
>>>>>>>>>> Looking at the TaskManager, though, it doesn't make much difference.
>>>>>>>>>>
>>>>>>>>>> On March 31, 2022 at 4:01 AM, John Smith <java.dev....@gmail.com>
>>>>>>>>>> wrote:
>>>>>>>>>>
>>>>>>>>>> Also, if I manually cancel and restart the same job over and over,
>>>>>>>>>> is it the same as if Flink was restarting a job due to failure?
>>>>>>>>>>
>>>>>>>>>> I.e. when I click "Cancel Job" in the UI, is the job completely
>>>>>>>>>> unloaded, versus when the job scheduler restarts a job for whatever
>>>>>>>>>> reason?
>>>>>>>>>>
>>>>>>>>>> Like this I'll stop and restart the job a few times, or maybe I can
>>>>>>>>>> trick my job into failing and have the scheduler restart it. Ok, let
>>>>>>>>>> me think about this...
>>>>>>>>>>
>>>>>>>>>> On Wed, Mar 30, 2022 at 10:24 AM 胡伟华 <huweihua....@gmail.com>
>>>>>>>>>> wrote:
>>>>>>>>>>
>>>>>>>>>>> So if I run the same jobs in my dev env will I still be able to
>>>>>>>>>>> see a similar dump?
>>>>>>>>>>>
>>>>>>>>>>> I think running the same job in dev should be reproducible; maybe
>>>>>>>>>>> you can have a try.
>>>>>>>>>>>
>>>>>>>>>>> If not I would have to wait for a low-volume time to do it on
>>>>>>>>>>> production. Also, if I recall, the dump is as big as the JVM
>>>>>>>>>>> memory, right? So if I have 10GB configured for the JVM the dump
>>>>>>>>>>> will be a 10GB file?
>>>>>>>>>>>
>>>>>>>>>>> Yes, jmap will pause the JVM; the length of the pause depends on
>>>>>>>>>>> the size of the dump. You can use "jmap -dump:live" to dump only
>>>>>>>>>>> the reachable objects, which takes only a brief pause.
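A side note on taking the dump huweihua describes just above: if attaching jmap on the production box is awkward, the same live-objects dump can be triggered from inside the JVM through the HotSpotDiagnostic MXBean. This is only a sketch under that assumption (the file path is illustrative), not something mentioned in the thread:

    import com.sun.management.HotSpotDiagnosticMXBean;
    import java.lang.management.ManagementFactory;

    public class HeapDumpHelper {

        // Rough in-process equivalent of "jmap -dump:live,format=b,file=<path> <pid>".
        // The boolean argument restricts the dump to live (reachable) objects,
        // which keeps the file smaller and the pause shorter.
        public static void dumpHeap(String path) throws Exception {
            HotSpotDiagnosticMXBean bean = ManagementFactory.newPlatformMXBeanProxy(
                    ManagementFactory.getPlatformMBeanServer(),
                    "com.sun.management:type=HotSpotDiagnostic",
                    HotSpotDiagnosticMXBean.class);
            bean.dumpHeap(path, true);
        }

        public static void main(String[] args) throws Exception {
            // The target file must not already exist; the path is illustrative.
            dumpHeap("/tmp/taskmanager.hprof");
        }
    }

The resulting .hprof file can be opened in MAT just like a jmap dump.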
>>>>>>>>>>> On March 30, 2022 at 9:47 PM, John Smith <java.dev....@gmail.com>
>>>>>>>>>>> wrote:
>>>>>>>>>>>
>>>>>>>>>>> I have 3 task managers (see config below). There is a total of 10
>>>>>>>>>>> jobs with 25 slots being used.
>>>>>>>>>>> The jobs are 100% ETL, i.e. they load JSON, transform it and push
>>>>>>>>>>> it to JDBC; only 1 job of the 10 is pushing to an Apache Ignite
>>>>>>>>>>> cluster.
>>>>>>>>>>>
>>>>>>>>>>> FOR JMAP: I know that it will pause the task manager. So if I run
>>>>>>>>>>> the same jobs in my dev env will I still be able to see a similar
>>>>>>>>>>> dump? I assume so. If not I would have to wait for a low-volume
>>>>>>>>>>> time to do it on production. Also, if I recall, the dump is as big
>>>>>>>>>>> as the JVM memory, right? So if I have 10GB configured for the JVM
>>>>>>>>>>> the dump will be a 10GB file?
>>>>>>>>>>>
>>>>>>>>>>> # Operating system has 16GB total.
>>>>>>>>>>> env.ssh.opts: -l flink -oStrictHostKeyChecking=no
>>>>>>>>>>>
>>>>>>>>>>> cluster.evenly-spread-out-slots: true
>>>>>>>>>>>
>>>>>>>>>>> taskmanager.memory.flink.size: 10240m
>>>>>>>>>>> taskmanager.memory.jvm-metaspace.size: 2048m
>>>>>>>>>>> taskmanager.numberOfTaskSlots: 16
>>>>>>>>>>> parallelism.default: 1
>>>>>>>>>>>
>>>>>>>>>>> high-availability: zookeeper
>>>>>>>>>>> high-availability.storageDir: file:///mnt/flink/ha/flink_1_14/
>>>>>>>>>>> high-availability.zookeeper.quorum: ...
>>>>>>>>>>> high-availability.zookeeper.path.root: /flink_1_14
>>>>>>>>>>> high-availability.cluster-id: /flink_1_14_cluster_0001
>>>>>>>>>>>
>>>>>>>>>>> web.upload.dir: /mnt/flink/uploads/flink_1_14
>>>>>>>>>>>
>>>>>>>>>>> state.backend: rocksdb
>>>>>>>>>>> state.backend.incremental: true
>>>>>>>>>>> state.checkpoints.dir: file:///mnt/flink/checkpoints/flink_1_14
>>>>>>>>>>> state.savepoints.dir: file:///mnt/flink/savepoints/flink_1_14
>>>>>>>>>>>
>>>>>>>>>>> On Wed, Mar 30, 2022 at 2:16 AM 胡伟华 <huweihua....@gmail.com>
>>>>>>>>>>> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Hi, John
>>>>>>>>>>>>
>>>>>>>>>>>> Could you tell us your application scenario? Is it a Flink
>>>>>>>>>>>> session cluster with a lot of jobs?
>>>>>>>>>>>>
>>>>>>>>>>>> Maybe you can try to dump the memory with jmap and use tools such
>>>>>>>>>>>> as MAT to analyze whether there are abnormal classes and
>>>>>>>>>>>> classloaders.
>>>>>>>>>>>>
>>>>>>>>>>>> On March 30, 2022 at 6:09 AM, John Smith <java.dev....@gmail.com>
>>>>>>>>>>>> wrote:
>>>>>>>>>>>> >
>>>>>>>>>>>> > Hi, running 1.14.4.
>>>>>>>>>>>> >
>>>>>>>>>>>> > My task managers still fail with java.lang.OutOfMemoryError:
>>>>>>>>>>>> Metaspace. The metaspace out-of-memory error has occurred. This
>>>>>>>>>>>> can mean two things: either the job requires a larger size of JVM
>>>>>>>>>>>> metaspace to load classes or there is a class loading leak.
>>>>>>>>>>>> >
>>>>>>>>>>>> > I have 2GB of metaspace configured:
>>>>>>>>>>>> taskmanager.memory.jvm-metaspace.size: 2048m
>>>>>>>>>>>> >
>>>>>>>>>>>> > But the task nodes still fail.
>>>>>>>>>>>> >
>>>>>>>>>>>> > When looking at the UI metrics, the metaspace starts low. Now I
>>>>>>>>>>>> see 85% usage. It seems to be a class loading leak at this point;
>>>>>>>>>>>> how can we debug this issue?
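One practical note on the original message just above (metaspace starting low and creeping up to 85%): the same figure can be read from inside the TaskManager JVM, or over JMX, via the memory pool MXBeans, which can help with alerting before the java.lang.OutOfMemoryError: Metaspace actually hits. A minimal sketch, assuming the HotSpot pool name "Metaspace"; it is not something from the thread itself:

    import java.lang.management.ManagementFactory;
    import java.lang.management.MemoryPoolMXBean;
    import java.lang.management.MemoryUsage;

    public class MetaspaceCheck {

        // Prints current metaspace usage, roughly what the Flink UI metric shows.
        public static void main(String[] args) {
            for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
                if ("Metaspace".equals(pool.getName())) {
                    MemoryUsage usage = pool.getUsage();
                    long max = usage.getMax(); // -1 if no MaxMetaspaceSize is set
                    System.out.printf("Metaspace used=%d bytes, max=%d bytes%n",
                            usage.getUsed(), max);
                }
            }
        }
    }

If that number keeps climbing across job restarts even after the drivers are moved to lib/, it points to another classloading leak rather than an undersized taskmanager.memory.jvm-metaspace.size.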