Ok, I don't think I'm running user code on the job manager. Basically, I'm running a standalone cluster:
3 zookeepers, 3 job managers, 3 task managers. I submit my jobs via the UI. But just in case, I'll copy the config over to the job managers.

On Mon, May 2, 2022 at 11:00 AM Chesnay Schepler <ches...@apache.org> wrote:

> There are cases where user-code is run on the JobManager.
> I'm not sure, though, whether that applies to the JDBC sources.
>
> On 02/05/2022 15:45, John Smith wrote:
>
> Why do the JDBC jars need to be on the job manager node though?
>
> On Mon, May 2, 2022 at 9:36 AM Chesnay Schepler <ches...@apache.org> wrote:
>
>> yes.
>> But if you can ensure that the driver isn't bundled by any user-jar you
>> can also skip the pattern configuration step.
>>
>> The pattern looks correct formatting-wise; you could try whether
>> com.microsoft.sqlserver.jdbc. is enough to solve the issue.
>>
>> On 02/05/2022 14:41, John Smith wrote:
>>
>> Oh, so I should copy the jars to the lib folder and
>> set classloader.parent-first-patterns.additional:
>> "org.apache.ignite.;com.microsoft.sqlserver.jdbc." on both the task
>> managers and job managers?
>>
>> Also is my pattern correct?
>> "org.apache.ignite.;com.microsoft.sqlserver.jdbc."
>>
>> Just to be sure, I'm running a standalone cluster using zookeeper. So I
>> have 3 zookeepers, 3 job managers and 3 task managers.
>>
>> On Mon, May 2, 2022 at 2:57 AM Chesnay Schepler <ches...@apache.org> wrote:
>>
>>> And you should make sure that it is set for both processes!
>>>
>>> On 02/05/2022 08:43, Chesnay Schepler wrote:
>>>
>>> The setting itself isn't taskmanager specific; it applies to both the
>>> job- and taskmanager process.
>>>
>>> On 02/05/2022 05:29, John Smith wrote:
>>>
>>> Also just to be sure this is a Task Manager setting right?
>>>
>>> On Thu, Apr 28, 2022 at 11:13 AM John Smith <java.dev....@gmail.com> wrote:
>>>
>>>> I assume you will take action on your side to track and fix the doc?
>>>> :)
>>>>
>>>> On Thu, Apr 28, 2022 at 11:12 AM John Smith <java.dev....@gmail.com> wrote:
>>>>
>>>>> Ok so to summarize...
>>>>>
>>>>> - Build my job jar and have the JDBC driver as a compile only
>>>>>   dependency and copy the JDBC driver to flink lib folder.
>>>>>
>>>>> Or
>>>>>
>>>>> - Build my job jar and include JDBC driver in the shadow, plus copy
>>>>>   the JDBC driver in the flink lib folder, plus make an entry in config for
>>>>>   classloader.parent-first-patterns-additional
>>>>>   <https://nightlies.apache.org/flink/flink-docs-master/docs/deployment/config/#classloader-parent-first-patterns-additional>
>>>>>
>>>>> On Thu, Apr 28, 2022 at 10:17 AM Chesnay Schepler <ches...@apache.org> wrote:
>>>>>
>>>>>> I think what I meant was "either add it to /lib, or [if it is already
>>>>>> in /lib but also bundled in the jar] add it to the parent-first
>>>>>> patterns."
>>>>>>
>>>>>> On 28/04/2022 15:56, Chesnay Schepler wrote:
>>>>>>
>>>>>> Pretty sure, even though I seemingly documented it incorrectly :)
>>>>>>
>>>>>> On 28/04/2022 15:49, John Smith wrote:
>>>>>>
>>>>>> You sure?
>>>>>>
>>>>>> - *JDBC*: JDBC drivers leak references outside the user code
>>>>>>   classloader. To ensure that these classes are only loaded once you
>>>>>>   should either add the driver jars to Flink’s lib/ folder, or add the
>>>>>>   driver classes to the list of parent-first loaded class via
>>>>>>   classloader.parent-first-patterns-additional
>>>>>>   <https://nightlies.apache.org/flink/flink-docs-master/docs/deployment/config/#classloader-parent-first-patterns-additional>.
>>>>>>
>>>>>> It says either or
>>>>>>
>>>>>> On Wed, Apr 27, 2022 at 3:44 AM Chesnay Schepler <ches...@apache.org> wrote:
>>>>>>
>>>>>>> You're misinterpreting the docs.
>>>>>>>
>>>>>>> The parent/child-first classloading controls where Flink looks for a
>>>>>>> class *first*, specifically whether we first load from /lib or the
>>>>>>> user-jar.
>>>>>>> It does not allow you to load something from the user-jar in the
>>>>>>> parent classloader. That's just not how it works.
>>>>>>>
>>>>>>> It must be in /lib.
>>>>>>>
>>>>>>> On 27/04/2022 04:59, John Smith wrote:
>>>>>>>
>>>>>>> Hi Chesnay as per the docs...
>>>>>>> https://nightlies.apache.org/flink/flink-docs-master/docs/ops/debugging/debugging_classloading/
>>>>>>>
>>>>>>> You can either put the jars in task manager lib folder or use
>>>>>>> classloader.parent-first-patterns-additional
>>>>>>> <https://nightlies.apache.org/flink/flink-docs-master/docs/deployment/config/#classloader-parent-first-patterns-additional>
>>>>>>>
>>>>>>> I prefer the latter: that way the dependency stays with the
>>>>>>> user-jar and not on the task manager.
>>>>>>>
>>>>>>> On Tue, Apr 26, 2022 at 9:52 PM John Smith <java.dev....@gmail.com> wrote:
>>>>>>>
>>>>>>>> Ok so I should put the Apache Ignite and my Microsoft drivers in
>>>>>>>> the lib folders of my task managers?
>>>>>>>>
>>>>>>>> And then in my job jar only include them as compile time
>>>>>>>> dependencies?
>>>>>>>>
>>>>>>>> On Tue, Apr 26, 2022 at 10:42 AM Chesnay Schepler <ches...@apache.org> wrote:
>>>>>>>>
>>>>>>>>> JDBC drivers are well-known for leaking classloaders unfortunately.
>>>>>>>>>
>>>>>>>>> You have correctly identified your alternatives.
>>>>>>>>>
>>>>>>>>> You must put the jdbc driver into /lib instead. Setting only the
>>>>>>>>> parent-first pattern shouldn't affect anything.
>>>>>>>>> That is only relevant if something is both in /lib and the
>>>>>>>>> user-jar, telling Flink to prioritize what is in lib.
>>>>>>>>>
>>>>>>>>> On 26/04/2022 15:35, John Smith wrote:
>>>>>>>>>
>>>>>>>>> So I put classloader.parent-first-patterns.additional:
>>>>>>>>> "org.apache.ignite." in the task config and so far I don't think I'm
>>>>>>>>> getting "java.lang.OutOfMemoryError: Metaspace" any more.
>>>>>>>>>
>>>>>>>>> Or it's too early to tell.
>>>>>>>>>
>>>>>>>>> Though now, the task managers are shutting down due to some
>>>>>>>>> other failures.
>>>>>>>>>
>>>>>>>>> So maybe because tasks were failing and reloading often the task
>>>>>>>>> manager was running out of Metaspace. But now maybe it's just
>>>>>>>>> cleanly shutting down.
>>>>>>>>>
>>>>>>>>> On Wed, Apr 20, 2022 at 11:35 AM John Smith <java.dev....@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>>> Or I can put in the config to treat org.apache.ignite. classes as
>>>>>>>>>> parent-first?
>>>>>>>>>>
>>>>>>>>>> On Tue, Apr 19, 2022 at 10:18 PM John Smith <java.dev....@gmail.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> Ok, so I loaded the dump into Eclipse MAT and followed:
>>>>>>>>>>> https://cwiki.apache.org/confluence/display/FLINK/Debugging+ClassLoader+leaks
>>>>>>>>>>>
>>>>>>>>>>> - On the Histogram, I got over 30 entries for:
>>>>>>>>>>>   ChildFirstClassLoader
>>>>>>>>>>> - Then I clicked on one of them "Merge Shortest Path..." and
>>>>>>>>>>>   picked "Exclude all phantom/weak/soft references"
>>>>>>>>>>> - Which then gave me: SqlDriverManager > Apache Ignite JdbcThin
>>>>>>>>>>>   Driver
>>>>>>>>>>>
>>>>>>>>>>> So I'm guessing that anything JDBC based I should copy into the task
>>>>>>>>>>> manager libs folder, and in my jobs make the dependencies compile
>>>>>>>>>>> only?
>>>>>>>>>>>
>>>>>>>>>>> On Tue, Apr 19, 2022 at 12:18 PM Yaroslav Tkachenko <yaros...@goldsky.io> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Also
>>>>>>>>>>>> https://shopify.engineering/optimizing-apache-flink-applications-tips
>>>>>>>>>>>> might be helpful (has a section on profiling, as well as
>>>>>>>>>>>> classloading).
>>>>>>>>>>>>
>>>>>>>>>>>> On Tue, Apr 19, 2022 at 4:35 AM Chesnay Schepler <ches...@apache.org> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> We have a very rough "guide" in the wiki (it's just the
>>>>>>>>>>>>> specific steps I took to debug another leak):
>>>>>>>>>>>>>
>>>>>>>>>>>>> https://cwiki.apache.org/confluence/display/FLINK/Debugging+ClassLoader+leaks
>>>>>>>>>>>>>
>>>>>>>>>>>>> On 19/04/2022 12:01, huweihua wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>> Hi, John
>>>>>>>>>>>>>
>>>>>>>>>>>>> Sorry for the late reply. You can use MAT[1] to analyze the
>>>>>>>>>>>>> dump file. Check whether there are too many loaded classes.
>>>>>>>>>>>>>
>>>>>>>>>>>>> [1] https://www.eclipse.org/mat/
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Apr 18, 2022, at 9:55 PM, John Smith <java.dev....@gmail.com> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>> Hi, can anyone help with this? I never looked at a dump file
>>>>>>>>>>>>> before.
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Thu, Apr 14, 2022 at 11:59 AM John Smith <java.dev....@gmail.com> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Hi, so I have a dump file. What do I look for?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Thu, Mar 31, 2022 at 3:28 PM John Smith <java.dev....@gmail.com> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Ok so if there's a leak, if I manually stop the job and
>>>>>>>>>>>>>>> restart it from the UI multiple times, I won't see the issue
>>>>>>>>>>>>>>> because the classes are unloaded correctly?
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Thu, Mar 31, 2022 at 9:20 AM huweihua <huweihua....@gmail.com> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> The difference is that manually canceling the job stops the
>>>>>>>>>>>>>>>> JobMaster, but automatic failover keeps the JobMaster running.
>>>>>>>>>>>>>>>> But looking
>>>>>>>>>>>>>>>> at the TaskManager, it doesn't make much difference.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On Mar 31, 2022, at 4:01 AM, John Smith <java.dev....@gmail.com> wrote:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Also if I manually cancel and restart the same job over and
>>>>>>>>>>>>>>>> over is it the same as if flink was restarting a job due to
>>>>>>>>>>>>>>>> failure?
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> I.e.: when I click "Cancel Job" on the UI, is the job
>>>>>>>>>>>>>>>> completely unloaded, vs. when the job scheduler restarts a job
>>>>>>>>>>>>>>>> for whatever reason?
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Like this I'll stop and restart the job a few times or
>>>>>>>>>>>>>>>> maybe I can trick my job to fail and have the scheduler
>>>>>>>>>>>>>>>> restart it. Ok let me think about this...
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On Wed, Mar 30, 2022 at 10:24 AM 胡伟华 <huweihua....@gmail.com> wrote:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> So if I run the same jobs in my dev env will I still be
>>>>>>>>>>>>>>>>> able to see the similar dump?
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> I think running the same job in dev should be
>>>>>>>>>>>>>>>>> reproducible, maybe you can have a try.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> If not I would have to wait at a low volume time to do it
>>>>>>>>>>>>>>>>> on production. Also if I recall the dump is as big as the JVM
>>>>>>>>>>>>>>>>> memory, right? So if I have 10GB configed for the JVM the dump
>>>>>>>>>>>>>>>>> will be a 10GB file?
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Yes, JMAP will pause the JVM; the length of the pause depends on
>>>>>>>>>>>>>>>>> the size to dump.
>>>>>>>>>>>>>>>>> You can use "jmap -dump:live" to dump only
>>>>>>>>>>>>>>>>> the reachable objects; this will only take a brief pause.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> On Mar 30, 2022, at 9:47 PM, John Smith <java.dev....@gmail.com> wrote:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> I have 3 task managers (see config below). There is a total
>>>>>>>>>>>>>>>>> of 10 jobs with 25 slots being used.
>>>>>>>>>>>>>>>>> The jobs are 100% ETL, i.e. they load JSON, transform it
>>>>>>>>>>>>>>>>> and push it to JDBC; only 1 job of the 10 is pushing to an
>>>>>>>>>>>>>>>>> Apache Ignite cluster.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> For JMAP: I know that it will pause the task manager. So
>>>>>>>>>>>>>>>>> if I run the same jobs in my dev env will I still be able to
>>>>>>>>>>>>>>>>> see the similar dump? I assume so. If not I would have to wait at a
>>>>>>>>>>>>>>>>> low volume time to do it on production. Also if I recall the dump is as
>>>>>>>>>>>>>>>>> big as the JVM memory, right? So if I have 10GB configed for the JVM
>>>>>>>>>>>>>>>>> the dump will be a 10GB file?
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> # Operating system has 16GB total.
>>>>>>>>>>>>>>>>> env.ssh.opts: -l flink -oStrictHostKeyChecking=no
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> cluster.evenly-spread-out-slots: true
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> taskmanager.memory.flink.size: 10240m
>>>>>>>>>>>>>>>>> taskmanager.memory.jvm-metaspace.size: 2048m
>>>>>>>>>>>>>>>>> taskmanager.numberOfTaskSlots: 16
>>>>>>>>>>>>>>>>> parallelism.default: 1
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> high-availability: zookeeper
>>>>>>>>>>>>>>>>> high-availability.storageDir: file:///mnt/flink/ha/flink_1_14/
>>>>>>>>>>>>>>>>> high-availability.zookeeper.quorum: ...
>>>>>>>>>>>>>>>>> high-availability.zookeeper.path.root: /flink_1_14
>>>>>>>>>>>>>>>>> high-availability.cluster-id: /flink_1_14_cluster_0001
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> web.upload.dir: /mnt/flink/uploads/flink_1_14
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> state.backend: rocksdb
>>>>>>>>>>>>>>>>> state.backend.incremental: true
>>>>>>>>>>>>>>>>> state.checkpoints.dir: file:///mnt/flink/checkpoints/flink_1_14
>>>>>>>>>>>>>>>>> state.savepoints.dir: file:///mnt/flink/savepoints/flink_1_14
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> On Wed, Mar 30, 2022 at 2:16 AM 胡伟华 <huweihua....@gmail.com> wrote:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Hi, John
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Could you tell us your application scenario? Is it a flink
>>>>>>>>>>>>>>>>>> session cluster with a lot of jobs?
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Maybe you can try to dump the memory with jmap and use
>>>>>>>>>>>>>>>>>> tools such as MAT to analyze whether there are abnormal
>>>>>>>>>>>>>>>>>> classes and classloaders.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> > On Mar 30, 2022, at 6:09 AM, John Smith <java.dev....@gmail.com> wrote:
>>>>>>>>>>>>>>>>>> >
>>>>>>>>>>>>>>>>>> > Hi running 1.14.4
>>>>>>>>>>>>>>>>>> >
>>>>>>>>>>>>>>>>>> > My task manager still fails with
>>>>>>>>>>>>>>>>>> > java.lang.OutOfMemoryError: Metaspace. The metaspace out-of-memory error
>>>>>>>>>>>>>>>>>> > has occurred. This can mean two things: either the job requires a larger
>>>>>>>>>>>>>>>>>> > size of JVM metaspace to load classes or there is a class loading leak.
>>>>>>>>>>>>>>>>>> >
>>>>>>>>>>>>>>>>>> > I have 2GB of metaspace configed
>>>>>>>>>>>>>>>>>> > taskmanager.memory.jvm-metaspace.size: 2048m
>>>>>>>>>>>>>>>>>> >
>>>>>>>>>>>>>>>>>> > But the task nodes still fail.
>>>>>>>>>>>>>>>>>> >
>>>>>>>>>>>>>>>>>> > When looking at the UI metrics, the metaspace starts
>>>>>>>>>>>>>>>>>> > low. Now I see 85% usage. It seems to be a class loading leak at this
>>>>>>>>>>>>>>>>>> > point, how can we debug this issue?
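
For anyone landing on this thread later, here is a rough sketch of the config entry discussed above. It assumes the driver jars have already been copied into Flink's lib/ folder on every job manager and task manager node. According to the replies above, the pattern is only needed when the driver is still bundled in the job jar as well; setting it without the jars in lib/ should not change anything.

```yaml
# flink-conf.yaml, on both job managers and task managers.
# Only relevant when a driver is present in lib/ AND bundled in the user jar:
# it tells Flink to prefer the copy in lib/ for classes matching these prefixes.
# The prefixes are the ones from this thread (Apache Ignite and the Microsoft
# SQL Server JDBC driver); adjust them to the drivers actually in use.
classloader.parent-first-patterns.additional: "org.apache.ignite.;com.microsoft.sqlserver.jdbc."
```

If the job jar declares the drivers as compile-only dependencies, so that they are not bundled, this entry can be skipped and the copy in lib/ is enough.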