Or can I add something to the config to treat the org.apache.ignite. classes as parent-first?
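
A minimal sketch of what such a setting might look like in flink-conf.yaml, assuming the question refers to Flink's parent-first class loading patterns. It only makes sense if the Ignite JAR is available in the lib folder of every task manager. If the job JAR no longer bundles the Ignite classes at all, they are resolved from the parent class loader anyway, and this line is probably unnecessary. The pattern value is illustrative:

```yaml
# Assumption: the Ignite JDBC JAR has been copied into Flink's lib/ directory.
# Classes whose names start with a listed prefix are resolved by the parent
# class loader first, instead of by each job's child-first class loader.
classloader.parent-first-patterns.additional: org.apache.ignite.
```

Multiple prefixes would be separated by semicolons.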
On Tue, Apr 19, 2022 at 10:18 PM John Smith <java.dev....@gmail.com> wrote:

> Ok, so I loaded the dump into Eclipse MAT and followed:
> https://cwiki.apache.org/confluence/display/FLINK/Debugging+ClassLoader+leaks
>
> - On the Histogram, I got over 30 entries for: ChildFirstClassLoader
> - Then I clicked on one of them "Merge Shortest Path..." and picked
> "Exclude all phantom/weak/soft references"
> - Which then gave me: SqlDriverManager > Apache Ignite JdbcThin Driver
>
> So I'm guessing it's anything JDBC-based. Should I copy the dependencies
> into the task manager libs folder and make them compile-only in my jobs?
>
> On Tue, Apr 19, 2022 at 12:18 PM Yaroslav Tkachenko <yaros...@goldsky.io>
> wrote:
>
>> Also
>> https://shopify.engineering/optimizing-apache-flink-applications-tips
>> might be helpful (has a section on profiling, as well as classloading).
>>
>> On Tue, Apr 19, 2022 at 4:35 AM Chesnay Schepler <ches...@apache.org>
>> wrote:
>>
>>> We have a very rough "guide" in the wiki (it's just the specific steps I
>>> took to debug another leak):
>>>
>>> https://cwiki.apache.org/confluence/display/FLINK/Debugging+ClassLoader+leaks
>>>
>>> On 19/04/2022 12:01, huweihua wrote:
>>>
>>> Hi, John
>>>
>>> Sorry for the late reply. You can use MAT[1] to analyze the dump file.
>>> Check whether there are too many loaded classes.
>>>
>>> [1] https://www.eclipse.org/mat/
>>>
>>> On Apr 18, 2022, at 9:55 PM, John Smith <java.dev....@gmail.com> wrote:
>>>
>>> Hi, can anyone help with this? I never looked at a dump file before.
>>>
>>> On Thu, Apr 14, 2022 at 11:59 AM John Smith <java.dev....@gmail.com>
>>> wrote:
>>>
>>>> Hi, so I have a dump file. What do I look for?
>>>>
>>>> On Thu, Mar 31, 2022 at 3:28 PM John Smith <java.dev....@gmail.com>
>>>> wrote:
>>>>
>>>>> Ok so if there's a leak, if I manually stop the job and restart it
>>>>> from the UI multiple times, I won't see the issue because the
>>>>> classes are unloaded correctly?
>>>>>
>>>>>
>>>>> On Thu, Mar 31, 2022 at 9:20 AM huweihua <huweihua....@gmail.com>
>>>>> wrote:
>>>>>
>>>>>>
>>>>>> The difference is that manually canceling the job stops the
>>>>>> JobMaster, but automatic failover keeps the JobMaster running. But
>>>>>> looking at the TaskManager, it doesn't make much difference.
>>>>>>
>>>>>>
>>>>>> On Mar 31, 2022, at 4:01 AM, John Smith <java.dev....@gmail.com> wrote:
>>>>>>
>>>>>> Also if I manually cancel and restart the same job over and over is
>>>>>> it the same as if flink was restarting a job due to failure?
>>>>>>
>>>>>> I.e: When I click "Cancel Job" on the UI is the job completely
>>>>>> unloaded vs when the job scheduler restarts a job because of whatever
>>>>>> reason?
>>>>>>
>>>>>> Like this I'll stop and restart the job a few times or maybe I can
>>>>>> trick my job to fail and have the scheduler restart it. Ok let me think
>>>>>> about this...
>>>>>>
>>>>>> On Wed, Mar 30, 2022 at 10:24 AM 胡伟华 <huweihua....@gmail.com> wrote:
>>>>>>
>>>>>>> So if I run the same jobs in my dev env will I still be able to see
>>>>>>> the similar dump?
>>>>>>>
>>>>>>> I think running the same job in dev should be reproducible, maybe
>>>>>>> you can have a try.
>>>>>>>
>>>>>>> If not I would have to wait at a low volume time to do it on
>>>>>>> production. Also if I recall the dump is as big as the JVM memory right
>>>>>>> so if I have 10GB configed for the JVM the dump will be 10GB file?
>>>>>>>
>>>>>>> Yes, jmap will pause the JVM; the length of the pause depends on the
>>>>>>> size of the dump. You can use "jmap -dump:live" to dump only the
>>>>>>> reachable objects, which will take only a brief pause.
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On Mar 30, 2022, at 9:47 PM, John Smith <java.dev....@gmail.com> wrote:
>>>>>>>
>>>>>>> I have 3 task managers (see config below). There is a total of 10 jobs
>>>>>>> with 25 slots being used.
>>>>>>> The jobs are 100% ETL, i.e. they load JSON, transform it and push it
>>>>>>> to JDBC; only 1 job of the 10 is pushing to an Apache Ignite cluster.
>>>>>>>
>>>>>>> For jmap: I know that it will pause the task manager. So if I run
>>>>>>> the same jobs in my dev env will I still be able to see the similar
>>>>>>> dump? I assume so. If not I would have to wait at a low volume time to
>>>>>>> do it on production. Also if I recall the dump is as big as the JVM
>>>>>>> memory right so if I have 10GB configed for the JVM the dump will be
>>>>>>> 10GB file?
>>>>>>>
>>>>>>>
>>>>>>> # Operating system has 16GB total.
>>>>>>> env.ssh.opts: -l flink -oStrictHostKeyChecking=no
>>>>>>>
>>>>>>> cluster.evenly-spread-out-slots: true
>>>>>>>
>>>>>>> taskmanager.memory.flink.size: 10240m
>>>>>>> taskmanager.memory.jvm-metaspace.size: 2048m
>>>>>>> taskmanager.numberOfTaskSlots: 16
>>>>>>> parallelism.default: 1
>>>>>>>
>>>>>>> high-availability: zookeeper
>>>>>>> high-availability.storageDir: file:///mnt/flink/ha/flink_1_14/
>>>>>>> high-availability.zookeeper.quorum: ...
>>>>>>> high-availability.zookeeper.path.root: /flink_1_14
>>>>>>> high-availability.cluster-id: /flink_1_14_cluster_0001
>>>>>>>
>>>>>>> web.upload.dir: /mnt/flink/uploads/flink_1_14
>>>>>>>
>>>>>>> state.backend: rocksdb
>>>>>>> state.backend.incremental: true
>>>>>>> state.checkpoints.dir: file:///mnt/flink/checkpoints/flink_1_14
>>>>>>> state.savepoints.dir: file:///mnt/flink/savepoints/flink_1_14
>>>>>>>
>>>>>>> On Wed, Mar 30, 2022 at 2:16 AM 胡伟华 <huweihua....@gmail.com> wrote:
>>>>>>>
>>>>>>>> Hi, John
>>>>>>>>
>>>>>>>> Could you tell us your application scenario? Is it a flink session
>>>>>>>> cluster with a lot of jobs?
>>>>>>>>
>>>>>>>> Maybe you can try to dump the memory with jmap and use tools such
>>>>>>>> as MAT to analyze whether there are abnormal classes and classloaders
>>>>>>>>
>>>>>>>>
>>>>>>>> > On Mar 30, 2022, at 6:09 AM, John Smith <java.dev....@gmail.com> wrote:
>>>>>>>> >
>>>>>>>> > Hi running 1.14.4
>>>>>>>> >
>>>>>>>> > My task manager still fails with java.lang.OutOfMemoryError:
>>>>>>>> > Metaspace. The metaspace out-of-memory error has occurred. This can
>>>>>>>> > mean two things: either the job requires a larger size of JVM
>>>>>>>> > metaspace to load classes or there is a class loading leak.
>>>>>>>> >
>>>>>>>> > I have 2GB of metaspace configed
>>>>>>>> > taskmanager.memory.jvm-metaspace.size: 2048m
>>>>>>>> >
>>>>>>>> > But the task nodes still fail.
>>>>>>>> >
>>>>>>>> > When looking at the UI metrics, the metaspace starts low. Now I
>>>>>>>> > see 85% usage. It seems to be a class loading leak at this point,
>>>>>>>> > how can we debug this issue?
>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>
>>>