Also https://shopify.engineering/optimizing-apache-flink-applications-tips might be helpful (has a section on profiling, as well as classloading).
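
On the jmap suggestion further down the thread, here is a minimal sketch of how the dump command could be assembled. It only builds the argument list; the PID 12345 and the path /tmp/taskmanager.hprof are placeholders I made up, not values from this cluster. The `live`, `format=b` and `file=` sub-options are the standard `jmap -dump` options.

```python
def build_jmap_command(pid, dump_path, live_only=True):
    """Return the argv list for a jmap heap dump of the given JVM process."""
    options = ["format=b", f"file={dump_path}"]
    if live_only:
        # "live" restricts the dump to reachable objects; the JVM runs a
        # full GC first, so unreachable objects are not written out.
        options.insert(0, "live")
    return ["jmap", "-dump:" + ",".join(options), str(pid)]


# Hypothetical PID and output path, for illustration only.
print(" ".join(build_jmap_command(12345, "/tmp/taskmanager.hprof")))
```

The returned list can be passed to `subprocess.run(cmd, check=True)` on the TaskManager host, run as the same OS user that owns the TaskManager process. The resulting .hprof file is what you would then open in MAT.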
On Tue, Apr 19, 2022 at 4:35 AM Chesnay Schepler <ches...@apache.org> wrote:
> We have a very rough "guide" in the wiki (it's just the specific steps I
> took to debug another leak):
>
> https://cwiki.apache.org/confluence/display/FLINK/Debugging+ClassLoader+leaks
>
> On 19/04/2022 12:01, huweihua wrote:
>
> Hi, John
>
> Sorry for the late reply. You can use MAT[1] to analyze the dump file.
> Check whether there are too many loaded classes.
>
> [1] https://www.eclipse.org/mat/
>
> On Apr 18, 2022, at 9:55 PM, John Smith <java.dev....@gmail.com> wrote:
>
> Hi, can anyone help with this? I never looked at a dump file before.
>
> On Thu, Apr 14, 2022 at 11:59 AM John Smith <java.dev....@gmail.com>
> wrote:
>
>> Hi, so I have a dump file. What do I look for?
>>
>> On Thu, Mar 31, 2022 at 3:28 PM John Smith <java.dev....@gmail.com>
>> wrote:
>>
>>> Ok so if there's a leak, if I manually stop the job and restart it from
>>> the UI multiple times, I won't see the issue because the classes
>>> are unloaded correctly?
>>>
>>>
>>> On Thu, Mar 31, 2022 at 9:20 AM huweihua <huweihua....@gmail.com> wrote:
>>>
>>>>
>>>> The difference is that manually canceling the job stops the JobMaster,
>>>> but automatic failover keeps the JobMaster running. From the
>>>> TaskManager's point of view, though, it doesn't make much difference.
>>>>
>>>>
>>>> On Mar 31, 2022, at 4:01 AM, John Smith <java.dev....@gmail.com> wrote:
>>>>
>>>> Also if I manually cancel and restart the same job over and over is it
>>>> the same as if flink was restarting a job due to failure?
>>>>
>>>> I.e.: When I click "Cancel Job" on the UI is the job completely unloaded
>>>> vs when the job scheduler restarts a job for whatever reason?
>>>>
>>>> Like this I'll stop and restart the job a few times or maybe I can
>>>> trick my job to fail and have the scheduler restart it. Ok let me think
>>>> about this...
>>>>
>>>> On Wed, Mar 30, 2022 at 10:24 AM 胡伟华 <huweihua....@gmail.com> wrote:
>>>>
>>>>> So if I run the same jobs in my dev env will I still be able to see
>>>>> the similar dump?
>>>>>
>>>>> I think running the same job in dev should reproduce it; maybe you
>>>>> can give it a try.
>>>>>
>>>>> If not I would have to wait at a low volume time to do it on
>>>>> production. Also if I recall the dump is as big as the JVM memory right so
>>>>> if I have 10GB configured for the JVM the dump will be 10GB file?
>>>>>
>>>>> Yes, jmap will pause the JVM; the length of the pause depends on the
>>>>> size of the dump. You can use "jmap -dump:live" to dump only the
>>>>> reachable objects; this will take only a brief pause.
>>>>>
>>>>>
>>>>>
>>>>> On Mar 30, 2022, at 9:47 PM, John Smith <java.dev....@gmail.com> wrote:
>>>>>
>>>>> I have 3 task managers (see config below). There is a total of 10 jobs
>>>>> with 25 slots being used.
>>>>> The jobs are 100% ETL, i.e. they load JSON, transform it and push it to
>>>>> JDBC; only 1 job of the 10 is pushing to an Apache Ignite cluster.
>>>>>
>>>>> FOR JMAP. I know that it will pause the task manager. So if I run the
>>>>> same jobs in my dev env will I still be able to see the similar dump? I
>>>>> assume so. If not I would have to wait at a low volume time to do it on
>>>>> production. Also if I recall the dump is as big as the JVM memory right so
>>>>> if I have 10GB configured for the JVM the dump will be 10GB file?
>>>>>
>>>>>
>>>>> # Operating system has 16GB total.
>>>>> env.ssh.opts: -l flink -oStrictHostKeyChecking=no
>>>>>
>>>>> cluster.evenly-spread-out-slots: true
>>>>>
>>>>> taskmanager.memory.flink.size: 10240m
>>>>> taskmanager.memory.jvm-metaspace.size: 2048m
>>>>> taskmanager.numberOfTaskSlots: 16
>>>>> parallelism.default: 1
>>>>>
>>>>> high-availability: zookeeper
>>>>> high-availability.storageDir: file:///mnt/flink/ha/flink_1_14/
>>>>> high-availability.zookeeper.quorum: ...
>>>>> high-availability.zookeeper.path.root: /flink_1_14
>>>>> high-availability.cluster-id: /flink_1_14_cluster_0001
>>>>>
>>>>> web.upload.dir: /mnt/flink/uploads/flink_1_14
>>>>>
>>>>> state.backend: rocksdb
>>>>> state.backend.incremental: true
>>>>> state.checkpoints.dir: file:///mnt/flink/checkpoints/flink_1_14
>>>>> state.savepoints.dir: file:///mnt/flink/savepoints/flink_1_14
>>>>>
>>>>> On Wed, Mar 30, 2022 at 2:16 AM 胡伟华 <huweihua....@gmail.com> wrote:
>>>>>
>>>>>> Hi, John
>>>>>>
>>>>>> Could you tell us your application scenario? Is it a Flink session
>>>>>> cluster with a lot of jobs?
>>>>>>
>>>>>> Maybe you can try to dump the memory with jmap and use tools such as
>>>>>> MAT to analyze whether there are abnormal classes and classloaders.
>>>>>>
>>>>>>
>>>>>> > On Mar 30, 2022, at 6:09 AM, John Smith <java.dev....@gmail.com> wrote:
>>>>>> >
>>>>>> > Hi, running 1.14.4
>>>>>> >
>>>>>> > My task manager still fails with java.lang.OutOfMemoryError:
>>>>>> > Metaspace. The metaspace out-of-memory error has occurred. This can mean
>>>>>> > two things: either the job requires a larger size of JVM metaspace to
>>>>>> > load classes or there is a class loading leak.
>>>>>> >
>>>>>> > I have 2GB of metaspace configured:
>>>>>> > taskmanager.memory.jvm-metaspace.size: 2048m
>>>>>> >
>>>>>> > But the task nodes still fail.
>>>>>> >
>>>>>> > When looking at the UI metrics, the metaspace starts low. Now I see
>>>>>> > 85% usage. It seems to be a class loading leak at this point, how can we
>>>>>> > debug this issue?
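
On the dump-size question above: a heap dump covers the Java heap, and the heap is only part of what taskmanager.memory.flink.size reserves (managed memory for RocksDB, network buffers and metaspace are outside it). The sketch below is a rough estimate of the heap share for the quoted config. Only the 10240m value comes from the thread; the fractions and fixed sizes are what I believe the Flink 1.14 defaults to be, so treat them as assumptions and check them against the memory configuration docs for your version.

```python
# From the flink-conf.yaml quoted in this thread (MiB).
FLINK_SIZE_MB = 10240            # taskmanager.memory.flink.size

# ASSUMED defaults (not stated in the thread; verify for your Flink version).
MANAGED_FRACTION = 0.4           # taskmanager.memory.managed.fraction
NETWORK_FRACTION = 0.1           # taskmanager.memory.network.fraction
NETWORK_MIN_MB, NETWORK_MAX_MB = 64, 1024
FRAMEWORK_OFF_HEAP_MB = 128      # taskmanager.memory.framework.off-heap.size
TASK_OFF_HEAP_MB = 0             # taskmanager.memory.task.off-heap.size


def clamp(value, low, high):
    """Limit value to the inclusive range [low, high]."""
    return max(low, min(value, high))


def estimate_heap_mb(flink_size_mb):
    """Estimate the JVM heap (framework + task heap) in MiB."""
    managed = round(flink_size_mb * MANAGED_FRACTION)
    network = clamp(round(flink_size_mb * NETWORK_FRACTION),
                    NETWORK_MIN_MB, NETWORK_MAX_MB)
    off_heap = managed + network + FRAMEWORK_OFF_HEAP_MB + TASK_OFF_HEAP_MB
    return flink_size_mb - off_heap


print("estimated max heap (MiB):", estimate_heap_mb(FLINK_SIZE_MB))
```

Under those assumptions the heap is capped at roughly half of the 10 GB, and a dump taken with -dump:live should be smaller again, since it only contains reachable objects.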