Hi, so I have a dump file. What do I look for?

On Thu, Mar 31, 2022 at 3:28 PM John Smith <java.dev....@gmail.com> wrote:
> Ok, so if there's a leak and I manually stop the job and restart it from
> the UI multiple times, I won't see the issue because the classes
> are unloaded correctly?
>
>
> On Thu, Mar 31, 2022 at 9:20 AM huweihua <huweihua....@gmail.com> wrote:
>
>>
>> The difference is that manually canceling the job stops the JobMaster,
>> but automatic failover keeps the JobMaster running. Looking at the
>> TaskManager, though, it doesn't make much difference.
>>
>>
>> On Mar 31, 2022, at 4:01 AM, John Smith <java.dev....@gmail.com> wrote:
>>
>> Also, if I manually cancel and restart the same job over and over, is it
>> the same as if Flink was restarting a job due to failure?
>>
>> I.e., when I click "Cancel Job" on the UI, is the job completely unloaded,
>> vs. when the job scheduler restarts a job for whatever reason?
>>
>> Like this I'll stop and restart the job a few times, or maybe I can trick
>> my job to fail and have the scheduler restart it. Ok, let me think about
>> this...
>>
>> On Wed, Mar 30, 2022 at 10:24 AM 胡伟华 <huweihua....@gmail.com> wrote:
>>
>>> So if I run the same jobs in my dev env will I still be able to see the
>>> similar dump?
>>>
>>> I think running the same job in dev should reproduce it; maybe you
>>> can give it a try.
>>>
>>> If not I would have to wait for a low-volume time to do it on
>>> production. Also, if I recall, the dump is as big as the JVM memory, right? So
>>> if I have 10GB configured for the JVM the dump will be a 10GB file?
>>>
>>> Yes, jmap will pause the JVM, and the length of the pause depends on the size of the
>>> dump. You can use "jmap -dump:live" to dump only the reachable objects;
>>> this will only cause a brief pause.
>>>
>>>
>>>
>>> On Mar 30, 2022, at 9:47 PM, John Smith <java.dev....@gmail.com> wrote:
>>>
>>> I have 3 task managers (see config below). There is a total of 10 jobs
>>> with 25 slots being used.
>>> The jobs are 100% ETL, i.e., they load JSON, transform it and push it to
>>> JDBC; only 1 job of the 10 is pushing to an Apache Ignite cluster.
>>>
>>> FOR JMAP. I know that it will pause the task manager. So if I run the
>>> same jobs in my dev env will I still be able to see the similar dump? I
>>> assume so. If not I would have to wait for a low-volume time to do it on
>>> production. Also, if I recall, the dump is as big as the JVM memory, right? So
>>> if I have 10GB configured for the JVM the dump will be a 10GB file?
>>>
>>>
>>> # Operating system has 16GB total.
>>> env.ssh.opts: -l flink -oStrictHostKeyChecking=no
>>>
>>> cluster.evenly-spread-out-slots: true
>>>
>>> taskmanager.memory.flink.size: 10240m
>>> taskmanager.memory.jvm-metaspace.size: 2048m
>>> taskmanager.numberOfTaskSlots: 16
>>> parallelism.default: 1
>>>
>>> high-availability: zookeeper
>>> high-availability.storageDir: file:///mnt/flink/ha/flink_1_14/
>>> high-availability.zookeeper.quorum: ...
>>> high-availability.zookeeper.path.root: /flink_1_14
>>> high-availability.cluster-id: /flink_1_14_cluster_0001
>>>
>>> web.upload.dir: /mnt/flink/uploads/flink_1_14
>>>
>>> state.backend: rocksdb
>>> state.backend.incremental: true
>>> state.checkpoints.dir: file:///mnt/flink/checkpoints/flink_1_14
>>> state.savepoints.dir: file:///mnt/flink/savepoints/flink_1_14
>>>
>>> On Wed, Mar 30, 2022 at 2:16 AM 胡伟华 <huweihua....@gmail.com> wrote:
>>>
>>>> Hi, John
>>>>
>>>> Could you tell us your application scenario? Is it a Flink session
>>>> cluster with a lot of jobs?
>>>>
>>>> Maybe you can try to dump the memory with jmap and use tools such as
>>>> MAT to analyze whether there are abnormal classes and classloaders.
>>>>
>>>>
>>>> > On Mar 30, 2022, at 6:09 AM, John Smith <java.dev....@gmail.com> wrote:
>>>> >
>>>> > Hi, running 1.14.4
>>>> >
>>>> > My task manager still fails with java.lang.OutOfMemoryError:
>>>> > Metaspace. The metaspace out-of-memory error has occurred. This can mean
>>>> > two things: either the job requires a larger size of JVM metaspace to load
>>>> > classes or there is a class loading leak.
>>>> >
>>>> > I have 2GB of metaspace configured:
>>>> > taskmanager.memory.jvm-metaspace.size: 2048m
>>>> >
>>>> > But the task nodes still fail.
>>>> >
>>>> > When looking at the UI metrics, the metaspace starts low. Now I see
>>>> > 85% usage. It seems to be a class loading leak at this point. How can we
>>>> > debug this issue?
>>>>
>>>>
>>>
>>
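
On the dump-size question in the thread: the 10240m in the config is taskmanager.memory.flink.size, which covers more than the JVM heap (managed memory for RocksDB and the network buffers live off-heap), so the heap, and with it a heap dump, should come out well under 10GB. Below is a rough sketch of that arithmetic in Python. It assumes the Flink 1.14 default fractions and fixed sizes as I recall them (0.4 managed, 0.1 network clamped to 64m..1024m, 128m framework heap, 128m framework off-heap, 0 task off-heap). None of those defaults appear in the thread, and estimate_heap_mb is a made-up helper name, so treat the result as an estimate and check it against what your own TaskManager reports.

```python
# Rough estimate of the TaskManager JVM heap from taskmanager.memory.flink.size.
# ASSUMPTION: the constants below are Flink 1.14 defaults recalled from memory;
# they are not stated in the thread. Override them if your config sets them.

MANAGED_FRACTION = 0.4        # assumed taskmanager.memory.managed.fraction
NETWORK_FRACTION = 0.1        # assumed taskmanager.memory.network.fraction
NETWORK_MIN_MB = 64           # assumed taskmanager.memory.network.min
NETWORK_MAX_MB = 1024         # assumed taskmanager.memory.network.max
FRAMEWORK_OFF_HEAP_MB = 128   # assumed taskmanager.memory.framework.off-heap.size
TASK_OFF_HEAP_MB = 0          # assumed taskmanager.memory.task.off-heap.size


def estimate_heap_mb(flink_size_mb: int) -> int:
    """Return the estimated JVM heap (framework heap + task heap) in MB."""
    managed = round(flink_size_mb * MANAGED_FRACTION)
    network = round(flink_size_mb * NETWORK_FRACTION)
    # The network share is clamped between its configured min and max.
    network = max(NETWORK_MIN_MB, min(NETWORK_MAX_MB, network))
    off_heap = managed + network + FRAMEWORK_OFF_HEAP_MB + TASK_OFF_HEAP_MB
    # Whatever is left of the Flink memory budget is JVM heap.
    return flink_size_mb - off_heap


if __name__ == "__main__":
    # 10240 is taskmanager.memory.flink.size from the config quoted in the thread.
    print(f"estimated heap: {estimate_heap_mb(10240)} MB")
```

With the value from the thread and the assumed defaults, this gives 4992 MB of heap, so a "jmap -dump:live" file would be expected to be in that region or smaller, since only reachable objects are written.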