The difference is that manually canceling the job stops the JobMaster, whereas automatic failover keeps the JobMaster running. From the TaskManager's side, though, the two cases do not differ much.
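To compare the two cases on a TaskManager, one option is to watch the loaded and unloaded class counts together with Metaspace usage while the job is canceled and restarted several times. Below is only a minimal sketch using the standard java.lang.management API. The class name MetaspaceProbe and its methods are hypothetical and not part of Flink, and the pool name "Metaspace" is what HotSpot JVMs report. The code measures the JVM it runs in, so it is an assumption that you either call it from inside the TaskManager process (for example from an operator in the job) or read the same MXBeans remotely over JMX.

```java
import java.lang.management.ClassLoadingMXBean;
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryPoolMXBean;
import java.lang.management.MemoryUsage;

// Hypothetical helper (not part of Flink): reports class-loading and
// Metaspace figures for the JVM in which it is executed.
class MetaspaceProbe {

    /** Number of classes currently loaded in this JVM. */
    static int loadedClassCount() {
        ClassLoadingMXBean bean = ManagementFactory.getClassLoadingMXBean();
        return bean.getLoadedClassCount();
    }

    /** Total number of classes unloaded since this JVM started. */
    static long unloadedClassCount() {
        ClassLoadingMXBean bean = ManagementFactory.getClassLoadingMXBean();
        return bean.getUnloadedClassCount();
    }

    /**
     * Bytes currently used by the memory pool named "Metaspace",
     * or -1 if this JVM exposes no pool with that name (assumption:
     * HotSpot naming; other JVMs may name the pool differently).
     */
    static long metaspaceUsedBytes() {
        for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
            if ("Metaspace".equals(pool.getName())) {
                MemoryUsage usage = pool.getUsage();
                return usage == null ? -1L : usage.getUsed();
            }
        }
        return -1L;
    }

    /** One-line summary suitable for logging after each restart. */
    static String snapshot() {
        return String.format(
                "loaded=%d unloaded=%d metaspaceUsedBytes=%d",
                loadedClassCount(), unloadedClassCount(), metaspaceUsedBytes());
    }

    public static void main(String[] args) {
        // Print one snapshot; in practice this would be logged periodically.
        System.out.println(snapshot());
    }
}
```

If the loaded class count and Metaspace usage keep growing across restarts while the unloaded count stays flat, that would be consistent with a class loading leak, and a heap dump opened in MAT can then show which classloaders are being retained.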
> On Mar 31, 2022, at 4:01 AM, John Smith <java.dev....@gmail.com> wrote:
>
> Also if I manually cancel and restart the same job over and over is it the
> same as if flink was restarting a job due to failure?
>
> I.e.: When I click "Cancel Job" on the UI is the job completely unloaded vs
> when the job scheduler restarts a job because of whatever reason?
>
> Like this I'll stop and restart the job a few times or maybe I can trick my
> job to fail and have the scheduler restart it. Ok let me think about this...
>
> On Wed, Mar 30, 2022 at 10:24 AM 胡伟华 <huweihua....@gmail.com> wrote:
>> So if I run the same jobs in my dev env will I still be able to see the
>> similar dump?
> I think running the same job in dev should reproduce it, maybe you can
> give it a try.
>
>> If not I would have to wait at a low volume time to do it on production.
>> Also if I recall the dump is as big as the JVM memory right so if I have
>> 10GB configured for the JVM the dump will be a 10GB file?
>
> Yes, jmap will pause the JVM; the length of the pause depends on the size of
> the dump. You can use "jmap -dump:live" to dump only the reachable objects,
> this will take a brief pause
>
>> On Mar 30, 2022, at 9:47 PM, John Smith <java.dev....@gmail.com> wrote:
>>
>> I have 3 task managers (see config below). There is a total of 10 jobs with
>> 25 slots being used.
>> The jobs are 100% ETL, i.e. they load JSON, transform it and push it to JDBC,
>> only 1 job of the 10 is pushing to an Apache Ignite cluster.
>>
>> FOR JMAP. I know that it will pause the task manager. So if I run the same
>> jobs in my dev env will I still be able to see the similar dump? I assume
>> so. If not I would have to wait at a low volume time to do it on production.
>> Also if I recall the dump is as big as the JVM memory right so if I have
>> 10GB configured for the JVM the dump will be a 10GB file?
>>
>>
>> # Operating system has 16GB total.
>> env.ssh.opts: -l flink -oStrictHostKeyChecking=no
>>
>> cluster.evenly-spread-out-slots: true
>>
>> taskmanager.memory.flink.size: 10240m
>> taskmanager.memory.jvm-metaspace.size: 2048m
>> taskmanager.numberOfTaskSlots: 16
>> parallelism.default: 1
>>
>> high-availability: zookeeper
>> high-availability.storageDir: file:///mnt/flink/ha/flink_1_14/
>> high-availability.zookeeper.quorum: ...
>> high-availability.zookeeper.path.root: /flink_1_14
>> high-availability.cluster-id: /flink_1_14_cluster_0001
>>
>> web.upload.dir: /mnt/flink/uploads/flink_1_14
>>
>> state.backend: rocksdb
>> state.backend.incremental: true
>> state.checkpoints.dir: file:///mnt/flink/checkpoints/flink_1_14
>> state.savepoints.dir: file:///mnt/flink/savepoints/flink_1_14
>>
>> On Wed, Mar 30, 2022 at 2:16 AM 胡伟华 <huweihua....@gmail.com> wrote:
>> Hi, John
>>
>> Could you tell us your application scenario? Is it a flink session cluster
>> with a lot of jobs?
>>
>> Maybe you can try to dump the memory with jmap and use tools such as MAT to
>> analyze whether there are abnormal classes and classloaders
>>
>>
>> > On Mar 30, 2022, at 6:09 AM, John Smith <java.dev....@gmail.com> wrote:
>> >
>> > Hi, running 1.14.4
>> >
>> > My task manager still fails with java.lang.OutOfMemoryError: Metaspace.
>> > The metaspace out-of-memory error has occurred. This can mean two things:
>> > either the job requires a larger size of JVM metaspace to load classes or
>> > there is a class loading leak.
>> >
>> > I have 2GB of metaspace configured: taskmanager.memory.jvm-metaspace.size:
>> > 2048m
>> >
>> > But the task nodes still fail.
>> >
>> > When looking at the UI metrics, the metaspace starts low. Now I see 85%
>> > usage. It seems to be a class loading leak at this point, how can we debug
>> > this issue?
>> >