We have a very rough "guide" in the wiki (it's just the specific steps I
took to debug another leak):
https://cwiki.apache.org/confluence/display/FLINK/Debugging+ClassLoader+leaks
On 19/04/2022 12:01, huweihua wrote:
Hi, John
Sorry for the late reply. You can use MAT [1] to analyze the dump file and
check whether there are too many loaded classes.
[1] https://www.eclipse.org/mat/
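As a rough sketch of what to look for in MAT (these steps are my own suggestion, not an official recipe): open the .hprof, switch to the OQL view, and list all classloader instances, e.g.

  SELECT * FROM INSTANCEOF java.lang.ClassLoader

If many instances of Flink's user-code classloader (named something like ChildFirstClassLoader, if I remember correctly) from already-cancelled jobs are still alive, right-click one of them and use "Path To GC Roots" > "exclude all phantom/weak/soft etc. references" to see what is still holding on to it.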
On April 18, 2022, at 9:55 PM, John Smith <java.dev....@gmail.com> wrote:
Hi, can anyone help with this? I never looked at a dump file before.
On Thu, Apr 14, 2022 at 11:59 AM John Smith <java.dev....@gmail.com>
wrote:
Hi, so I have a dump file. What do I look for?
On Thu, Mar 31, 2022 at 3:28 PM John Smith
<java.dev....@gmail.com> wrote:
Ok, so if there's a leak and I manually stop the job and
restart it from the UI multiple times, I won't see the issue
because the classes are unloaded correctly?
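One lightweight way to check this (just a suggestion, assuming you can attach to the TaskManager JVM; <tm-pid> is a placeholder for its process id) is to watch the loaded-class count and metaspace usage with jstat while cancelling and resubmitting the job a few times:

  jstat -class <tm-pid> 10000   # loaded/unloaded class counts, sampled every 10s
  jstat -gc <tm-pid> 10000      # GC columns, including MC/MU (metaspace capacity/used)

If the loaded-class count keeps climbing after each cancel/resubmit cycle and never drops back after a full GC, the old classloaders are not being collected; if it comes back down, the classes are being unloaded correctly.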
On Thu, Mar 31, 2022 at 9:20 AM huweihua
<huweihua....@gmail.com> wrote:
The difference is that manually canceling the job stops
the JobMaster, but automatic failover keeps the JobMaster
running. From the TaskManager's point of view, though, it
doesn't make much difference.
On March 31, 2022, at 4:01 AM, John Smith <java.dev....@gmail.com>
wrote:
Also, if I manually cancel and restart the same job over
and over, is it the same as if Flink was restarting a job
due to failure?
I.e.: when I click "Cancel Job" on the UI, is the job
completely unloaded, vs. when the job scheduler restarts a
job for whatever reason?
Like this I'll stop and restart the job a few times, or
maybe I can trick my job into failing and have the scheduler
restart it. Ok, let me think about this...
On Wed, Mar 30, 2022 at 10:24 AM 胡伟华
<huweihua....@gmail.com> wrote:
> So if I run the same jobs in my dev env, will I
> still be able to see a similar dump?

I think running the same job in dev should reproduce it;
maybe you can give it a try.

> If not, I would have to wait for a low-volume time
> to do it on production. Also, if I recall, the dump
> is as big as the JVM memory, right? So if I have 10GB
> configured for the JVM, the dump will be a 10GB file?
Yes, jmap will pause the JVM; the length of the pause
depends on the size of the dump. You can use "jmap
-dump:live" to dump only the reachable objects, which
keeps the pause brief.
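Concretely, a dump command would look something like this (the file path and <tm-pid> are placeholders):

  jmap -dump:live,format=b,file=/tmp/taskmanager.hprof <tm-pid>

The live option forces a full GC first and only writes reachable objects, so the file is usually smaller than the configured heap, but it's safest to plan for up to the heap size in free disk space.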
On March 30, 2022, at 9:47 PM, John Smith
<java.dev....@gmail.com> wrote:
I have 3 task managers (see config below). There is
a total of 10 jobs with 25 slots being used.
The jobs are 100% ETL, i.e. they load JSON,
transform it, and push it to JDBC; only 1 job of the
10 pushes to an Apache Ignite cluster.
For jmap: I know that it will pause the task
manager. So if I run the same jobs in my dev env,
will I still be able to see a similar dump? I
assume so. If not, I would have to wait for a low-volume
time to do it on production. Also, if I
recall, the dump is as big as the JVM memory, right?
So if I have 10GB configured for the JVM, the dump
will be a 10GB file?
# Operating system has 16GB total.
env.ssh.opts: -l flink -oStrictHostKeyChecking=no
cluster.evenly-spread-out-slots: true
taskmanager.memory.flink.size: 10240m
taskmanager.memory.jvm-metaspace.size: 2048m
taskmanager.numberOfTaskSlots: 16
parallelism.default: 1
high-availability: zookeeper
high-availability.storageDir: file:///mnt/flink/ha/flink_1_14/
high-availability.zookeeper.quorum: ...
high-availability.zookeeper.path.root: /flink_1_14
high-availability.cluster-id: /flink_1_14_cluster_0001
web.upload.dir: /mnt/flink/uploads/flink_1_14
state.backend: rocksdb
state.backend.incremental: true
state.checkpoints.dir: file:///mnt/flink/checkpoints/flink_1_14
state.savepoints.dir: file:///mnt/flink/savepoints/flink_1_14
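(A side note on the sizing above, my own back-of-the-envelope math rather than anything from the thread: with taskmanager.memory.flink.size = 10240m, jvm-metaspace = 2048m, and Flink's default JVM overhead, which is capped at 1g if I recall correctly, the total process size works out to roughly 10240m + 2048m + 1024m ≈ 13 GB. That fits in the 16 GB machine, but leaves limited headroom for the OS and page cache.)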
On Wed, Mar 30, 2022 at 2:16 AM 胡伟华
<huweihua....@gmail.com> wrote:
Hi, John
Could you tell us your application scenario? Is
it a Flink session cluster with a lot of jobs?
Maybe you can try dumping the memory with jmap
and use a tool such as MAT to analyze whether
there are abnormal classes and classloaders.
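If a full heap dump is too disruptive to start with, a cheaper first check (exact behaviour depends a bit on the JDK version) is the classloader statistics view, which prints the number of classes loaded per classloader:

  jmap -clstats <tm-pid>

A long tail of dead or duplicate user-code classloaders, each holding thousands of classes, usually points at a classloader/metaspace leak. (<tm-pid> is a placeholder for the TaskManager process id.)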
> On March 30, 2022, at 6:09 AM, John Smith <java.dev....@gmail.com> wrote:
>
> Hi, running 1.14.4.
>
> My task managers still fail with java.lang.OutOfMemoryError: Metaspace. The metaspace out-of-memory error has occurred. This can mean two things: either the job requires a larger size of JVM metaspace to load classes or there is a class loading leak.
>
> I have 2GB of metaspace configured: taskmanager.memory.jvm-metaspace.size: 2048m
>
> But the task nodes still fail.
>
> When looking at the UI metrics, the metaspace starts low. Now I see 85% usage. It seems to be a class loading leak at this point; how can we debug this issue?
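For what it's worth, one common cause of this exact pattern (metaspace climbing across automatic job restarts) is a class from the per-job user-code classloader being registered with something that outlives the job: a JDBC driver in java.sql.DriverManager, a ThreadLocal, a shutdown hook, and so on. Since some of these jobs write over JDBC, here is a minimal, hypothetical sketch (class and method names are mine, not from the thread) of a cleanup that could be called from a sink's close() so DriverManager does not pin the classloader:

import java.sql.Driver;
import java.sql.DriverManager;
import java.sql.SQLException;
import java.util.Enumeration;

public final class JdbcDriverCleanup {

    // Deregister any JDBC drivers that were loaded by the job's user-code
    // classloader, so DriverManager (loaded by the system classloader) does
    // not keep the whole classloader, and its metaspace, alive after cancel.
    public static void deregisterDriversLoadedBy(ClassLoader userCodeLoader) {
        Enumeration<Driver> drivers = DriverManager.getDrivers();
        while (drivers.hasMoreElements()) {
            Driver driver = drivers.nextElement();
            if (driver.getClass().getClassLoader() == userCodeLoader) {
                try {
                    DriverManager.deregisterDriver(driver);
                } catch (SQLException e) {
                    // Best effort: report and continue with the remaining drivers.
                    System.err.println("Could not deregister " + driver + ": " + e);
                }
            }
        }
    }

    private JdbcDriverCleanup() {}
}

Alternatively, shipping the JDBC driver jar in Flink's lib/ directory, so it is loaded by the parent classloader instead of the per-job one, sidesteps the problem; the MAT "Path To GC Roots" view should show whether DriverManager (or something else entirely) is what keeps the old classloaders alive.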