JDBC drivers are unfortunately well-known for leaking classloaders.
You have correctly identified your alternatives.
You must put the JDBC driver into /lib instead. Setting only the
parent-first pattern shouldn't affect anything; that is only relevant
if something is in both /lib and the user jar, telling Flink to
prioritize what is in /lib.
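The usual mechanism, sketched here with a hypothetical no-op driver (not the actual Ignite one): java.sql.DriverManager is loaded by the system classloader and keeps a strong reference to every registered driver, so a driver class loaded by Flink's ChildFirstClassLoader pins that loader (and everything it loaded) until the driver is deregistered:

```java
import java.sql.Driver;
import java.sql.DriverManager;
import java.sql.DriverPropertyInfo;
import java.sql.SQLException;
import java.sql.SQLFeatureNotSupportedException;
import java.util.Collections;
import java.util.Properties;
import java.util.logging.Logger;

// Hypothetical stand-in for a real JDBC driver class.
class DummyDriver implements Driver {
    public java.sql.Connection connect(String url, Properties info) { return null; }
    public boolean acceptsURL(String url) { return url.startsWith("jdbc:dummy:"); }
    public DriverPropertyInfo[] getPropertyInfo(String url, Properties info) { return new DriverPropertyInfo[0]; }
    public int getMajorVersion() { return 1; }
    public int getMinorVersion() { return 0; }
    public boolean jdbcCompliant() { return false; }
    public Logger getParentLogger() throws SQLFeatureNotSupportedException { throw new SQLFeatureNotSupportedException(); }
}

public class DriverLeakDemo {
    public static void main(String[] args) throws SQLException {
        Driver d = new DummyDriver();
        // registerDriver stores the instance in a JVM-global list inside
        // DriverManager; that reference outlives the job's classloader.
        DriverManager.registerDriver(d);
        boolean visible = Collections.list(DriverManager.getDrivers()).contains(d);
        // Only an explicit deregisterDriver releases the reference and
        // lets the classloader be collected.
        DriverManager.deregisterDriver(d);
        boolean stillVisible = Collections.list(DriverManager.getDrivers()).contains(d);
        System.out.println(visible + " " + stillVisible); // prints: true false
    }
}
```

This is also why moving the driver to /lib helps: it is then loaded once by the parent classloader, so the reference held by DriverManager no longer pins a per-job classloader.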
On 26/04/2022 15:35, John Smith wrote:
So I put classloader.parent-first-patterns.additional:
"org.apache.ignite." in the task config and so far I don't think I'm
getting "java.lang.OutOfMemoryError: Metaspace" any more.
Or it's too early to tell.
Though now the task managers are shutting down due to some
other failures.
So maybe the task manager was running out of Metaspace because
tasks were failing and reloading often. But now maybe it's just
shutting down cleanly.
On Wed, Apr 20, 2022 at 11:35 AM John Smith <java.dev....@gmail.com>
wrote:
Or can I put something in the config to treat org.apache.ignite.
classes as parent-first?
On Tue, Apr 19, 2022 at 10:18 PM John Smith
<java.dev....@gmail.com> wrote:
Ok, so I loaded the dump into Eclipse MAT and followed:
https://cwiki.apache.org/confluence/display/FLINK/Debugging+ClassLoader+leaks
- On the Histogram, I got over 30 entries for:
ChildFirstClassLoader
- Then I clicked on one of them "Merge Shortest Path..." and
picked "Exclude all phantom/weak/soft references"
- Which then gave me: SqlDriverManager > Apache Ignite
JdbcThin Driver
So I'm guessing it's anything JDBC-based. Should I copy the
driver into the task manager's lib folder and mark the
dependency as compile-only in my jobs?
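Something like this is what I mean, assuming a Maven build (the artifact coordinates are my guess for the jar that ships the Ignite thin driver; adjust as needed):

```xml
<!-- Hypothetical: "provided" keeps the driver out of the job's fat jar;
     the same jar is copied into the Flink distribution's lib/ directory. -->
<dependency>
    <groupId>org.apache.ignite</groupId>
    <artifactId>ignite-core</artifactId>
    <version>${ignite.version}</version>
    <scope>provided</scope>
</dependency>
```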
On Tue, Apr 19, 2022 at 12:18 PM Yaroslav Tkachenko
<yaros...@goldsky.io> wrote:
Also
https://shopify.engineering/optimizing-apache-flink-applications-tips
might be helpful (has a section on profiling, as well as
classloading).
On Tue, Apr 19, 2022 at 4:35 AM Chesnay Schepler
<ches...@apache.org> wrote:
We have a very rough "guide" in the wiki (it's just
the specific steps I took to debug another leak):
https://cwiki.apache.org/confluence/display/FLINK/Debugging+ClassLoader+leaks
On 19/04/2022 12:01, huweihua wrote:
Hi, John
Sorry for the late reply. You can use MAT [1] to
analyze the dump file. Check whether there are too many
loaded classes.
[1] https://www.eclipse.org/mat/
On 18/04/2022 21:55, John Smith
<java.dev....@gmail.com> wrote:
Hi, can anyone help with this? I never looked at a
dump file before.
On Thu, Apr 14, 2022 at 11:59 AM John Smith
<java.dev....@gmail.com> wrote:
Hi, so I have a dump file. What do I look for?
On Thu, Mar 31, 2022 at 3:28 PM John Smith
<java.dev....@gmail.com> wrote:
Ok so if there's a leak, if I manually stop
the job and restart it from the UI multiple
times, I won't see the issue because
the classes are unloaded correctly?
On Thu, Mar 31, 2022 at 9:20 AM huweihua
<huweihua....@gmail.com> wrote:
The difference is that manually
canceling the job stops the JobMaster,
but automatic failover keeps the
JobMaster running. Looking at the
TaskManager, though, it doesn't make much difference.
On 31/03/2022 04:01, John Smith
<java.dev....@gmail.com> wrote:
Also, if I manually cancel and restart
the same job over and over, is it the
same as if Flink was restarting a job
due to failure?
I.e.: when I click "Cancel Job" in the
UI, is the job completely unloaded, vs.
when the job scheduler restarts a job
because of whatever reason?
Like this I'll stop and restart the job
a few times, or maybe I can trick my job
into failing and have the scheduler restart
it. Ok, let me think about this...
On Wed, Mar 30, 2022 at 10:24 AM 胡伟华
<huweihua....@gmail.com> wrote:
So if I run the same jobs in my
dev env will I still be able to
see a similar dump?

I think running the same job in dev
should reproduce it; maybe you
can have a try.

If not I would have to wait for a
low-volume time to do it on
production. Also, if I recall, the
dump is as big as the JVM memory,
right? So if I have 10GB configured
for the JVM the dump will be a 10GB
file?
Yes, jmap will pause the JVM; the
length of the pause depends on the size
of the dump. You can use "jmap
-dump:live" to dump only the
reachable objects, which only needs a
brief pause.
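For example (file paths are placeholders, and jps is just one way to find the PID):

```shell
# Find the TaskManager JVM's PID:
jps -l

# Full dump: pauses the JVM for the whole dump, file is roughly heap-sized.
jmap -dump:format=b,file=/tmp/taskmanager.hprof <pid>

# Live objects only: runs a GC first, so the file and the pause are smaller.
jmap -dump:live,format=b,file=/tmp/taskmanager-live.hprof <pid>
```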
On 30/03/2022 21:47, John Smith
<java.dev....@gmail.com> wrote:
I have 3 task managers (see config
below). There is a total of 10 jobs
with 25 slots being used.
The jobs are 100% ETL, i.e. they
load JSON, transform it, and push
it to JDBC; only 1 job of the 10
is pushing to an Apache Ignite cluster.
For jmap: I know that it will
pause the task manager. So if I
run the same jobs in my dev env
will I still be able to see a
similar dump? I assume so. If
not I would have to wait for a low-
volume time to do it on
production. Also, if I recall, the
dump is as big as the JVM memory,
right? So if I have 10GB configured
for the JVM the dump will be a 10GB
file?
# Operating system has 16GB total.
env.ssh.opts: -l flink -oStrictHostKeyChecking=no
cluster.evenly-spread-out-slots: true
taskmanager.memory.flink.size: 10240m
taskmanager.memory.jvm-metaspace.size: 2048m
taskmanager.numberOfTaskSlots: 16
parallelism.default: 1
high-availability: zookeeper
high-availability.storageDir: file:///mnt/flink/ha/flink_1_14/
high-availability.zookeeper.quorum: ...
high-availability.zookeeper.path.root: /flink_1_14
high-availability.cluster-id: /flink_1_14_cluster_0001
web.upload.dir: /mnt/flink/uploads/flink_1_14
state.backend: rocksdb
state.backend.incremental: true
state.checkpoints.dir: file:///mnt/flink/checkpoints/flink_1_14
state.savepoints.dir: file:///mnt/flink/savepoints/flink_1_14
On Wed, Mar 30, 2022 at 2:16 AM
胡伟华 <huweihua....@gmail.com> wrote:
Hi, John
Could you tell us your application scenario? Is it a
Flink session cluster with a lot of jobs?
Maybe you can dump the memory with jmap and use a
tool such as MAT to analyze whether there are
abnormal classes and classloaders.
> On 30/03/2022 06:09, John Smith <java.dev....@gmail.com> wrote:
>
> Hi, running 1.14.4
>
> My task managers still fail with java.lang.OutOfMemoryError:
> Metaspace. The metaspace out-of-memory error has occurred. This can
> mean two things: either the job requires a larger size of JVM
> metaspace to load classes or there is a class loading leak.
>
> I have 2GB of metaspace configured:
> taskmanager.memory.jvm-metaspace.size: 2048m
>
> But the task nodes still fail.
>
> When looking at the UI metrics, the metaspace starts low. Now I see
> 85% usage. It seems to be a class loading leak at this point; how can
> we debug this issue?