You're misinterpreting the docs.
The parent/child-first classloading controls where Flink looks for a
class /first/, specifically whether we first load from /lib or the user-jar.
It does not allow you to load something from the user-jar in the parent
classloader. That's just not how it works.
It must be in /lib.
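As a concrete illustration, a minimal sketch of the /lib route (paths and jar names below are only placeholders for whatever drivers your jobs actually use):

    # on every task manager host (hypothetical paths/versions)
    cp ignite-core-2.13.0.jar mssql-jdbc-10.2.0.jre11.jar /opt/flink/lib/
    # restart the task managers so the jars in lib/ are picked up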
On 27/04/2022 04:59, John Smith wrote:
Hi Chesnay, as per the docs...
https://nightlies.apache.org/flink/flink-docs-master/docs/ops/debugging/debugging_classloading/
You can either put the jars in the task manager lib folder or use
classloader.parent-first-patterns.additional
<https://nightlies.apache.org/flink/flink-docs-master/docs/deployment/config/#classloader-parent-first-patterns-additional>
I prefer the latter, since the dependency stays with the user-jar
and not on the task manager.
On Tue, Apr 26, 2022 at 9:52 PM John Smith <java.dev....@gmail.com> wrote:
Ok, so I should put the Apache Ignite and my Microsoft drivers in
the lib folders of my task managers?
And then in my job jar only include them as compile-time
dependencies?
On Tue, Apr 26, 2022 at 10:42 AM Chesnay Schepler
<ches...@apache.org> wrote:
JDBC drivers are well-known for leaking classloaders
unfortunately.
You have correctly identified your alternatives.
You must put the JDBC driver into /lib instead. Setting only
the parent-first pattern shouldn't affect anything.
That is only relevant if something is in both /lib and the
user-jar, telling Flink to prioritize what is in /lib.
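A rough flink-conf.yaml sketch of how the two pieces fit together (the pattern below is just the Ignite example from this thread, and it only has an effect once the driver jar also sits in /lib):

    # flink-conf.yaml
    classloader.parent-first-patterns.additional: org.apache.ignite.
    # the matching classes must also exist in /lib for this to change anything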
On 26/04/2022 15:35, John Smith wrote:
So I put classloader.parent-first-patterns.additional:
"org.apache.ignite." in the task config and so far I don't
think I'm getting "java.lang.OutOfMemoryError: Metaspace" any
more.
Or it's too early to tell.
Though now, the task managers are shutting down due to some
other failures.
So maybe the task manager was running out of Metaspace because
tasks were failing and reloading often. But now maybe it's
just cleanly shutting down.
On Wed, Apr 20, 2022 at 11:35 AM John Smith
<java.dev....@gmail.com> wrote:
Or can I put something in the config to treat org.apache.ignite.
classes as parent-first?
On Tue, Apr 19, 2022 at 10:18 PM John Smith
<java.dev....@gmail.com> wrote:
Ok, so I loaded the dump into Eclipse MAT and followed:
https://cwiki.apache.org/confluence/display/FLINK/Debugging+ClassLoader+leaks
- On the Histogram, I got over 30 entries for: ChildFirstClassLoader
- Then I clicked on one of them, chose "Merge Shortest Path..." and picked "Exclude all phantom/weak/soft references"
- Which then gave me: SqlDriverManager > Apache Ignite JdbcThin Driver
So I'm guessing it's anything JDBC-based. Should I copy those drivers into the task manager lib folder and make the dependencies compile-only in my jobs?
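For the "compile only" part, one possible Gradle sketch (artifact coordinates and versions are assumptions; Maven's provided scope would be the equivalent):

    dependencies {
        // drivers are served from the task managers' lib/ folder at runtime,
        // so the job jar only needs them on the compile classpath
        compileOnly 'org.apache.ignite:ignite-core:2.13.0'
        compileOnly 'com.microsoft.sqlserver:mssql-jdbc:10.2.0.jre11'
    }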
On Tue, Apr 19, 2022 at 12:18 PM Yaroslav Tkachenko
<yaros...@goldsky.io> wrote:
Also
https://shopify.engineering/optimizing-apache-flink-applications-tips
might be helpful (has a section on profiling, as
well as classloading).
On Tue, Apr 19, 2022 at 4:35 AM Chesnay Schepler
<ches...@apache.org> wrote:
We have a very rough "guide" in the wiki
(it's just the specific steps I took to debug
another leak):
https://cwiki.apache.org/confluence/display/FLINK/Debugging+ClassLoader+leaks
On 19/04/2022 12:01, huweihua wrote:
Hi, John
Sorry for the late reply. You can use MAT [1] to analyze the dump file. Check whether there are too many loaded classes.
[1] https://www.eclipse.org/mat/
On April 18, 2022, at 9:55 PM, John Smith <java.dev....@gmail.com> wrote:
Hi, can anyone help with this? I've never looked at a dump file before.
On Thu, Apr 14, 2022 at 11:59 AM John Smith
<java.dev....@gmail.com> wrote:
Hi, so I have a dump file. What do I
look for?
On Thu, Mar 31, 2022 at 3:28 PM John
Smith <java.dev....@gmail.com> wrote:
Ok, so if there's a leak and I manually stop the job and restart it from the UI multiple times, I won't see the issue because the classes are unloaded correctly?
On Thu, Mar 31, 2022 at 9:20 AM
huweihua <huweihua....@gmail.com>
wrote:
The difference is that manually canceling the job stops the JobMaster, but automatic failover keeps the JobMaster running. Looking at the TaskManager, though, it doesn't make much difference.
On March 31, 2022, at 4:01 AM, John Smith <java.dev....@gmail.com> wrote:
Also, if I manually cancel and restart the same job over and over, is it the same as Flink restarting a job due to failure?
I.e., when I click "Cancel Job" in the UI, is the job completely unloaded, versus when the job scheduler restarts a job for whatever reason?
This way I'll stop and restart the job a few times, or maybe I can trick my job into failing and have the scheduler restart it. Ok, let me think about this...
On Wed, Mar 30, 2022 at 10:24
AM 胡伟华
<huweihua....@gmail.com> wrote:
> So if I run the same jobs in my dev env will I still be able to see a similar dump?

I think running the same job in dev should be reproducible; maybe you can have a try.

> If not I would have to wait for a low-volume time to do it on production. Also, if I recall, the dump is as big as the JVM memory, right? So if I have 10GB configured for the JVM, the dump will be a 10GB file?

Yes, JMAP will pause the JVM; the pause time depends on the size of the dump. You can use "jmap -dump:live" to dump only the reachable objects, which takes only a brief pause.
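For reference, a hedged sketch of the jmap invocations (PID and output paths are placeholders):

    # full heap dump; pauses the JVM while the file is written
    jmap -dump:format=b,file=/tmp/taskmanager.hprof <taskmanager-pid>
    # live objects only; triggers a GC first, smaller dump and shorter pause
    jmap -dump:live,format=b,file=/tmp/taskmanager-live.hprof <taskmanager-pid>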
On March 30, 2022, at 9:47 PM, John Smith <java.dev....@gmail.com> wrote:
I have 3 task managers (see config below). There is a total of 10 jobs with 25 slots being used.
The jobs are 100% ETL, i.e. they load JSON, transform it and push it to JDBC; only 1 job of the 10 is pushing to an Apache Ignite cluster.
For JMAP: I know that it will pause the task manager. So if I run the same jobs in my dev env, will I still be able to see a similar dump? I assume so. If not, I would have to wait for a low-volume time to do it on production. Also, if I recall, the dump is as big as the JVM memory, right? So if I have 10GB configured for the JVM, the dump will be a 10GB file?
# Operating system has 16GB total.
env.ssh.opts: -l flink -oStrictHostKeyChecking=no
cluster.evenly-spread-out-slots: true
taskmanager.memory.flink.size: 10240m
taskmanager.memory.jvm-metaspace.size: 2048m
taskmanager.numberOfTaskSlots: 16
parallelism.default: 1
high-availability: zookeeper
high-availability.storageDir: file:///mnt/flink/ha/flink_1_14/
high-availability.zookeeper.quorum: ...
high-availability.zookeeper.path.root: /flink_1_14
high-availability.cluster-id: /flink_1_14_cluster_0001
web.upload.dir: /mnt/flink/uploads/flink_1_14
state.backend: rocksdb
state.backend.incremental: true
state.checkpoints.dir: file:///mnt/flink/checkpoints/flink_1_14
state.savepoints.dir: file:///mnt/flink/savepoints/flink_1_14
On Wed, Mar 30, 2022 at
2:16 AM 胡伟华
<huweihua....@gmail.com>
wrote:
Hi, John
Could you tell us your application scenario? Is it a Flink session cluster with a lot of jobs?
Maybe you can try to dump the memory with jmap and use tools such as MAT to analyze whether there are abnormal classes and classloaders.
> On March 30, 2022, at 6:09 AM, John Smith <java.dev....@gmail.com> wrote:
>
> Hi, running 1.14.4
>
> My task managers still fail with java.lang.OutOfMemoryError: Metaspace. The metaspace out-of-memory error has occurred. This can mean two things: either the job requires a larger size of JVM metaspace to load classes or there is a class loading leak.
>
> I have 2GB of metaspace configured: taskmanager.memory.jvm-metaspace.size: 2048m
>
> But the task nodes still fail.
>
> When looking at the UI metrics, the metaspace starts low. Now I see 85% usage. It seems to be a class loading leak at this point; how can we debug this issue?