Since I'm running in a container, I was able to copy some of the jars to 
Flink's lib folder. When it comes to gRPC, I don't know if there's any other 
good option due to possible issues with ThreadLocals: 
https://github.com/grpc/grpc-java/issues/8309

Even then, I'm not sure that's a complete solution. I added a class (in the lib 
folder) that logs loaded/unloaded class counts with ClassLoadingMXBean, and 
even though the number of classes loaded increases more slowly with each job, 
it still increases. In a heap dump I took before moving jars to /lib, I could 
see multiple instances (one per job, it seems) of some of my job's classes 
(e.g. sources), and their GC roots were the Flink User Class Loader. I haven't 
figured out why they would remain across different jobs.
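
For reference, a minimal sketch of the kind of logging class described above 
could look like the following. The class and method names (ClassLoadingStats, 
snapshot) are hypothetical and are not the ones actually deployed. Only the 
standard java.lang.management API is used.

```java
import java.lang.management.ClassLoadingMXBean;
import java.lang.management.ManagementFactory;

// Hypothetical helper: reports JVM-wide class loading counters.
public class ClassLoadingStats {

    // Returns a one-line summary of the current class loading counters.
    public static String snapshot() {
        ClassLoadingMXBean bean = ManagementFactory.getClassLoadingMXBean();
        return String.format(
            "loaded=%d totalLoaded=%d unloaded=%d",
            bean.getLoadedClassCount(),       // classes currently loaded
            bean.getTotalLoadedClassCount(),  // classes loaded since JVM start
            bean.getUnloadedClassCount());    // classes unloaded since JVM start
    }

    public static void main(String[] args) {
        System.out.println(snapshot());
    }
}
```

Calling snapshot() after each job and comparing the "loaded" value between 
runs shows whether classes from earlier jobs are being unloaded. If "unloaded" 
stays flat while "loaded" keeps growing, the user class loaders of previous 
jobs are probably still reachable.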

Regards,
Alexis.

________________________________
From: Alexis Sarda-Espinosa <alexis.sarda-espin...@microfocus.com>
Sent: Thursday, July 8, 2021 12:51 AM
To: user@flink.apache.org <user@flink.apache.org>
Subject: Re: OOM Metaspace after multiple jobs

I now see there have been problems with this in the past:

https://issues.apache.org/jira/browse/FLINK-16142
https://issues.apache.org/jira/browse/FLINK-19005

I actually use both JDBC and gRPC, so it seems this could indeed be an issue 
for me. Does anyone know if I can ensure my classes get cleaned up? In this 
scenario only my jobs would be running in the cluster, so I can have a bit more 
control.
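
One cleanup that may be relevant for the JDBC part: java.sql.DriverManager 
keeps registered drivers in a static registry, so a driver loaded by a job's 
class loader can keep that class loader reachable. The sketch below 
deregisters such drivers. It is an assumption that this applies here, the 
names (JdbcDriverCleanup, deregisterDriversOf) are hypothetical, and it does 
not address the gRPC side.

```java
import java.sql.Driver;
import java.sql.DriverManager;
import java.sql.SQLException;
import java.util.Collections;

// Hypothetical helper: removes JDBC drivers that were loaded by a given
// class loader from DriverManager's static registry.
public class JdbcDriverCleanup {

    // Returns the number of drivers that were deregistered.
    public static int deregisterDriversOf(ClassLoader loader) {
        int removed = 0;
        // getDrivers() only lists drivers visible to the calling class.
        for (Driver driver : Collections.list(DriverManager.getDrivers())) {
            if (driver.getClass().getClassLoader() == loader) {
                try {
                    DriverManager.deregisterDriver(driver);
                    removed++;
                } catch (SQLException e) {
                    // Keep going; a driver left registered is only a leak.
                    System.err.println("Could not deregister " + driver + ": " + e);
                }
            }
        }
        return removed;
    }
}
```

It would have to be called from user code when the job shuts down (for 
example from a close() method), passing the class loader that loaded the job's 
classes.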

Regards,
Alexis.

________________________________
From: Alexis Sarda-Espinosa <alexis.sarda-espin...@microfocus.com>
Sent: Thursday, July 8, 2021 12:14 AM
To: user@flink.apache.org <user@flink.apache.org>
Subject: OOM Metaspace after multiple jobs

Hello,

I am currently testing a scenario where I would run the same job multiple times 
in a loop with different inputs each time. I am testing with a local Flink 
cluster v1.12.4. I initially got an OOM - Metaspace error, so I increased the 
corresponding memory in the TM's JVM (to 512m), but it still fails sometimes.

I found this issue that talked about Python jobs: 
https://issues.apache.org/jira/browse/FLINK-20333, but there is a comment there 
saying that it would also affect Java jobs. The commit linked there seems to be 
concerned with Python only. Was this also fixed in 1.12.0 for Java?

Is there anything I could do to force a more thorough class loader cleanup 
after each call to execute()?

Regards,
Alexis.

