Since I'm running in a container, I was able to copy some of the jars to Flink's lib folder. When it comes to gRPC, I don't know if there's any other good option due to possible issues with ThreadLocals: https://github.com/grpc/grpc-java/issues/8309
Even then, I'm not sure that's a complete solution. I added a class (in the lib folder) that logs loaded/unloaded class counts with ClassLoadingMXBean, and even though the number of classes loaded increases more slowly with each job, it still increases.

In a heapdump I took before moving jars to /lib, I could see multiple instances (one per job, it seems) of some of my job's classes (e.g. sources), and their GC roots were the Flink User Class Loader. I haven't figured out why they would remain across different jobs.

Regards,
Alexis.

________________________________
From: Alexis Sarda-Espinosa <alexis.sarda-espin...@microfocus.com>
Sent: Thursday, July 8, 2021 12:51 AM
To: user@flink.apache.org <user@flink.apache.org>
Subject: Re: OOM Metaspace after multiple jobs

I now see there have been problems with this in the past:

https://issues.apache.org/jira/browse/FLINK-16142
https://issues.apache.org/jira/browse/FLINK-19005

I actually use both JDBC and gRPC, so it seems this could indeed be an issue for me. Does anyone know if I can ensure my classes get cleaned up? In this scenario only my jobs would be running in the cluster, so I can have a bit more control.

Regards,
Alexis.

________________________________
From: Alexis Sarda-Espinosa <alexis.sarda-espin...@microfocus.com>
Sent: Thursday, July 8, 2021 12:14 AM
To: user@flink.apache.org <user@flink.apache.org>
Subject: OOM Metaspace after multiple jobs

Hello,

I am currently testing a scenario where I would run the same job multiple times in a loop with different inputs each time. I am testing with a local Flink cluster v1.12.4. I initially got an OOM - Metaspace error, so I increased the corresponding memory in the TM's JVM (to 512m), but it still fails sometimes.

I found this issue that talked about Python jobs: https://issues.apache.org/jira/browse/FLINK-20333, but there is a comment there saying that it would also affect Java jobs. The commit linked there seems to be concerned with Python only. Was this also fixed in 1.12.0 for Java? Is there anything I could do to force a more thorough class loader cleanup after each call to execute()?

Regards,
Alexis.
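
P.S. For anyone who wants to reproduce the class-count logging mentioned in the latest reply, a minimal sketch could look like the one below. It is not the exact class used in the thread: the name ClassCountLogger, the start() helper and the logging interval are assumptions made for illustration. It relies only on the standard java.lang.management API and would need to be on the JVM's own classpath (e.g. Flink's lib folder) so that it is not itself loaded by the user code class loader.

```java
import java.lang.management.ClassLoadingMXBean;
import java.lang.management.ManagementFactory;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

/** Hypothetical helper: class name, method names and interval are illustrative. */
public class ClassCountLogger {

    private static final ClassLoadingMXBean BEAN = ManagementFactory.getClassLoadingMXBean();

    /** Formats the current JVM-wide class loading counters. */
    public static String snapshot() {
        return String.format(
            "loaded=%d totalLoaded=%d unloaded=%d",
            BEAN.getLoadedClassCount(),       // classes currently loaded
            BEAN.getTotalLoadedClassCount(),  // classes loaded since JVM start
            BEAN.getUnloadedClassCount());    // classes unloaded since JVM start
    }

    /** Starts a daemon thread that prints the counters at a fixed interval. */
    public static ScheduledExecutorService start(long periodSeconds) {
        ScheduledExecutorService executor = Executors.newSingleThreadScheduledExecutor(r -> {
            Thread t = new Thread(r, "class-count-logger");
            t.setDaemon(true); // do not keep the JVM alive because of the logger
            return t;
        });
        executor.scheduleAtFixedRate(
            () -> System.out.println(snapshot()), 0, periodSeconds, TimeUnit.SECONDS);
        return executor;
    }

    public static void main(String[] args) throws InterruptedException {
        ScheduledExecutorService executor = start(1);
        Thread.sleep(2500); // let a few samples print
        executor.shutdownNow();
    }
}
```

If the "unloaded" counter stays flat while "loaded" keeps growing from one job to the next, that would be consistent with the user code class loaders not being garbage collected, which matches what the heap dump suggested.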