Hi Alexis,

I hope I'm not stating the obvious, but have you checked this documentation page?
https://ci.apache.org/projects/flink/flink-docs-master/docs/ops/debugging/debugging_classloading/#unloading-of-dynamically-loaded-classes-in-user-code

In particular, the shutdown hooks we introduced in Flink 1.13 could be helpful for you. This is an example of how we use the hooks with the Kinesis connector, which also produced leaks:
https://github.com/apache/flink/pull/14372/files

Also, check this out:
https://cwiki.apache.org/confluence/display/FLINK/Debugging+ClassLoader+leaks

(FYI: I won't be able to follow up on this thread for a while, because I'm going on vacation soon.)

On Fri, Jul 16, 2021 at 9:24 AM Alexis Sarda-Espinosa <alexis.sarda-espin...@microfocus.com> wrote:

> Since I'm running in a container, I was able to copy some of the jars to
> Flink's lib folder. When it comes to gRPC, I don't know if there's any
> other good option due to possible issues with ThreadLocals:
> https://github.com/grpc/grpc-java/issues/8309
>
> Even then, I’m not sure that’s a complete solution. I added a class (in
> the lib folder) that logs loaded/unloaded class counts with
> ClassLoadingMXBean, and even though the number of classes loaded increases
> more slowly with each job, it still increases. In a heapdump I took before
> moving jars to /lib, I could see multiple instances (one per job, it seems)
> of some of my job’s classes (e.g. sources), and their GC roots were the
> Flink User Class Loader. I haven’t figured out why they would remain across
> different jobs.
>
> Regards,
> Alexis.
>
> ------------------------------
> *From:* Alexis Sarda-Espinosa <alexis.sarda-espin...@microfocus.com>
> *Sent:* Thursday, July 8, 2021 12:51 AM
> *To:* user@flink.apache.org <user@flink.apache.org>
> *Subject:* Re: OOM Metaspace after multiple jobs
>
> I now see there have been problems with this in the past:
>
> https://issues.apache.org/jira/browse/FLINK-16142
> https://issues.apache.org/jira/browse/FLINK-19005
>
> I actually use both JDBC and gRPC, so it seems this could indeed be an
> issue for me. Does anyone know if I can ensure my classes get cleaned up?
> In this scenario only my jobs would be running in the cluster, so I can
> have a bit more control.
>
> Regards,
> Alexis.
>
> ------------------------------
> *From:* Alexis Sarda-Espinosa <alexis.sarda-espin...@microfocus.com>
> *Sent:* Thursday, July 8, 2021 12:14 AM
> *To:* user@flink.apache.org <user@flink.apache.org>
> *Subject:* OOM Metaspace after multiple jobs
>
> Hello,
>
> I am currently testing a scenario where I would run the same job multiple
> times in a loop with different inputs each time. I am testing with a local
> Flink cluster v1.12.4. I initially got an OOM - Metaspace error, so I
> increased the corresponding memory in the TM's JVM (to 512m), but it still
> fails sometimes.
>
> I found this issue that talked about Python jobs:
> https://issues.apache.org/jira/browse/FLINK-20333, but there is a comment
> there saying that it would also affect Java jobs. The commit linked there
> seems to be concerned with Python only. Was this also fixed in 1.12.0 for
> Java?
>
> Is there anything I could do to force a more thorough class loader cleanup
> after each call to execute() ?
>
> Regards,
> Alexis.
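
For anyone following along: the quoted message mentions logging loaded/unloaded class counts with ClassLoadingMXBean but does not show the class. Below is a minimal sketch of what such a helper might look like, using only the JDK. The class name ClassLoadingStats, its method names, and the log format are assumptions made for illustration; they are not taken from the thread or from Flink.

```java
import java.lang.management.ClassLoadingMXBean;
import java.lang.management.ManagementFactory;

/** Hypothetical helper; name and output format are illustrative only. */
public final class ClassLoadingStats {

    private ClassLoadingStats() {}

    /** Returns {currently loaded, total loaded since JVM start, total unloaded}. */
    public static long[] snapshot() {
        ClassLoadingMXBean bean = ManagementFactory.getClassLoadingMXBean();
        return new long[] {
            bean.getLoadedClassCount(),      // classes loaded right now
            bean.getTotalLoadedClassCount(), // cumulative loads
            bean.getUnloadedClassCount()     // cumulative unloads
        };
    }

    /** Formats a snapshot as a single log line. */
    public static String format(String label, long[] s) {
        return String.format(
            "%s: loaded=%d totalLoaded=%d unloaded=%d", label, s[0], s[1], s[2]);
    }

    public static void main(String[] args) {
        System.out.println(format("before job", snapshot()));
        // ... the job would run here (e.g. a call to execute()) ...
        System.out.println(format("after job", snapshot()));
    }
}
```

If the "loaded" figure keeps growing from one job to the next while "unloaded" stays flat, that would be consistent with user-code class loaders not being released. The counts are per JVM, so to say anything about the TaskManager's metaspace the helper would have to run inside the TaskManager process.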