Hi,

Thanks for getting back with more information.
Apparently this is a known JDK bug dating back to 2003 and it is still not resolved:

https://bugs.java.com/view_bug.do?bug_id=4872014
https://bugs.java.com/view_bug.do?bug_id=6664633

The code that uses `deleteOnExit` is not part of Flink but of an external library that we are using (hadoop-aws:2.8.x), so we can not fix it ourselves and this bug should be reported/forwarded to them (I have already done just that: https://issues.apache.org/jira/browse/HADOOP-15658).

More interestingly, S3AOutputStream is already manually deleting those files when they are no longer needed, in the finally block of `org.apache.hadoop.fs.s3a.S3AOutputStream#close`:

    } finally {
      if (!backupFile.delete()) {
        LOG.warn("Could not delete temporary s3a file: {}", backupFile);
      }
      super.close();
    }

However, this does not remove the entry from DeleteOnExitHook. From what I see in the code, the flink-s3-fs-presto filesystem implementation that we provide doesn't use deleteOnExit, so if you can switch to this filesystem it would solve the problem for you.

Piotrek

> On 9 Aug 2018, at 12:09, Ayush Verma <ayush.ve...@zalando.de> wrote:
>
> Hello Piotr,
>
> I work with Fabian and have been investigating the memory leak
> associated with the issues mentioned in this thread. I took a heap dump of our
> master node and noticed that there was >1gb (and growing) worth of entries
> in the set, /files/, in class *java.io.DeleteOnExitHook*.
> Almost all the strings in this set look like
> /tmp/hadoop-root/s3a/output-*****.tmp.
>
> This means that the checkpointing code, which uploads the data to s3,
> keeps it in a temporary local file, which is supposed to be deleted on
> exit of the JVM. In our case, the checkpointing is quite heavy, and because
> we have a long-running Flink cluster, this causes the /set/ to grow
> unbounded, eventually causing an OOM.
> Please see this image:
> <http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/file/t1624/Screen_Shot_2018-08-09_at_11.png>
>
> The culprit seems to be *org.apache.hadoop.fs.s3a.S3AOutputStream*, which
> in turn calls
> *org.apache.hadoop.fs.s3a.S3AFileSystem.createTmpFileForWrite()*. If we
> follow the method call chain from there, we end up at
> *org.apache.hadoop.fs.LocalDirAllocator.createTmpFileForWrite()*, where we
> can see the temp file being created and the method deleteOnExit() being
> called.
>
> Maybe instead of relying on *deleteOnExit()* we could keep track of these tmp
> files ourselves and delete them as soon as they are no longer required.
>
> --
> Sent from:
> http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/
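For what it's worth, the workaround Ayush suggests (tracking tmp files ourselves and deleting them eagerly instead of calling `File.deleteOnExit()`, whose entries stay in `java.io.DeleteOnExitHook` for the lifetime of the JVM) could be sketched roughly like this. This is only an illustrative sketch, not code from Hadoop or Flink; the class and method names (`TmpFileTracker`, `createTmpFile`, `release`) are made up for the example:

```java
import java.io.File;
import java.io.IOException;
import java.nio.file.Files;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch: track temp files in our own set and delete them
// eagerly once they are no longer needed, so no per-file entry lives
// until JVM exit (unlike File.deleteOnExit()).
public class TmpFileTracker {
    private final Set<File> tracked = ConcurrentHashMap.newKeySet();

    public TmpFileTracker() {
        // Single shutdown hook as a safety net for files never released.
        // Files removed via release() no longer appear here, so nothing
        // accumulates for the lifetime of the JVM.
        Runtime.getRuntime().addShutdownHook(new Thread(() -> {
            for (File f : tracked) {
                f.delete();
            }
        }));
    }

    public File createTmpFile(String prefix, String suffix) throws IOException {
        File f = Files.createTempFile(prefix, suffix).toFile();
        tracked.add(f);
        return f;
    }

    // Delete the file as soon as it is no longer needed and, crucially,
    // drop it from the tracking set so the set cannot grow unbounded.
    public void release(File f) {
        tracked.remove(f);
        if (!f.delete() && f.exists()) {
            System.err.println("Could not delete temporary file: " + f);
        }
    }

    public int trackedCount() {
        return tracked.size();
    }

    public static void main(String[] args) throws IOException {
        TmpFileTracker tracker = new TmpFileTracker();
        File tmp = tracker.createTmpFile("s3a-output-", ".tmp");
        // ... buffer the upload data in tmp, upload it to S3 ...
        tracker.release(tmp); // eager cleanup: no entry left behind
        System.out.println("tracked after release: " + tracker.trackedCount());
    }
}
```

The key difference from `deleteOnExit()` is the `release()` call: cleanup happens at the point the file stops being needed, and the bookkeeping entry is removed along with the file.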