Hello Piotr,

I work with Fabian and have been investigating the memory leak associated with the issues mentioned in this thread. I took a heap dump of our master node and noticed that there was > 1 GB (and growing) worth of entries in the /files/ set of the class *java.io.DeleteOnExitHook*. Almost all the strings in this set look like /tmp/hadoop-root/s3a/output-*****.tmp.
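For context, this is the pattern I believe we are hitting. The snippet below is only an illustration (not Flink or Hadoop code): every call to File.deleteOnExit() adds the path string to the static /files/ set in *java.io.DeleteOnExitHook*, and that set is only drained when the JVM shuts down, so in a long-lived process it only ever grows, even after the files themselves have been removed.

import java.io.File;
import java.io.IOException;

public class DeleteOnExitGrowth {
    public static void main(String[] args) throws IOException {
        // Simulate a long-running process that keeps creating temp buffer files.
        for (int i = 0; i < 100_000; i++) {
            File tmp = File.createTempFile("output-", ".tmp");
            tmp.deleteOnExit();  // adds the path string to DeleteOnExitHook.files
            // ... write to the file, upload it ...
            tmp.delete();        // the file is gone, but the path string stays
                                 // in DeleteOnExitHook.files until JVM exit
        }
        // A heap dump taken here shows DeleteOnExitHook.files holding one
        // string per temp file ever created, which matches what we see on the master.
    }
}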
This means that the checkpointing code, which uploads the data to S3, first writes the data to a temporary local file, and that file is only scheduled for deletion on exit of the JVM. In our case the checkpointing is quite heavy, and because we have a long-running Flink cluster, this causes the /files/ set to grow without bound, eventually causing an OOM. Please see this image: <http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/file/t1624/Screen_Shot_2018-08-09_at_11.png>

The culprit seems to be *org.apache.hadoop.fs.s3a.S3AOutputStream*, which in turn calls *org.apache.hadoop.fs.s3a.S3AFileSystem.createTmpFileForWrite()*. If we follow the method call chain from there, we end up at *org.apache.hadoop.fs.LocalDirAllocator.createTmpFileForWrite()*, where we can see the temp file being created and deleteOnExit() being called on it.

Maybe instead of relying on *deleteOnExit()* we could keep track of these tmp files and delete them ourselves as soon as they are no longer required. A rough sketch of what I mean is below.
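To make the suggestion concrete, here is a sketch of the idea (this is not the actual S3AOutputStream code; the upload/putObject names are made up for illustration): delete the temp file explicitly once the upload is finished, and only fall back to deleteOnExit() if that delete fails.

import java.io.File;
import java.io.IOException;
import java.io.OutputStream;
import java.nio.file.Files;

public class ExplicitTmpCleanup {

    // Hypothetical stand-in for the real S3 upload call.
    static void putObject(File localBuffer) {
        // ... upload the local buffer file to S3 ...
    }

    public static void upload(byte[] data) throws IOException {
        File tmp = File.createTempFile("output-", ".tmp");
        try {
            try (OutputStream out = Files.newOutputStream(tmp.toPath())) {
                out.write(data);
            }
            putObject(tmp);
        } finally {
            // Explicit cleanup: nothing is ever added to DeleteOnExitHook.files,
            // so a long-running JVM does not accumulate path strings.
            if (!tmp.delete()) {
                tmp.deleteOnExit();  // last resort, only if the delete fails
            }
        }
    }
}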