We’re running Flink 1.3.1 on Mesos, with HA/recovery metadata stored in S3 and the RocksDB state backend with incremental checkpointing enabled.
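Roughly, the per-job setup looks like this (a minimal sketch; the S3 paths and job are placeholders, and the two-checkpoint retention is set via state.checkpoints.num-retained: 2 in flink-conf.yaml, if I have the key right):

import org.apache.flink.contrib.streaming.state.RocksDBStateBackend;
import org.apache.flink.streaming.api.environment.CheckpointConfig.ExternalizedCheckpointCleanup;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class CheckpointSetup {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Checkpoint every 30 seconds
        env.enableCheckpointing(30_000);

        // Externalized checkpoints, retained if the job is cancelled
        env.getCheckpointConfig().enableExternalizedCheckpoints(
                ExternalizedCheckpointCleanup.RETAIN_ON_CANCELLATION);

        // RocksDB backend, incremental checkpointing enabled via the second argument;
        // bucket name is a placeholder for our real S3 location
        env.setStateBackend(new RocksDBStateBackend("s3://our-bucket/flink/checkpoints", true));

        env.fromElements(1, 2, 3).print(); // placeholder pipeline
        env.execute("example-job");
    }
}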
We have enabled externalized checkpoints (every 30s), retaining the two latest external checkpoints.

We are trying to track down a problem where the recovery, checkpoint, and external checkpoint directories explode in size. With about 20-30 jobs running, these directories grow from 60-70 files to hundreds of thousands of files. For example, here are the stats from a recent occurrence:

flink/checkpoints      774,824 objects   5.4 GB
flink/ext-checkpoints  229,300 objects   171.8 MB
flink/recovery         229,531 objects   909.7 MB

Under these circumstances, the JobManager/AppMaster becomes completely unresponsive. The only option seems to be to stop it, cull those directories, and restart, which is not ideal. Appreciate any insight on this…

Also, is there any documentation on best/ops practices for maintaining these directories? For example, is it OK to have a reaper script that cleans out everything in these directories older than, say, 3 days?
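Something like this is what I have in mind (a rough sketch using the AWS SDK; the bucket and prefixes are placeholders for our layout, and I realize it could delete files the latest checkpoint still references, which is part of what I'm asking):

import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.AmazonS3ClientBuilder;
import com.amazonaws.services.s3.model.ListObjectsV2Request;
import com.amazonaws.services.s3.model.ListObjectsV2Result;
import com.amazonaws.services.s3.model.S3ObjectSummary;

import java.util.Date;
import java.util.concurrent.TimeUnit;

public class CheckpointReaper {
    public static void main(String[] args) {
        AmazonS3 s3 = AmazonS3ClientBuilder.defaultClient();

        // Anything older than 3 days is a candidate for deletion
        Date cutoff = new Date(System.currentTimeMillis() - TimeUnit.DAYS.toMillis(3));

        String bucket = "our-bucket"; // placeholder
        String[] prefixes = {"flink/checkpoints/", "flink/ext-checkpoints/", "flink/recovery/"};

        for (String prefix : prefixes) {
            ListObjectsV2Request req = new ListObjectsV2Request()
                    .withBucketName(bucket)
                    .withPrefix(prefix);
            ListObjectsV2Result result;
            do {
                // Page through the listing and delete stale objects
                result = s3.listObjectsV2(req);
                for (S3ObjectSummary obj : result.getObjectSummaries()) {
                    if (obj.getLastModified().before(cutoff)) {
                        s3.deleteObject(bucket, obj.getKey());
                    }
                }
                req.setContinuationToken(result.getNextContinuationToken());
            } while (result.isTruncated());
        }
    }
}

Thanks,
Prashant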