We’re using Flink 1.3.1 on Mesos, with HA/recovery stored in S3 using
RocksDB with incremental checkpointing.

We have enabled external checkpoints (every 30s), retaining the two latest
external checkpoints.

We are trying to track down something we see happening where the recovery,
checkpoint and external checkpoints directories literally explode in size.
  When we have about 20-30 jobs running, we see these directories grow from
60-70 files to hundreds of thousands of files.


For e.g.  here are stats on a recent state we got to

flink/checkpoints    774824 Objects 5.4 GB
flink/ext-checkpoints 229300 Objects 171.8 MB
flink/recovery 229531 Objects - 909.7MB

Under these circumstances, the JobManager/AppMaster becomes completely
unresponsive.  The only option seems to be to stop it, cull those
directories and restart, which is not ideal.

Appreciate any insight on this…

Also, is there any documentation on best/ops practices on maintaining these
directories? For e.g. is it OK  to have a reaper script that cleans out
these directories for everything that is say 3 days old?

Thanks
Prashant

Reply via email to