We are using Flink 1.6.3 with the RocksDB state backend and incremental checkpoints. The checkpoints are stored in Ceph, and we retain only one checkpoint at a time.
We run windows with an allowed lateness of 3 days, so we expect that no data in the checkpoint shared folder will be kept for more than 3-4 days. Still, we see data that is older than that: e.g. if today is 7/4, there are some files from 2/4.

Sometimes we also see checkpoints whose index number is out of sync with the others. We assume these belong to a job that crashed, and that the checkpoint was never used to restore the job.

My questions are:

1. Why do we see data that is older than the configured lateness?
2. How do I know whether the files belong to a valid checkpoint and not to a checkpoint of a crashed job, so that we can delete those files?
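
To make the observation above reproducible, here is a minimal sketch of how the old files could be listed. It assumes the checkpoint directory is reachable as a mounted filesystem path (the path in the example is hypothetical, and a Ceph setup accessed through an object-store API would need a different listing call). The function name `files_older_than` is made up for illustration. The script only reports file ages from modification times; it does not tell whether a file is still referenced by the retained checkpoint, so it should not be used on its own to decide what to delete.

```python
import os
import time
from datetime import timedelta


def files_older_than(shared_dir, max_age, now=None):
    """Return (path, age_in_days) for files under shared_dir older than max_age.

    Age is based on the file modification time only.
    Results are sorted with the oldest file first.
    """
    now = time.time() if now is None else now
    cutoff = now - max_age.total_seconds()
    stale = []
    for root, _dirs, names in os.walk(shared_dir):
        for name in names:
            path = os.path.join(root, name)
            mtime = os.path.getmtime(path)
            if mtime < cutoff:
                stale.append((path, (now - mtime) / 86400))
    return sorted(stale, key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    # Hypothetical mount point; replace with the real checkpoint shared folder.
    shared = "/mnt/ceph/flink/checkpoints/<job-id>/shared"
    # 3 days of lateness plus 1 day of slack, as described above.
    for path, age in files_older_than(shared, timedelta(days=4)):
        print(f"{age:5.1f} days  {path}")
```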