We are using Flink 1.6.3 with the RocksDB state backend and incremental 
checkpoints. The checkpoints are stored in Ceph, and we retain only one 
checkpoint at a time.

We run windows with an allowed lateness of 3 days, so we expect that no 
data in the checkpoint shared folder is kept for more than 3-4 days. Still, 
we see data that is older than that.
For example, if today is 7/4, there are some files from 2/4.
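
To make the observation above reproducible, here is a minimal sketch of how 
such files could be listed. It assumes the Ceph checkpoint directory is 
mounted as a local file system path; the function name files_older_than and 
the example path are hypothetical and not part of Flink.

```python
import os
import time


def files_older_than(shared_dir, max_age_days, now=None):
    """Return sorted (path, age_in_days) pairs for files under shared_dir
    whose modification time is more than max_age_days in the past."""
    now = time.time() if now is None else now
    cutoff = now - max_age_days * 86400  # 86400 seconds per day
    stale = []
    for root, _dirs, names in os.walk(shared_dir):
        for name in names:
            path = os.path.join(root, name)
            mtime = os.path.getmtime(path)
            if mtime < cutoff:
                stale.append((path, (now - mtime) / 86400))
    return sorted(stale)


# Example call (hypothetical mount point, 4 days = lateness plus one day):
# for path, age in files_older_than("/mnt/ceph/checkpoints/<job-id>/shared", 4):
#     print(f"{age:5.1f} days  {path}")
```

This only lists candidates by age. It does not tell whether a file is still 
referenced by the latest checkpoint, so nothing should be deleted based on 
this output alone.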

Sometimes we see checkpoints that we assume (because their index numbers 
are out of sync) belong to a job that crashed, where the checkpoint was 
never used to restore the job.

My questions are:

1. Why do we see data that is older than the lateness configuration?
2. How do I know whether the files belong to a valid checkpoint and not to 
   a checkpoint of a crashed job, so that we can delete those files?
