Thank you for the quick response.

Your answer relates to the checkpoint folder that contains the _metadata file, e.g. chk-1829.

What about the "shared" folder? How do I know which files in that folder are still relevant and which are left over from a failed checkpoint? They are not directly related to the _metadata checkpoint, or am I missing something?

On 2020/04/07 18:37:57, Yun Tang <myas...@live.com> wrote:
> Hi Shachar
>
> Why do we see data that is older than the lateness configuration?
> There might be three reasons:
>
> 1. RocksDB really still needs that file in the current checkpoint. If we
>    uploaded a file named 42.sst on 2/4 in some old checkpoint, the current
>    checkpoint could still include that 42.sst file if it has never been
>    compacted since then. This is possible in theory.
> 2. Your checkpoint size is large and the checkpoint coordinator could not
>    remove the files fast enough before exit.
> 3. The file was created by a crashed task manager and is not known to the
>    checkpoint coordinator.
>
> How do I know that the files belong to a valid checkpoint and not a
> checkpoint of a crashed job - so we can delete those files?
> You have to call Checkpoints#loadCheckpointMetadata[1] to load the latest
> _metadata in the checkpoint directory and compare its file paths with the
> files currently in the checkpoint directory. The ones that are not in the
> checkpoint metadata and are older than the latest checkpoint can be removed.
> You could follow this to debug, or maybe I could write a tool later to help
> determine which files can be deleted.
>
> [1]
> https://github.com/apache/flink/blob/693cb6adc42d75d1db720b45013430a4c6817d4a/flink-runtime/src/main/java/org/apache/flink/runtime/checkpoint/Checkpoints.java#L96
>
> Best
> Yun Tang
>
> ________________________________
> From: Shachar Carmeli <carmeli....@gmail.com>
> Sent: Tuesday, April 7, 2020 16:19
> To: user@flink.apache.org <user@flink.apache.org>
> Subject: Flink incremental checkpointing - how long does data is kept in the
> share folder
>
> We are using Flink 1.6.3 and keeping the checkpoints in CEPH, retaining only
> one checkpoint at a time, using incremental checkpointing and RocksDB.
>
> We run windows with a lateness of 3 days, which means we expect that no
> data in the checkpoint shared folder will be kept after 3-4 days. Still, we
> see data that is older than that,
> e.g. if today is 7/4 there are some files from 2/4.
>
> Sometimes we see checkpoints that we assume (because their index number is
> not in sync) belong to a job that crashed, where the checkpoint was not used
> to restore the job.
>
> My questions are:
>
> Why do we see data that is older than the lateness configuration?
> How do I know that the files belong to a valid checkpoint and not a
> checkpoint of a crashed job - so we can delete those files?
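
The comparison described in the quoted reply (file paths referenced by the latest _metadata versus files actually on disk) can be sketched roughly as below. This is only an illustration under assumptions: the set of referenced paths is taken as given (it would have to come from loading _metadata through Flink, which is not shown here), and the function name, file names and timestamps are all hypothetical.

```python
from datetime import datetime
from typing import Dict, List, Set


def deletion_candidates(
    files_on_disk: Dict[str, datetime],
    referenced_paths: Set[str],
    latest_checkpoint_time: datetime,
) -> List[str]:
    """Return files that are NOT referenced by the latest _metadata AND were
    last modified before the latest checkpoint completed.

    Both conditions are required: a file newer than the latest checkpoint may
    belong to a checkpoint that is still in progress.
    """
    return sorted(
        path
        for path, modified in files_on_disk.items()
        if path not in referenced_paths and modified < latest_checkpoint_time
    )


# Hypothetical data: the latest checkpoint completed on 7/4 at 12:00.
latest = datetime(2020, 4, 7, 12, 0)
on_disk = {
    "shared/42.sst": datetime(2020, 4, 2, 9, 0),   # old, but still referenced
    "shared/57.sst": datetime(2020, 4, 3, 9, 0),   # old and unreferenced
    "shared/90.sst": datetime(2020, 4, 7, 12, 5),  # newer than the checkpoint
}
# Assumed to have been extracted from the latest _metadata file.
referenced = {"shared/42.sst"}

print(deletion_candidates(on_disk, referenced, latest))
```

In this example only shared/57.sst is reported. The old but still referenced 42.sst is kept, which matches the first reason given in the quoted reply for seeing files older than the lateness setting.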