Thank you for the quick response.
Your answer relates to the checkpoint folder that contains the _metadata file,
e.g. chk-1829.
What about the "shared" folder? How do I know which files in that folder are
still relevant and which are left over from a failed checkpoint? They are not
directly related to the checkpoint's _metadata file, or am I missing something?
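
For reference, here is a rough sketch of the comparison suggested in the reply
quoted below: list the files in the "shared" folder that are not referenced by
the latest _metadata and are older than the latest checkpoint. It rests on
several assumptions. The set of referenced paths has already been extracted
from _metadata (e.g. via Checkpoints#loadCheckpointMetadata, not shown here).
File names in the shared folder are unique, so comparing by file name is
enough. The file modification time is a usable proxy for "older than the
latest checkpoint". The function name find_orphan_candidates is hypothetical.
It only reports candidates and deletes nothing.

```python
from pathlib import Path, PurePosixPath


def find_orphan_candidates(shared_dir, referenced_paths, latest_checkpoint_mtime):
    """List files in shared_dir that look like leftovers.

    referenced_paths: paths or URIs taken from the latest _metadata
        (assumed to be extracted elsewhere).
    latest_checkpoint_mtime: modification time (seconds since epoch) of the
        latest completed checkpoint, e.g. of its _metadata file.
    """
    # Assumption: file names in the shared folder are unique, so the last
    # path component is enough to match a reference to a file on disk.
    referenced_names = {PurePosixPath(p).name for p in referenced_paths}

    candidates = []
    for entry in sorted(Path(shared_dir).iterdir()):
        if not entry.is_file():
            continue
        if entry.name in referenced_names:
            continue  # still needed by the latest checkpoint
        if entry.stat().st_mtime >= latest_checkpoint_mtime:
            continue  # may belong to a checkpoint that is still in progress
        candidates.append(entry)
    return candidates
```

The result should be reviewed by hand before anything is removed, since a file
that was uploaded long ago can still be referenced by the current checkpoint.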

On 2020/04/07 18:37:57, Yun Tang <> wrote: 
> Hi Shachar
> Why do we see data that is older than the lateness configuration?
> There might be three reasons:
>   1.  RocksDB really still needs that file in the current checkpoint. If we
> uploaded a file named 42.sst on 2/4 in some old checkpoint, the current
> checkpoint could still include that 42.sst file if it has never been
> compacted since then. This is possible in theory.
>   2.  Your checkpoint size is large and the checkpoint coordinator could not
> remove the files fast enough before exiting.
>   3.  That file was created by a crashed task manager and is not known to
> the checkpoint coordinator.
> How do I know that the files belong to a valid checkpoint and not a
> checkpoint of a crashed job, so we can delete those files?
> You have to call Checkpoints#loadCheckpointMetadata[1] to load the latest
> _metadata in the checkpoint directory and compare the file paths it contains
> with the files currently in the checkpoint directory. Files that are not in
> the checkpoint metadata and are older than the latest checkpoint could be
> removed. You could follow this to debug, or maybe I could later write a tool
> to help find out which files can be deleted.
> [1] 
> Best
> Yun Tang
> ________________________________
> From: Shachar Carmeli <>
> Sent: Tuesday, April 7, 2020 16:19
> To: <>
> Subject: Flink incremental checkpointing - how long does data is kept in the 
> share folder
> We are using Flink 1.6.3 and keeping the checkpoints in CEPH, retaining only
> one checkpoint at a time, using incremental checkpointing with RocksDB.
> We run windows with a lateness of 3 days, which means we expect that no
> data in the checkpoint shared folder will be kept after 3-4 days. Still, we
> see that there is data older than that,
> e.g.
> if today is 7/4 there are some files from 2/4.
> Sometimes we see checkpoints that we assume (because their index number is
> not in sync) belong to a job that crashed, and the checkpoint was not used
> to restore the job.
> My questions are:
> Why do we see data that is older than the lateness configuration?
> How do I know that the files belong to a valid checkpoint and not a
> checkpoint of a crashed job, so we can delete those files?
