Thank you for the quick response.
Your answer relates to the checkpoint folder that contains the _metadata file,
e.g. chk-1829.
What about the "shared" folder? How do I know which files in that folder are
still relevant and which are left over from a failed checkpoint? They are not
directly referenced by the checkpoint's _metadata, or am I missing something?
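
Just to make sure I understand the check you suggest, below is a rough sketch
of what I had in mind (Java). The loadCheckpointMetadata signature and return
type differ between Flink versions, and extractReferencedFileNames is only a
hypothetical placeholder for walking the operator / keyed state handles of the
loaded metadata, so please treat this as a sketch rather than working code:

import org.apache.flink.runtime.checkpoint.Checkpoints;

import java.io.DataInputStream;
import java.io.FileInputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Set;
import java.util.stream.Stream;

public class SharedFolderCheck {

    public static void main(String[] args) throws Exception {
        // args[0]: path to the latest chk-xxxx/_metadata file
        // args[1]: path to the checkpoint "shared" directory
        Path metadataFile = Paths.get(args[0]);
        Path sharedDir = Paths.get(args[1]);

        // Load the latest _metadata. The exact signature of loadCheckpointMetadata
        // differs between Flink versions (newer versions also take the external
        // pointer as an extra argument), as does its return type, hence Object here.
        Object metadata;
        try (DataInputStream in =
                 new DataInputStream(new FileInputStream(metadataFile.toFile()))) {
            metadata = Checkpoints.loadCheckpointMetadata(
                in, SharedFolderCheck.class.getClassLoader());
        }

        // Hypothetical helper (not a Flink API): walk the operator states and
        // incremental keyed state handles of the loaded metadata and collect the
        // names of the shared-state files they reference.
        Set<String> referenced = extractReferencedFileNames(metadata);

        // Any file in shared/ that the latest retained checkpoint does not
        // reference and that is older than that checkpoint is a deletion candidate.
        try (Stream<Path> files = Files.list(sharedDir)) {
            files.map(p -> p.getFileName().toString())
                 .filter(name -> !referenced.contains(name))
                 .forEach(name ->
                     System.out.println("not referenced by latest checkpoint: " + name));
        }
    }

    // Placeholder only: the real traversal depends on the Flink version, roughly
    // OperatorState -> OperatorSubtaskState -> IncrementalKeyedStateHandle#getSharedState().
    private static Set<String> extractReferencedFileNames(Object metadata) {
        throw new UnsupportedOperationException("fill in for your Flink version");
    }
}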


On 2020/04/07 18:37:57, Yun Tang <myas...@live.com> wrote: 
> Hi Shachar
> 
> Why do we see data that is older than the lateness configuration?
> There might be three reasons:
> 
>   1.  RocksDB really does still need that file in the current checkpoint. If we 
> uploaded a file named 42.sst on 2/4 as part of some old checkpoint, the current 
> checkpoint can still include that same 42.sst if the file has never been 
> compacted since then. This is possible in theory.
>   2.  Your checkpoint size is large and the checkpoint coordinator could not 
> remove the old files fast enough before exiting.
>   3.  That file was created by a crashed task manager and is not known to the 
> checkpoint coordinator.
> 
> How do I know whether the files belong to a valid checkpoint and not to a 
> checkpoint of a crashed job, so we can delete those files?
> You have to call Checkpoints#loadCheckpointMetadata [1] to load the latest 
> _metadata in the checkpoint directory and compare the file paths it references 
> with the files currently in the checkpoint directory. The files that are not in 
> the checkpoint metadata and that are older than the latest checkpoint can be 
> removed. You could follow this approach to debug, or maybe I could later write 
> a tool that helps identify which files can be deleted.
> 
> [1] 
> https://github.com/apache/flink/blob/693cb6adc42d75d1db720b45013430a4c6817d4a/flink-runtime/src/main/java/org/apache/flink/runtime/checkpoint/Checkpoints.java#L96
> 
> Best
> Yun Tang
> 
> ________________________________
> From: Shachar Carmeli <carmeli....@gmail.com>
> Sent: Tuesday, April 7, 2020 16:19
> To: user@flink.apache.org <user@flink.apache.org>
> Subject: Flink incremental checkpointing - how long is data kept in the 
> shared folder
> 
> We are using Flink 1.6.3, keeping the checkpoints in CEPH, retaining only 
> one checkpoint at a time, and using incremental checkpoints with RocksDB.
> 
> We run windows with a lateness of 3 days, which means we expect that no 
> data in the checkpoint shared folder will be kept after 3-4 days. Still, we 
> see data that is older than that, e.g. if today is 7/4 there are some files 
> from 2/4.
> 
> Sometimes we see checkpoints that we assume (because their index number is 
> out of sync) belong to a job that crashed, and the checkpoint was not used 
> to restore the job.
> 
> My questions are:
> 
> Why do we see data that is older than the lateness configuration?
> How do I know whether the files belong to a valid checkpoint and not to a 
> checkpoint of a crashed job, so we can delete those files?
> 
