Hi, StreamingFileSink would not remove committed files, so if you use a non-latest checkpoint to restore state, you may need to perform a manual cleanup.
WRT the part id issue, StreamingFileSink will track the global max part number, and use this value + 1 as the new id upon restoring. In this way, we avoid file name conflicts with the previous execution (see[1]). [1] https://github.com/apache/flink/blob/93dfdd05a84f933473c7b22437e12c03239f9462/flink-streaming-java/src/main/java/org/apache/flink/streaming/api/functions/sink/filesystem/Buckets.java#L276 Best, Paul Lam > 在 2019年11月21日,10:01,Lei Nie <lyzj...@gmail.com> 写道: > > Hello, > I would like clarification on the StreamingFileSink, thank you. > > From my testing, it seems that resuming job from checkpoint does not also > restore the rolling part counter. > > E.g, job may have stopped with last file: > part-6-71 > > But when resuming from most recent checkpoint: > part-6-89 > (There is unexplained gap). > > This is a problem if I am having an issue with my job, and need to roll back > more than one checkpoint. After rolling back to the 4th last checkpoint, e.g, > the data will be written into different part file names, causing duplication. > ----------------------------------------------------------------- > For example, checkpoints: > chk-17, chk-18, chk-19, chk-20 > > Original data: > part-1-5, part-1-6, part-1-7 > > Rollback to chk-17, which writes part-1-18, but with the same data as > part-1-5! This is duplicate. > ------------------------------------------------------------------ > Am I correct? How to avoid this?