Hi,

StreamingFileSink would not remove committed files, so if you use a non-latest 
checkpoint to restore state, you may need to perform a manual cleanup.

WRT the part id issue, StreamingFileSink will track the global max part number, 
and use this value + 1 as the new id upon restoring. In this way, we avoid file 
name conflicts with the previous execution (see[1]).

[1] 
https://github.com/apache/flink/blob/93dfdd05a84f933473c7b22437e12c03239f9462/flink-streaming-java/src/main/java/org/apache/flink/streaming/api/functions/sink/filesystem/Buckets.java#L276

Best,
Paul Lam

> 在 2019年11月21日,10:01,Lei Nie <lyzj...@gmail.com> 写道:
> 
> Hello,
> I would like clarification on the StreamingFileSink, thank you.
> 
> From my testing, it seems that resuming job from checkpoint does not also 
> restore the rolling part counter.
> 
> E.g, job may have stopped with last file:
> part-6-71
> 
> But when resuming from most recent checkpoint:
> part-6-89
> (There is unexplained gap).
> 
> This is a problem if I am having an issue with my job, and need to roll back 
> more than one checkpoint. After rolling back to the 4th last checkpoint, e.g, 
> the data will be written into different part file names, causing duplication.
> -----------------------------------------------------------------
> For example, checkpoints:
> chk-17, chk-18, chk-19, chk-20
> 
> Original data:
> part-1-5, part-1-6, part-1-7
> 
> Rollback to chk-17, which writes part-1-18, but with the same data as 
> part-1-5! This is duplicate.
> ------------------------------------------------------------------
> Am I correct? How to avoid this?

Reply via email to