Hello,
I would like clarification on the StreamingFileSink, thank you.

>From my testing, it seems that resuming job from checkpoint does *not* also
restore the rolling part counter.

E.g, job may have stopped with last file:
*part-6-71*

But when resuming from most recent checkpoint:
*part-6-89*
(There is unexplained gap).

This is a problem if I am having an issue with my job, and need to roll
back *more than one checkpoint*. After rolling back to the 4th last
checkpoint, e.g, the data will be written into *different part file names*,
causing duplication.
-----------------------------------------------------------------
For example, checkpoints:
*chk-17, chk-18, chk-19, chk-20*

Original data:
*part-1-5, part-1-6, part-1-7*

Rollback to *chk-17*, which writes *part-1-18*, but with the same data as
*part-1-5*! This is duplicate.
------------------------------------------------------------------
Am I correct? How to avoid this?

Reply via email to