Hi Flink CDC community,

I have a question regarding incremental snapshot behavior when restoring a
Flink CDC job from a checkpoint.

Suppose a CDC job is restored from a checkpoint and, at the same time, new
tables are added to the capture list (whitelist).

As I understand the current implementation:

   1. Existing tables continue consuming binlog.

   2. Newly added tables start incremental snapshot.

   3. After all snapshot splits finish, the BinlogSplit obtains all finished
      snapshot split infos from SplitEnumerator.

   4. The new binlog split start offset is adjusted to the minimum watermark
      among all completed snapshot splits.
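
To make step 4 concrete, here is how I picture the offset adjustment. This
is only a minimal sketch of my understanding, not the actual Flink CDC code:
the class and method names are hypothetical, and a binlog offset is reduced
to a single long, whereas a real offset carries more fields (file name,
position, GTID set, and so on).

```java
import java.util.List;

// Simplified model, not Flink CDC code: offsets are plain long positions.
final class StartOffsetSketch {

    // Hypothetical helper: the adjusted binlog start offset is the smallest
    // watermark among the finished snapshot splits.
    static long minWatermark(List<Long> splitWatermarks) {
        if (splitWatermarks.isEmpty()) {
            throw new IllegalArgumentException("no finished snapshot splits");
        }
        long min = Long.MAX_VALUE;
        for (long watermark : splitWatermarks) {
            min = Math.min(min, watermark);
        }
        return min;
    }

    public static void main(String[] args) {
        // Three finished splits of a newly added table (made-up positions).
        System.out.println(minWatermark(List.of(520L, 480L, 610L))); // prints 480
    }
}
```

In this model, if the standalone binlog reading had already reached, say,
position 600, the adjusted start offset of 480 lies behind it.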

My confusion is about possible duplicate consumption.

During the period when BinlogSplit runs independently (before switching to
the adjusted start offset), existing tables may already consume and emit
some binlog events.

Later, after the binlog start offset moves backward to the minimum snapshot
watermark, those previously consumed events appear to become readable again.

From my reading, shouldEmit() seems to compare the current binlog position
with snapshot offsets to decide whether to emit records.
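
Roughly, the check I have in mind looks like the sketch below. It is a
simplified model of my reading, not the real shouldEmit(): the class name,
the per-table watermark map, and the long-valued positions are all
assumptions made for illustration.

```java
import java.util.Map;

// Simplified model, not Flink CDC code: one watermark per table,
// offsets reduced to plain long positions.
final class ShouldEmitSketch {

    // Hypothetical filter: emit a binlog event only if its position lies
    // strictly after the snapshot watermark recorded for its table.
    static boolean shouldEmit(Map<String, Long> tableWatermarks,
                              String table, long position) {
        Long watermark = tableWatermarks.get(table);
        // Assumption: a table without a recorded watermark is not filtered.
        return watermark == null || position > watermark;
    }

    public static void main(String[] args) {
        // Made-up watermarks: the existing table was snapshotted long ago.
        Map<String, Long> watermarks =
                Map.of("existing_table", 100L, "new_table", 480L);

        System.out.println(shouldEmit(watermarks, "new_table", 470L));      // false
        System.out.println(shouldEmit(watermarks, "new_table", 490L));      // true
        System.out.println(shouldEmit(watermarks, "existing_table", 470L)); // true
    }
}
```

The last call is the case I am unsure about: under this model, an event for
an existing table passes the check both on the first read and again after
the start offset moves backward.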

However, I could not fully understand how duplication is avoided for
existing tables whose records were already emitted during that standalone
binlog reading period.

Specifically:

   - Is there additional state tracking to prevent re-emitting events already
     processed before offset adjustment?

   - Or is duplication possible in this scenario and expected to be handled
     downstream?

   - Which component guarantees exactly-once semantics here?

I may have misunderstood some part of the snapshot/binlog coordination
logic, so any clarification would be greatly appreciated.

Thanks!
