Hi Flink CDC community, I have a question regarding incremental snapshot behavior when restoring a Flink CDC job from checkpoint.
Suppose a CDC job is restored from a checkpoint and, at the same time, new tables are added into the capture list (whitelist). As I understand the current implementation: 1. Existing tables continue consuming binlog. 2. Newly added tables start incremental snapshot. 3. After all snapshot splits finish, the BinlogSplit obtains all finished snapshot split infos from SplitEnumerator. 4. The new binlog split start offset is adjusted to the minimum watermark among all completed snapshot splits. My confusion is about possible duplicate consumption. During the period when BinlogSplit runs independently (before switching to the adjusted start offset), existing tables may already consume and emit some binlog events. Later, after the binlog start offset moves backward to the minimum snapshot watermark, those previously consumed events appear to become readable again. >From my reading, shouldEmit() seems to compare the current binlog position with snapshot offsets to decide whether to emit records. However, I could not fully understand how duplication is avoided for existing tables whose records were already emitted during that standalone binlog reading period. Specifically: - Is there additional state tracking to prevent re-emitting events already processed before offset adjustment? - Or is duplication possible in this scenario and expected to be handled downstream? - Which component guarantees exactly-once semantics here? I may have misunderstood some part of the snapshot/binlog coordination logic, so any clarification would be greatly appreciated. Thanks!
