ChaoFang created FLINK-38450:
--------------------------------
Summary: flink-cdc sink to iceberg table record duplication
Key: FLINK-38450
URL: https://issues.apache.org/jira/browse/FLINK-38450
Project: Flink
Issue Type: Bug
Components: Flink CDC
Affects Versions: 3.0.0
Reporter: ChaoFang
Fix For: cdc-3.5.0
When multiple modifications to the same PrimaryId occur within a single
checkpoint and are flushed separately, the records are written to different file
batches (data/eqDelete/posDelete files) that all share the same seq_number. On
read, the eqDeleteFile then fails to filter out the stale data, resulting in
duplicate records.
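For context, a minimal illustration of the Iceberg rule involved here: per the Iceberg spec, an equality delete file only applies to data files whose data sequence number is strictly less than the delete's sequence number, so when both flushes land in the same snapshot (same sequence number) the delete cannot remove the stale row. The helper and the file references below are purely illustrative, not Flink CDC or Iceberg API code:
{code:java}
// Simplified view of Iceberg's equality-delete applicability rule: an equality
// delete filters rows only from data files whose data sequence number is
// strictly LESS than the delete file's sequence number.
public final class EqDeleteApplicability {

    /** Returns true if the equality delete can filter rows of the data file. */
    static boolean eqDeleteApplies(long dataFileSeqNumber, long eqDeleteSeqNumber) {
        return dataFileSeqNumber < eqDeleteSeqNumber; // strict inequality
    }

    public static void main(String[] args) {
        long seq = 5L; // both flushes committed in the same snapshot -> same sequence number

        // w1-0001.parquet holds the stale Record(id=1, name="a"); w2's equality
        // delete for id=1 carries the same sequence number, so it does NOT
        // filter the stale row -> duplicate records on read.
        System.out.println("eq-delete filters w1 data? " + eqDeleteApplies(seq, seq)); // false

        // If the second flush had received a higher sequence number, the stale
        // row would be filtered as expected.
        System.out.println("eq-delete filters older data? " + eqDeleteApplies(seq, seq + 1)); // true
    }
}
{code}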
*Repro Steps* (all within 1 checkpoint):
# Got Record(id=1,name="a") insert event.
# Got flush event (triggered by some other change).
# Record(id=1,name="a") is flushed to w1-0001.parquet (with its delete files).
# Got Record(id=1,name="b") update event.
# A new writer is created.
# Record(id=1,name="b") is flushed to w2-0001.parquet (with its delete files).
# Got checkpoint commit event.
# The metadata of w1-0001.parquet and w2-0001.parquet (with their delete files) is
committed to the snapshot; a sketch of this commit follows below.
Since a new writer is created, the distribution of data records cannot stay
consistent (for example, hash-based distribution of data records to different files).
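As a rough sketch (not the actual Flink CDC committer code), the checkpoint commit is effectively equivalent to adding both writers' data and delete files in a single Iceberg RowDelta, which gives every file the same data sequence number. The variable names (w1Data, w1Deletes, ...) are placeholders for the files produced in steps 3 and 6:
{code:java}
import org.apache.iceberg.DataFile;
import org.apache.iceberg.DeleteFile;
import org.apache.iceberg.RowDelta;
import org.apache.iceberg.Table;

final class SingleCheckpointCommitSketch {

    static void commitBothFlushes(Table table,
                                  DataFile w1Data, DeleteFile[] w1Deletes,
                                  DataFile w2Data, DeleteFile[] w2Deletes) {
        RowDelta rowDelta = table.newRowDelta();

        // Files from the first flush (contain the stale Record(id=1, name="a")).
        rowDelta.addRows(w1Data);
        for (DeleteFile d : w1Deletes) {
            rowDelta.addDeletes(d);
        }

        // Files from the second flush (contain Record(id=1, name="b") plus the
        // equality delete meant to drop the stale row in w1-0001.parquet).
        rowDelta.addRows(w2Data);
        for (DeleteFile d : w2Deletes) {
            rowDelta.addDeletes(d);
        }

        // One snapshot, one data sequence number for all files above, so w2's
        // equality delete cannot filter w1's data file and reads return both rows.
        rowDelta.commit();
    }
}
{code}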