ChaoFang created FLINK-38450:
--------------------------------

             Summary: flink-cdc sink to iceberg table record duplication
                 Key: FLINK-38450
                 URL: https://issues.apache.org/jira/browse/FLINK-38450
             Project: Flink
          Issue Type: Bug
          Components: Flink CDC
    Affects Versions: 3.0.0
            Reporter: ChaoFang
             Fix For: cdc-3.5.0


When multiple modifications to the same primary key occur within a single 
checkpoint and are flushed separately, the records are written to different file 
batches (data / equality-delete / position-delete files) that share the same 
sequence number. During reads, the equality-delete files fail to filter out the 
stale rows, resulting in duplicate records.
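
For reference, a minimal sketch in plain Java (no Iceberg dependency; the file names and the FileEntry type are hypothetical) of the read-time rule involved: per the Iceberg spec, an equality delete only removes rows from data files whose data sequence number is strictly less than the delete file's sequence number, so files committed in the same snapshot do not filter each other.

{code:java}
// Hypothetical sketch: models the Iceberg rule that an equality delete applies
// only to data files with a strictly lower data sequence number.
public final class EqDeleteScopeDemo {

    record FileEntry(String path, long dataSequenceNumber) {}

    // True if rows in dataFile can be removed by eqDeleteFile.
    static boolean eqDeleteApplies(FileEntry dataFile, FileEntry eqDeleteFile) {
        return dataFile.dataSequenceNumber() < eqDeleteFile.dataSequenceNumber();
    }

    public static void main(String[] args) {
        // Both flushes are committed in the same snapshot, so all files share
        // one sequence number (7 is an arbitrary example).
        FileEntry w1Data = new FileEntry("w1-0001.parquet", 7);          // holds (id=1, name="a")
        FileEntry w2EqDelete = new FileEntry("w2-eq-delete.parquet", 7); // deletes id=1

        // Prints "false": the stale row in w1-0001.parquet is not filtered, so a
        // reader returns both (id=1, name="a") and (id=1, name="b").
        System.out.println(eqDeleteApplies(w1Data, w2EqDelete));
    }
}
{code}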

*Repro Steps* (within one checkpoint):
 # Receive insert event Record(id=1, name="a").
 # Receive a flush event (triggered by some other change).
 # Record(id=1, name="a") is flushed to w1-0001.parquet (with its delete files).
 # Receive update event Record(id=1, name="b").
 # A new writer is created.
 # Record(id=1, name="b") is flushed to w2-0001.parquet (with its delete files).
 # Receive the checkpoint commit event.
 # Commit the metadata of w1-0001.parquet and w2-0001.parquet (with their delete 
files) to the snapshot (see the sketch after these steps).
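
A hedged sketch of what step 8 amounts to on the Iceberg side, using Iceberg's RowDelta API (the data/delete file variables are hypothetical and their construction is omitted; this is not the actual flink-cdc committer code): every file added to one RowDelta commit receives the same data sequence number, which is why the second flush's equality delete cannot reach the first flush's data file.

{code:java}
import org.apache.iceberg.DataFile;
import org.apache.iceberg.DeleteFile;
import org.apache.iceberg.RowDelta;
import org.apache.iceberg.Table;

final class CheckpointCommitSketch {

    // Commits the results of both flushes in one snapshot. All files added here
    // get the same data sequence number, so the equality delete written by the
    // second writer does not apply to w1-0001.parquet (strictly-less-than rule),
    // leaving the stale row visible to readers.
    static void commitCheckpoint(Table table,
                                 DataFile w1Data, DeleteFile w1EqDelete,
                                 DataFile w2Data, DeleteFile w2EqDelete) {
        RowDelta rowDelta = table.newRowDelta();
        rowDelta.addRows(w1Data);        // w1-0001.parquet: (id=1, name="a")
        rowDelta.addDeletes(w1EqDelete);
        rowDelta.addRows(w2Data);        // w2-0001.parquet: (id=1, name="b")
        rowDelta.addDeletes(w2EqDelete);
        rowDelta.commit();               // one snapshot, one sequence number for all files
    }
}
{code}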

Because a new writer is created after the flush, the distribution of data records 
cannot be preserved (for example, hash-based routing of records to different 
files), so the update to the same key ends up in a different file from the 
previously flushed row.
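
To illustrate (a simplified, hypothetical sketch, not the actual flink-cdc or Iceberg writer code): within a single writer, an update to a key it has already written into its open data file can be turned into a position delete against that file, but a freshly created writer has no knowledge of previously flushed files and can only emit an equality delete for the old row, which the same-sequence-number rule above then fails to apply.

{code:java}
import java.util.HashMap;
import java.util.Map;

// Hypothetical delta-writer sketch showing why a NEW writer must fall back to an
// equality delete for a key that an earlier writer already flushed.
final class DeltaWriterSketch {

    // Positions of keys written by THIS writer into ITS currently open data file.
    private final Map<Long, Long> keyToPositionInOpenFile = new HashMap<>();
    private long nextPosition = 0;

    void insert(long key) {
        keyToPositionInOpenFile.put(key, nextPosition++);
        // ... append the row to the open data file ...
    }

    void upsert(long key) {
        Long pos = keyToPositionInOpenFile.get(key);
        if (pos != null) {
            // Old row lives in this writer's own open file: a position delete works.
            System.out.println("position delete at pos " + pos);
        } else {
            // Old row was flushed by a previous writer: only an equality delete is
            // possible, and it shares the sequence number of that earlier file.
            System.out.println("equality delete for key " + key);
        }
        insert(key);
    }

    public static void main(String[] args) {
        DeltaWriterSketch w1 = new DeltaWriterSketch();
        w1.insert(1L);                                  // Record(id=1, name="a"), then flushed

        DeltaWriterSketch w2 = new DeltaWriterSketch(); // new writer after the flush
        w2.upsert(1L);                                  // prints "equality delete for key 1"
    }
}
{code}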



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
