[ https://issues.apache.org/jira/browse/FLINK-35849?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Hongshun Wang updated FLINK-35849: ---------------------------------- Summary: [flink-cdc] Use expose_snapshot to read snapshot splits of postgres cdc connector. (was: [flink-cdc] Use expose_snapshot to read snapshot of postgres cdc connector.) > [flink-cdc] Use expose_snapshot to read snapshot splits of postgres cdc > connector. > ---------------------------------------------------------------------------------- > > Key: FLINK-35849 > URL: https://issues.apache.org/jira/browse/FLINK-35849 > Project: Flink > Issue Type: New Feature > Components: Flink CDC > Affects Versions: cdc-3.1.1 > Reporter: Hongshun Wang > Priority: Major > Fix For: cdc-3.3.0 > > > In current postgres cdc connector, we use incremental framework to read > data[1], which include the following step: > # create a global slot in case that the wal log be recycle. > # Enumerator split the table into multiple chunks(named "snapshot split" in > cdc), than assigned this snapshot splits to the readers. > # The read read the snapshot data of the snapshot split and backfill log. > Each reader need a temporary slot to read log. > # when all snapshot snapshots are finished, enumerator will send a stream > split to reader. The one reader will read log. > > However, read backfill log will also increase burden in source database. For > example, the Postgres cdc connector will establish many logical replication > connections to the Postgres database, which can easily reach the > max_sender_num or max_slot_number limit. Assuming there are 10 Postgres cdc > sources and each runs 4 parallel processes, a total of 10*(4+1) = 50 > replication connections will be created.In many situations, the sink > databases provides idempotence. Therefore, We can also support at-least-once > semantics by skipping the backfill period, which will reduce budget on the > source databases. Users can choose between at-least-once or exactly-once > based on their demands.[2] > > The two methods make a tradeoff between semantics and performance. Is there > any other method to do well in both? > It seems expose_snapshot[3] can do both. When creating global slot, we can > save the the snapshot name, and search it in snapshot split reading(thus no > need to read backfill log). Then we just read the wal-log based on global > slot. It can also provide exactly-once semantics. > And expose_snapshot is also a default behavior when create a new replication > slot, thus will not occur other side effects . > > > > > > [1] [https://github.com/apache/flink-cdc/pull/2216] > [2][https://github.com/apache/flink-cdc/issues/2553] > [3] [https://www.postgresql.org/docs/14/protocol-replication.html] > -- This message was sent by Atlassian Jira (v8.20.10#820010)