[ https://issues.apache.org/jira/browse/FLINK-36798?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Yanquan Lv updated FLINK-36798:
-------------------------------
    Fix Version/s: cdc-3.5.0
                       (was: cdc-3.4.0)

> Improve data processing speed during the phase from snapshot to incremental phase
> ---------------------------------------------------------------------------------
>
>                 Key: FLINK-36798
>                 URL: https://issues.apache.org/jira/browse/FLINK-36798
>             Project: Flink
>          Issue Type: Improvement
>          Components: Flink CDC
>    Affects Versions: cdc-3.1.0, cdc-3.2.0, cdc-3.1.1
>            Reporter: Yanquan Lv
>            Priority: Major
>             Fix For: cdc-3.5.0
>
>
> During the transition from the snapshot phase to the incremental phase, for
> each input record we need to compare against all finished splits and find the
> binlog offset to check whether we should emit the record. This lookup is
> `O(n)` in the number of finished splits, which makes it very time-consuming.
> We can improve data processing speed in the following ways:
> 1. For numeric fields, we can directly calculate which chunk a record belongs
> to based on the primary key and chunk size information. This lookup is `O(1)`.
> 2. For non-numeric fields, we can use binary search to find the split to
> which the record belongs. This lookup is `O(log n)`.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
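The two lookups proposed in the description can be sketched roughly as follows. This is only an illustration in Python, not the Flink CDC implementation: the function names, the assumption that numeric chunks are evenly sized starting at `min_key`, the representation of splits as a sorted list of lower bounds, and the "emit only if the record offset is after the split's high watermark" rule are all assumptions made for the example.

```python
from bisect import bisect_right


def numeric_chunk_index(key, min_key, chunk_size, num_chunks):
    """O(1) lookup. Assumes evenly sized chunks that start at min_key
    (hypothetical layout); out-of-range keys are clamped to the first
    or last chunk, which are treated as unbounded."""
    index = (key - min_key) // chunk_size
    return max(0, min(index, num_chunks - 1))


def sorted_chunk_index(key, boundaries):
    """O(log n) lookup. `boundaries` is the sorted list of lower bounds
    of every split except the first, so split i covers
    [boundaries[i-1], boundaries[i])."""
    return bisect_right(boundaries, key)


def should_emit(record_offset, split_index, high_watermarks):
    """Assumed rule: emit a record only if its binlog offset is after
    the high watermark recorded for the split it belongs to."""
    return record_offset > high_watermarks[split_index]


# Numeric primary key: 5 chunks of size 10 starting at 0.
idx = numeric_chunk_index(25, min_key=0, chunk_size=10, num_chunks=5)  # 2

# String primary key: 4 splits separated at "g", "n", "t".
sidx = sorted_chunk_index("m", ["g", "n", "t"])  # 1

# Hypothetical per-split high watermarks (binlog offsets as integers).
watermarks = [100, 120, 90, 150]
emit = should_emit(record_offset=110, split_index=sidx, high_watermarks=watermarks)
```

Both functions replace a linear scan over all finished splits with a direct index, after which the offset comparison is a single list access.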