Yanquan Lv created FLINK-36798:
----------------------------------

Summary: Improve data processing speed during the phase from snapshot to incremental phase
Key: FLINK-36798
URL: https://issues.apache.org/jira/browse/FLINK-36798
Project: Flink
Issue Type: Improvement
Components: Flink CDC
Affects Versions: cdc-3.1.1, cdc-3.2.0, cdc-3.1.0
Reporter: Yanquan Lv
Fix For: cdc-3.3.0

During the transition from the snapshot phase to the incremental phase, each input record has to be compared against all finished splits to find the corresponding binlog offset and decide whether the record should be emitted. This lookup is O(n) in the number of finished splits for every record, which makes it very time-consuming.

We can improve the data processing speed in the following ways:

1. For numeric fields, we can directly calculate which chunk a record belongs to from its primary key and the chunk size information. This lookup is O(1).
2. For non-numeric fields, we can use binary search to find the chunk to which the record belongs. This lookup is O(log n).
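
The two lookups proposed in this issue could look roughly like the sketch below. It is illustrative only and is not taken from the Flink CDC code base: the class `ChunkLookup`, both method names, and the boundary layout are hypothetical. It assumes that numeric chunks are evenly sized starting at a known minimum key, that the first and last chunks are unbounded, that non-numeric chunk boundaries are distinct and already sorted, and that Java's natural string ordering matches the ordering used when the chunks were split (which may not hold for every database collation).

```java
import java.util.Arrays;

// Hypothetical helper, not part of Flink CDC.
final class ChunkLookup {

    private ChunkLookup() {}

    /**
     * O(1) lookup for numeric keys.
     * Assumes chunk i covers [min + i * chunkSize, min + (i + 1) * chunkSize),
     * except that chunk 0 is unbounded below and the last chunk is unbounded above.
     * Assumes key - min does not overflow a long.
     */
    static int numericChunkIndex(long key, long min, long chunkSize, int chunkCount) {
        if (key < min) {
            return 0; // keys below the minimum fall into the first chunk
        }
        long index = (key - min) / chunkSize;
        // Keys beyond the last boundary fall into the last chunk.
        return (int) Math.min(index, (long) chunkCount - 1);
    }

    /**
     * O(log n) lookup for non-numeric keys.
     * boundaries holds the sorted, distinct split points between chunks:
     * chunk 0 covers keys below boundaries[0], chunk i covers
     * [boundaries[i - 1], boundaries[i]), and the last chunk covers
     * keys at or above the final boundary.
     */
    static int binarySearchChunkIndex(String[] boundaries, String key) {
        int result = Arrays.binarySearch(boundaries, key);
        if (result >= 0) {
            // The key equals a boundary, which is the inclusive start of the next chunk.
            return result + 1;
        }
        // Not found: binarySearch returns -(insertionPoint) - 1, and the
        // insertion point equals the number of boundaries smaller than the key.
        return -(result + 1);
    }
}
```

For example, with `min = 0`, `chunkSize = 10` and five chunks, `ChunkLookup.numericChunkIndex(25, 0, 10, 5)` selects chunk 2. With boundaries `{"g", "n", "t"}`, `ChunkLookup.binarySearchChunkIndex(boundaries, "h")` selects chunk 1. Once the chunk index is known, the finished split for that chunk and its binlog offset can be fetched directly instead of scanning every finished split.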