Yanquan Lv created FLINK-36798:
----------------------------------
Summary: Improve data processing speed during the phase from
snapshot to incremental phase
Key: FLINK-36798
URL: https://issues.apache.org/jira/browse/FLINK-36798
Project: Flink
Issue Type: Improvement
Components: Flink CDC
Affects Versions: cdc-3.1.1, cdc-3.2.0, cdc-3.1.0
Reporter: Yanquan Lv
Fix For: cdc-3.3.0
During the transition from the snapshot phase to the incremental phase, each
input record must be compared against all finished splits to find the binlog
offset that decides whether the record should be emitted. This lookup is O(n)
in the number of finished splits for every record, which is very time-consuming.
We can improve data processing speed in the following ways:
1. For numeric fields, we can directly calculate which chunk a record belongs
to from the primary key and the chunk size. This is O(1).
2. For non-numeric fields, we can use binary search to find the chunk to which
the record belongs. This is O(log n).
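
The two lookups above can be sketched roughly as follows. This is only an
illustration, not the Flink CDC implementation: the class ChunkLookup, its
method names and its parameters are hypothetical. It assumes numeric chunks are
evenly sized starting at a known minimum key, and that the non-numeric chunk
start boundaries are kept in a sorted list whose ordering matches the ordering
used when the table was split.

```java
import java.util.List;

// Hypothetical helper, not part of Flink CDC.
final class ChunkLookup {

    private ChunkLookup() {}

    /**
     * O(1) lookup for numeric keys.
     * Assumption: chunk i covers [min + i * chunkSize, min + (i + 1) * chunkSize),
     * the first chunk is unbounded below and the last chunk is unbounded above.
     * Also assumes (key - min) does not overflow a long.
     */
    static int numericChunkIndex(long key, long min, long chunkSize, int chunkCount) {
        if (key < min) {
            return 0; // below the first boundary: first chunk
        }
        long index = (key - min) / chunkSize;
        // Keys beyond the last boundary fall into the last chunk.
        return (int) Math.min(index, (long) chunkCount - 1);
    }

    /**
     * O(log n) lookup for non-numeric keys.
     * Assumption: chunkStarts is non-empty and sorted ascending, chunk i covers
     * [chunkStarts[i], chunkStarts[i + 1]), and keys below chunkStarts[0]
     * belong to chunk 0.
     * Returns the index of the last chunk whose start is <= key.
     */
    static <T extends Comparable<T>> int searchChunkIndex(T key, List<T> chunkStarts) {
        int lo = 0;
        int hi = chunkStarts.size() - 1;
        int result = 0;
        while (lo <= hi) {
            int mid = (lo + hi) >>> 1;
            if (chunkStarts.get(mid).compareTo(key) <= 0) {
                result = mid;   // candidate chunk, look for a later one
                lo = mid + 1;
            } else {
                hi = mid - 1;   // this chunk starts after the key
            }
        }
        return result;
    }
}
```

Once the chunk index is known, the record only needs to be compared with that
one chunk's binlog offset instead of with every finished split. Note that for
string keys, Java's compareTo may not order values the same way as the
database collation, so a real implementation would need a matching comparator.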