Yanquan Lv created FLINK-36798:
----------------------------------

             Summary: Improve data processing speed during the phase from 
snapshot to incremental phase
                 Key: FLINK-36798
                 URL: https://issues.apache.org/jira/browse/FLINK-36798
             Project: Flink
          Issue Type: Improvement
          Components: Flink CDC
    Affects Versions: cdc-3.1.1, cdc-3.2.0, cdc-3.1.0
            Reporter: Yanquan Lv
             Fix For: cdc-3.3.0


During the transition from the snapshot phase to the incremental phase, each 
input record must be compared against all finished snapshot splits to find the 
binlog offset that determines whether the record should be emitted. This lookup 
is O(n) in the number of finished splits, which makes it very time-consuming.

We can improve data processing speed in the following ways:
1. For numeric fields, we can directly calculate which chunk a record belongs 
to from its primary key and the chunk size. This complexity is O(1).
2. For non-numeric fields, we can use binary search to find the split to which 
the record belongs. This complexity is O(log n).
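
The two lookups above can be sketched roughly as follows. This is an 
illustration only, not the Flink CDC implementation: the function names, the 
list of sorted split lower bounds, the evenly sized numeric chunks, and the 
"emit if the offset is past the split's high watermark" rule are all 
assumptions made for the example. A linear scan is included as the baseline.

```python
from bisect import bisect_right


def find_split_linear(key, split_starts):
    """Baseline: check every finished split, O(n)."""
    idx = 0
    for i, start in enumerate(split_starts):
        if key >= start:
            idx = i
    return idx


def find_split_numeric(key, min_key, chunk_size, num_splits):
    """Numeric keys, assuming evenly sized chunks: direct arithmetic, O(1)."""
    idx = (key - min_key) // chunk_size
    # Clamp so keys outside the sampled range land in the first or last split.
    return max(0, min(idx, num_splits - 1))


def find_split_binary(key, split_starts):
    """Any ordered key type: binary search over sorted lower bounds, O(log n)."""
    return max(bisect_right(split_starts, key) - 1, 0)


def should_emit(offset, split_idx, high_watermarks):
    """Assumed rule: emit only changes logged after the split's snapshot ended."""
    return offset > high_watermarks[split_idx]


# Hypothetical data: four splits keyed by string, one high watermark per split.
starts = ["a", "g", "n", "t"]
high_watermarks = [120, 95, 140, 110]

idx = find_split_binary("h", starts)            # split 1 covers ["g", "n")
emit = should_emit(100, idx, high_watermarks)   # 100 > 95, so the record is emitted
```

Both fast paths return the same split index as the linear scan. The saving 
comes only from how the split is located, so the emit decision itself is 
unchanged.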



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
