[ 
https://issues.apache.org/jira/browse/FLINK-36798?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yanquan Lv updated FLINK-36798:
-------------------------------
    Fix Version/s: cdc-3.5.0
                       (was: cdc-3.4.0)

> Improve data processing speed during the phase from snapshot to incremental 
> phase
> ---------------------------------------------------------------------------------
>
>                 Key: FLINK-36798
>                 URL: https://issues.apache.org/jira/browse/FLINK-36798
>             Project: Flink
>          Issue Type: Improvement
>          Components: Flink CDC
>    Affects Versions: cdc-3.1.0, cdc-3.2.0, cdc-3.1.1
>            Reporter: Yanquan Lv
>            Priority: Major
>             Fix For: cdc-3.5.0
>
>
> During the transition from the snapshot phase to the incremental phase, each
> input record must be compared against all finished snapshot splits to find
> the matching binlog offset and decide whether the record should be emitted.
> This lookup is `O(n)` per record in the number of finished splits, which
> makes it very time-consuming.
> We can improve data processing speed in the following ways:
> 1. For numeric fields, we can directly calculate which chunk a record belongs
> to from its primary key and the chunk size. This lookup is `O(1)`.
> 2. For non-numeric fields, we can use binary search to find the chunk to
> which the record belongs. This lookup is `O(log n)`.
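
The two lookups proposed above could be sketched roughly as follows. This is a minimal illustration, not the Flink CDC implementation: the class `ChunkLookup`, both method names, and all parameters are hypothetical. It assumes numeric chunks are evenly sized starting from the table's minimum key, that the first and last chunks are unbounded, and that for non-numeric keys the chunk lower bounds are kept in a list sorted in the same order used when the table was split.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper; names and layout are assumptions for illustration only.
final class ChunkLookup {

    // O(1) lookup for numeric keys.
    // Assumes chunk i covers [min + i * chunkSize, min + (i + 1) * chunkSize),
    // except that chunk 0 has no lower bound and the last chunk has no upper bound.
    // Assumes key - min does not overflow a long.
    static int numericChunkIndex(long key, long min, long chunkSize, int chunkCount) {
        if (key <= min) {
            return 0; // everything at or below the minimum falls into the first chunk
        }
        long index = (key - min) / chunkSize;
        return (int) Math.min(index, chunkCount - 1L); // clamp into the last chunk
    }

    // O(log n) lookup for non-numeric keys.
    // chunkStarts holds the inclusive lower bounds of chunks 1..n, sorted ascending;
    // chunk 0 has no lower bound. The result is the number of starts <= key,
    // which is exactly the index of the chunk that contains the key.
    static <T extends Comparable<T>> int searchChunkIndex(List<T> chunkStarts, T key) {
        int lo = 0;
        int hi = chunkStarts.size();
        while (lo < hi) {
            int mid = (lo + hi) >>> 1;
            if (chunkStarts.get(mid).compareTo(key) <= 0) {
                lo = mid + 1; // key is at or past this boundary, look to the right
            } else {
                hi = mid;     // key is before this boundary, look to the left
            }
        }
        return lo;
    }

    public static void main(String[] args) {
        // Five numeric chunks of width 100 starting at key 0.
        System.out.println(numericChunkIndex(250L, 0L, 100L, 5));
        // Four string chunks split at "g", "n" and "t".
        List<String> starts = Arrays.asList("g", "n", "t");
        System.out.println(searchChunkIndex(starts, "m"));
    }
}
```

Once the chunk index is known, the finished split and its binlog offset can be fetched by index instead of scanning every split. Note that the binary search is only valid if the in-memory comparison orders keys the same way the database did when the chunk boundaries were computed, which may not hold for every string collation.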



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
