Zhongmin Qiao created FLINK-35874:
-------------------------------------

             Summary: Check pureBinlogPhaseTables set before call getBinlogPosition method in BinlogSplitReader
                 Key: FLINK-35874
                 URL: https://issues.apache.org/jira/browse/FLINK-35874
             Project: Flink
          Issue Type: Improvement
          Components: Flink CDC
            Reporter: Zhongmin Qiao
         Attachments: image-2024-07-22-19-26-59-158.png, image-2024-07-22-19-27-19-366.png, image-2024-07-22-19-30-08-989.png, image-2024-07-22-19-36-20-481.png, image-2024-07-22-19-36-40-581.png, image-2024-07-22-19-37-35-542.png, image-2024-07-22-21-12-03-316.png

The method getBinlogPosition of RecordUtils, which is called by BinlogSplitReader.shouldEmit, is expensive. It iterates over the sourceOffset map of the SourceRecord and calls toString() on each value during the iteration. It then calls the putAll method of BinlogOffsetBuilder to copy all of those entries into the offsetMap, which involves another map traversal and further hash code computation.

Despite this cost, getBinlogPosition is currently called for every DataChangeRecord that is emitted, which reduces the data processing throughput of Flink CDC.

!image-2024-07-22-19-26-59-158.png|width=545,height=222!

!image-2024-07-22-19-27-19-366.png|width=545,height=119!

However, we can avoid most of these calls by performing the pureBinlogPhaseTables.contains(tableId) check from hasEnterPureBinlogPhase before calling getBinlogPosition. If the SourceRecord belongs to a table that is already in the pure binlog phase, we can return true directly without calling getBinlogPosition at all.

Diff:

!image-2024-07-22-21-12-03-316.png|width=548,height=236!
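
The reordering proposed in this issue can be sketched roughly as follows. This is a simplified illustration, not the actual Flink CDC code: the class PureBinlogPhaseCheckSketch, the String table ids, the single numeric "pos" offset field, and the offsetConversions counter are all hypothetical stand-ins. The real BinlogSplitReader works with TableId and BinlogOffset objects and handles more cases.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical, simplified stand-in for the relevant part of BinlogSplitReader.
class PureBinlogPhaseCheckSketch {
    // Tables already known to be in the pure binlog phase.
    final Set<String> pureBinlogPhaseTables = new HashSet<>();
    // Assumption: one numeric high watermark per table (the real code uses BinlogOffset).
    final Map<String, Long> maxSplitHighWatermark = new HashMap<>();
    // Counts how often the expensive conversion runs, for demonstration only.
    int offsetConversions = 0;

    // Stand-in for getBinlogPosition: copies the offset map and stringifies every value.
    Map<String, String> getBinlogPosition(Map<String, ?> sourceOffset) {
        offsetConversions++;
        Map<String, String> offsetMap = new HashMap<>();
        for (Map.Entry<String, ?> entry : sourceOffset.entrySet()) {
            Object value = entry.getValue();
            offsetMap.put(entry.getKey(), value == null ? null : value.toString());
        }
        return offsetMap;
    }

    boolean hasEnterPureBinlogPhase(String tableId, Map<String, ?> sourceOffset) {
        // Cheap set lookup first: skip the offset conversion for known tables.
        if (pureBinlogPhaseTables.contains(tableId)) {
            return true;
        }
        Long highWatermark = maxSplitHighWatermark.get(tableId);
        if (highWatermark == null) {
            return false;
        }
        // Only now pay for the conversion.
        long position = Long.parseLong(getBinlogPosition(sourceOffset).get("pos"));
        if (position >= highWatermark) {
            pureBinlogPhaseTables.add(tableId);
            return true;
        }
        return false;
    }

    public static void main(String[] args) {
        PureBinlogPhaseCheckSketch reader = new PureBinlogPhaseCheckSketch();
        reader.maxSplitHighWatermark.put("db.orders", 100L);
        Map<String, Object> offset = new HashMap<>();
        offset.put("file", "mysql-bin.000003");
        offset.put("pos", 150L);
        for (int i = 0; i < 1000; i++) {
            reader.hasEnterPureBinlogPhase("db.orders", offset);
        }
        // Only the first record needed the conversion.
        System.out.println(reader.offsetConversions); // prints 1
    }
}
```

In this sketch, once a table has been added to pureBinlogPhaseTables, every later record for that table costs a single hash set lookup instead of a full copy of the offset map.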