Ihor Mielientiev created FLINK-37065:
----------------------------------------
Summary: MySQL cdc can lose/skip data during recovering from the
checkpoint
Key: FLINK-37065
URL: https://issues.apache.org/jira/browse/FLINK-37065
Project: Flink
Issue Type: Bug
Affects Versions: cdc-3.2.1, cdc-3.2.0
Reporter: Ihor Mielientiev
During each pipeline start (e.g., failover or restart), the Flink CDC connector
retrieves the current GTID sets from the MySQL server and merges it with the
pipeline's current state. This merged GTID set is then sent back to the MySQL
server to indicate which transactions the Flink pipeline has already processed,
ensuring that the server doesn’t resend processed transactions.
Flink CDC MySQL Connector uses the
[fixRestoredGtidSet|https://github.com/apache/flink-cdc/blob/master/flink-cdc-connect/flink-cdc-source-connectors/flink-connector-mysql-cdc/src/main/java/io/debezium/connector/mysql/GtidUtils.java#L36]
method to merge the GTID sets from the server with the GTID sets from the
checkpoint. The method ensures that Flink will "tell" MySQL to skip over
transactions it has already processed, avoiding duplication. However, the
current implementation of this method doesn’t handle gaps caused by MySQL
parallel execution. For example, if the restored GTID set is 1-80:83-90:92-98
and the server GTID set is 1-100, the method will merge gaps together and
result will be 1-98, since it selects the highest gtid from checkpoint
So in case if the pipeline has been restarted during any “gaps”, Flink CDC
won’t process “gapped” transactions and will lose the data
--
This message was sent by Atlassian Jira
(v8.20.10#820010)