[PR] [FLINK-40038][pipeline-connector/mysql] Cache inferred row DataType in MySQL deserialization [flink-cdc]

via GitHub Wed, 01 Jul 2026 19:06:01 -0700


monologuist opened a new pull request, #4458:
URL: https://github.com/apache/flink-cdc/pull/4458


   ## What is changed
   
   This PR reduces incremental deserialization overhead in the MySQL pipeline 
connector by caching the inferred row `DataType` per table in 
`MySqlEventDeserializer`.
   
   Previously, the deserializer inferred the row `DataType` from the Debezium 
struct for every incoming record. Under hotspot UPDATE workloads, this repeated 
inference becomes a visible CPU hotspot in the incremental synchronization path.
   
   This PR makes the following changes:
   
   - cache inferred row `DataType` by `TableId` in `MySqlEventDeserializer`
   - invalidate the cache when schema change events are observed
   - normalize table identifiers consistently in case-insensitive mode so cache 
writes and invalidation use the same key
   - add a narrow protected helper in `DebeziumEventDeserializationSchema` so 
subclasses can reuse the existing converter logic with a precomputed `DataType`
   
   ## Why
   
   In a MySQL-to-Doris pipeline with large-table hotspot UPDATE traffic, we 
observed low incremental synchronization throughput and lag accumulation 
between upstream and downstream.
   
   Based on flame graph analysis, a noticeable portion of CPU time was spent in 
the MySQL pipeline deserialization path, especially on repeated schema/data 
type inference for the same table.
   
   [FLINK-35715](https://issues.apache.org/jira/browse/FLINK-35715) addressed 
the correctness issue caused by timestamp precision inference and 
`BinaryRecordData` layout mismatch. However, repeated schema inference is still 
present on the current master branch and remains a measurable CPU hotspot in 
MySQL pipeline deserialization.
   
   This PR focuses on that remaining performance overhead.
   
   The optimization relies on a natural assumption: for the same table, the row 
`DataType` remains stable until a schema change event arrives. Based on this 
assumption, the inferred row `DataType` can be safely reused across records and 
invalidated when schema changes are observed.
   
   ## Tests
   
   The following tests were added or updated:
   
   - verify the cache lifecycle across repeated records and schema change 
invalidation
   - verify cache isolation across different tables
   - verify case-insensitive table IDs are normalized consistently for cache 
reuse and invalidation
   
   In our internal verification, this optimization improved throughput by about 
18% in a test environment and about 19% in a production-like workload.
   
   ## Notes
   
   This PR focuses on reducing redundant row type inference in the MySQL 
pipeline deserialization path. It does not change changelog semantics or 
introduce new user-facing configuration.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

[PR] [FLINK-40038][pipeline-connector/mysql] Cache inferred row DataType in MySQL deserialization [flink-cdc]

Reply via email to