Cong Cheng created FLINK-39315:
----------------------------------

             Summary: MySql cdc connector could get stuck in backfill binlog 
reading after a failover within snapshot phase when the MySql table is being 
continuously written
                 Key: FLINK-39315
                 URL: https://issues.apache.org/jira/browse/FLINK-39315
             Project: Flink
          Issue Type: Bug
          Components: Flink CDC
    Affects Versions: cdc-3.5.0, cdc-3.4.0
            Reporter: Cong Cheng


h3. Summary

This issue is different from FLINK-39207. Even with FLINK-39207 fixed, the 
MySQL CDC source can still hang during the snapshot backfill phase.
When `MySqlSourceReader` processes multiple snapshot splits sequentially 
(reusing the same `BinaryLogClient` across splits), the job can hang 
indefinitely in `SnapshotSplitReader.pollWithBuffer()` during the snapshot 
backfill phase, waiting for `BINLOG_END` while the queue remains empty.
h3. Root Cause Analysis
 # `SnapshotSplitReader.pollWithBuffer()` keeps polling 
`ChangeEventQueue.poll()` until it receives the `BINLOG_END` watermark event; 
otherwise it will wait indefinitely.
 # The MySQL CDC implementation reuses a single `BinaryLogClient` instance 
across split executions (via reusing `StatefulTaskContext` / 
`MySqlTaskContextImpl`).
 # `StatefulTaskContext.configure()` rebuilds `ChangeEventQueue` / 
`EventDispatcher` / `SignalEventDispatcher` for each split, so each split has a 
different target queue/dispatcher.
 # In `MySqlStreamingChangeEventSource.execute()`, each execution registers 
multiple `BinaryLogClient` event/lifecycle listeners (e.g. the main event 
listener, lifecycle listener, `onEvent`, debug listener), but the 
implementation does not unregister these listeners when the execution finishes 
(it only disconnects the client).
 # Therefore, listeners from previous splits accumulate and remain active when 
later splits start the backfill binlog reading.
 # During backfill of a later split, binlog events will still trigger callbacks 
of an old split’s listener/task. When the current binlog offset advances to a 
point that satisfies the old split’s stop condition (e.g. `currentOffset >= 
oldEndingOffset`, common under continuous writes), the old listener can:
 ## stop the shared `StoppableChangeEventSourceContext` (by calling 
`stopChangeEventSource()`), causing the *current* split’s backfill loop to exit 
prematurely; and/or
 ## dispatch `BINLOG_END` via the old `SignalEventDispatcher` into the old 
`ChangeEventQueue` (because the old listener holds the old dispatcher/queue 
created in its split’s `StatefulTaskContext.configure()`).
 # As a result, the current `pollWithBuffer()` is polling the new queue and 
never receives `BINLOG_END`, while the backfill thread has already stopped 
without surfacing an exception, leading to a deadlock/hang.
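
The sequence above can be illustrated with a minimal, self-contained sketch. It does not use the real Flink CDC or `BinaryLogClient` classes. `SharedClient`, `StopContext`, `SplitTask` and `runBackfill` are hypothetical stand-ins for the shared client, the shared `StoppableChangeEventSourceContext`, the per-split queue/listener and one backfill execution. The `unregisterWhenDone` flag only shows that removing the stale listener avoids the symptom in this model; it is an assumption and not necessarily how the PR fixes the issue.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;
import java.util.function.LongConsumer;

public class ListenerLeakDemo {

    /** Hypothetical stand-in for the reused binlog client (not the real API). */
    static final class SharedClient {
        private final List<LongConsumer> listeners = new ArrayList<>();

        void register(LongConsumer listener) { listeners.add(listener); }

        void unregister(LongConsumer listener) { listeners.remove(listener); }

        void emit(long offset) {
            // Every registered listener sees every event, including stale ones.
            for (LongConsumer listener : new ArrayList<>(listeners)) {
                listener.accept(offset);
            }
        }
    }

    /** Hypothetical stand-in for the stop context shared across splits. */
    static final class StopContext {
        boolean running = true;
    }

    /** Per-split state: its own queue plus a listener bound to that queue. */
    static final class SplitTask implements LongConsumer {
        final Queue<String> queue = new ArrayDeque<>();
        final long endingOffset;
        final StopContext context;

        SplitTask(long endingOffset, StopContext context) {
            this.endingOffset = endingOffset;
            this.context = context;
        }

        @Override
        public void accept(long offset) {
            if (offset >= endingOffset) {
                queue.add("BINLOG_END");   // goes to THIS split's queue only
                context.running = false;   // stops whichever backfill is running
            }
        }
    }

    /** Runs one simulated backfill and returns the split's task state. */
    static SplitTask runBackfill(SharedClient client, StopContext context,
                                 long startOffset, long endingOffset,
                                 boolean unregisterWhenDone) {
        SplitTask task = new SplitTask(endingOffset, context);
        context.running = true;
        client.register(task);
        long offset = startOffset;
        while (context.running) {
            client.emit(offset++);
        }
        if (unregisterWhenDone) {
            client.unregister(task);
        }
        return task;
    }

    /** True if the second split's own queue received BINLOG_END. */
    static boolean secondSplitSeesBinlogEnd(boolean unregisterWhenDone) {
        SharedClient client = new SharedClient();
        StopContext context = new StopContext();
        runBackfill(client, context, 0, 3, unregisterWhenDone);
        // Continuous writes: the second split starts past the first split's end.
        SplitTask second = runBackfill(client, context, 4, 10, unregisterWhenDone);
        return second.queue.contains("BINLOG_END");
    }

    public static void main(String[] args) {
        System.out.println("leaked listener:       " + secondSplitSeesBinlogEnd(false));
        System.out.println("listener unregistered: " + secondSplitSeesBinlogEnd(true));
    }
}
```

In this model the leaked-listener run reports `false`: the first split's listener fires on the very first event of the second backfill, stops the shared context, and writes `BINLOG_END` into the old queue, so a reader polling the new queue would wait forever. With the listener removed after each execution, the second split reaches its own ending offset and the run reports `true`.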

h3. Steps to Reproduce
 # Configure a MySQL CDC source with `scan.incremental.snapshot.chunk.size` set 
to a large value so that each snapshot split takes a long time to read;
 # Keep writing continuously to the MySQL source table so binlog offsets advance;
 # Trigger a TaskManager failover while the job is in the snapshot phase;
 # Observe that the job hangs after processing the first split.

A regression test is also included in the PR (e.g., 
`SnapshotSplitReaderTest#testMultipleSplitsWithBackfill`). It reproduces the 
hang on the buggy version (that is, when the fix code is commented out) and 
passes with the fix applied.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
