[ 
https://issues.apache.org/jira/browse/FLINK-39315?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated FLINK-39315:
-----------------------------------
    Labels: pull-request-available  (was: )

> MySql cdc connector could get stuck in backfill binlog reading after a 
> failover within snapshot phase when the MySql table is being continuously 
> written
> --------------------------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: FLINK-39315
>                 URL: https://issues.apache.org/jira/browse/FLINK-39315
>             Project: Flink
>          Issue Type: Bug
>          Components: Flink CDC
>    Affects Versions: cdc-3.4.0, cdc-3.5.0
>            Reporter: Cong Cheng
>            Priority: Major
>              Labels: pull-request-available
>
> h3. Summary
> This issue is different from FLINK-39207. Even with FLINK-39207 fixed, the 
> MySQL CDC source can still hang during the snapshot backfill phase when 
> processing multiple snapshot splits sequentially while reusing the same 
> BinaryLogClient .
> When `MySqlSourceReader` processes multiple snapshot splits sequentially 
> (reusing the same `BinaryLogClient` across splits), the job can get stuck and 
> hang indefinitely in `SnapshotSplitReader.pollWithBuffer()` during the 
> snapshot backfill phase, waiting for `BINLOG_END` while the queue remains 
> empty.
> h3. Root Cause Analysis
>  # `SnapshotSplitReader.pollWithBuffer()` keeps polling 
> `ChangeEventQueue.poll()` until it receives the `BINLOG_END` watermark event; 
> otherwise it will wait indefinitely.
>  # The MySQL CDC implementation reuses a single `BinaryLogClient` instance 
> across split executions (via reusing `StatefulTaskContext` / 
> `MySqlTaskContextImpl`).
>  # `StatefulTaskContext.configure()` rebuilds `ChangeEventQueue` / 
> `EventDispatcher` / `SignalEventDispatcher` for each split, so each split has 
> a different target queue/dispatcher.
>  # In `MySqlStreamingChangeEventSource.execute()`, each execution registers 
> multiple `BinaryLogClient` event/lifecycle listeners (e.g. the main event 
> listener, lifecycle listener, `onEvent`, debug listener), but the 
> implementation does not unregister these listeners when the execution 
> finishes (it only disconnects the client).
>  # Therefore, listeners from previous splits accumulate and remain active 
> when later splits start the backfill binlog reading.
>  # During backfill of a later split, binlog events will still trigger 
> callbacks of an old split’s listener/task. When the current binlog offset 
> advances to a point that satisfies the old split’s stop condition (e.g. 
> `currentOffset >= oldEndingOffset`, common under continuous writes), the old 
> listener can:
>  ## stop the shared `StoppableChangeEventSourceContext` (by calling 
> `stopChangeEventSource()`), causing the *current* split’s backfill loop to 
> exit prematurely; and/or
>  ## dispatch `BINLOG_END` via the old `SignalEventDispatcher` into the old 
> `ChangeEventQueue` (because the old listener holds the old dispatcher/queue 
> created in its split’s `StatefulTaskContext.configure()`).
>  # As a result, the current `pollWithBuffer()` is polling the new queue and 
> never receives `BINLOG_END`, while the backfill thread has already stopped 
> without surfacing an exception, leading to a deadlock/hang.
> h3. Steps to Reproduce
>  # Configure a MySQL CDC Source with scan.incremental.snapshot.chunk.size set 
> to a large value to ensure a snapshot split is time consuming to read;
>  # Keep continuous writes on the MySql source table so binlog offsets advance;
>  # Trigger a TaskManager failover while the job is in the snapshot phase;
>  # Observe that the job hangs after processing the first split.
> A regression test is also included in the PR (e.g., 
> SnapshotSplitReaderTest#testMultipleSplitsWithBackfill ). It reproduces the 
> hang on the buggy version if the fix code splits are commented out and passes 
> with the fix applied.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

Reply via email to