Cong Cheng created FLINK-39315:
----------------------------------
Summary: MySql cdc connector could get stuck in backfill binlog
reading after a failover within snapshot phase when the MySql table is being
continuously written
Key: FLINK-39315
URL: https://issues.apache.org/jira/browse/FLINK-39315
Project: Flink
Issue Type: Bug
Components: Flink CDC
Affects Versions: cdc-3.5.0, cdc-3.4.0
Reporter: Cong Cheng
h3. Summary
This issue is different from FLINK-39207. Even with FLINK-39207 fixed, the
MySQL CDC source can still hang during the snapshot backfill phase when
processing multiple snapshot splits sequentially while reusing the same
BinaryLogClient .
When `MySqlSourceReader` processes multiple snapshot splits sequentially
(reusing the same `BinaryLogClient` across splits), the job can get stuck and
hang indefinitely in `SnapshotSplitReader.pollWithBuffer()` during the snapshot
backfill phase, waiting for `BINLOG_END` while the queue remains empty.
h3. Root Cause Analysis
# `SnapshotSplitReader.pollWithBuffer()` keeps polling
`ChangeEventQueue.poll()` until it receives the `BINLOG_END` watermark event;
otherwise it will wait indefinitely.
# The MySQL CDC implementation reuses a single `BinaryLogClient` instance
across split executions (via reusing `StatefulTaskContext` /
`MySqlTaskContextImpl`).
# `StatefulTaskContext.configure()` rebuilds `ChangeEventQueue` /
`EventDispatcher` / `SignalEventDispatcher` for each split, so each split has a
different target queue/dispatcher.
# In `MySqlStreamingChangeEventSource.execute()`, each execution registers
multiple `BinaryLogClient` event/lifecycle listeners (e.g. the main event
listener, lifecycle listener, `onEvent`, debug listener), but the
implementation does not unregister these listeners when the execution finishes
(it only disconnects the client).
# Therefore, listeners from previous splits accumulate and remain active when
later splits start the backfill binlog reading.
# During backfill of a later split, binlog events will still trigger callbacks
of an old split’s listener/task. When the current binlog offset advances to a
point that satisfies the old split’s stop condition (e.g. `currentOffset >=
oldEndingOffset`, common under continuous writes), the old listener can:
## stop the shared `StoppableChangeEventSourceContext` (by calling
`stopChangeEventSource()`), causing the *current* split’s backfill loop to exit
prematurely; and/or
## dispatch `BINLOG_END` via the old `SignalEventDispatcher` into the old
`ChangeEventQueue` (because the old listener holds the old dispatcher/queue
created in its split’s `StatefulTaskContext.configure()`).
# As a result, the current `pollWithBuffer()` is polling the new queue and
never receives `BINLOG_END`, while the backfill thread has already stopped
without surfacing an exception, leading to a deadlock/hang.
h3. Steps to Reproduce
# Configure a MySQL CDC Source with scan.incremental.snapshot.chunk.size set
to a large value to ensure a snapshot split is time consuming to read;
# Keep continuous writes on the MySql source table so binlog offsets advance;
# Trigger a TaskManager failover while the job is in the snapshot phase;
# Observe that the job hangs after processing the first split.
A regression test is also included in the PR (e.g.,
SnapshotSplitReaderTest#testMultipleSplitsWithBackfill ). It reproduces the
hang on the buggy version if the fix code splits are commented out and passes
with the fix applied.
--
This message was sent by Atlassian Jira
(v8.20.10#820010)