[ https://issues.apache.org/jira/browse/FLINK-24607?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Konstantin Knauf updated FLINK-24607:
-------------------------------------
    Component/s:     (was: Runtime / Coordination)

> SourceCoordinator may miss to close SplitEnumerator when failover frequently
> ----------------------------------------------------------------------------
>
>                 Key: FLINK-24607
>                 URL: https://issues.apache.org/jira/browse/FLINK-24607
>             Project: Flink
>          Issue Type: Bug
>          Components: Connectors / Common
>    Affects Versions: 1.13.3
>            Reporter: Jark Wu
>            Assignee: Jiangjie Qin
>            Priority: Critical
>              Labels: pull-request-available
>             Fix For: 1.15.0, 1.14.4
>
>         Attachments: jobmanager.log
>
>
> We are seeing a connection leak when using the mysql-cdc [1] source. The JM
> log shows that many enumerators are never closed.
> {code}
> ➜ test123 cat jobmanager.log | grep "SourceCoordinator \[\] - Restoring SplitEnumerator" | wc -l
> 264
> ➜ test123 cat jobmanager.log | grep "SourceCoordinator \[\] - Starting split enumerator" | wc -l
> 264
> ➜ test123 cat jobmanager.log | grep "MySqlSourceEnumerator \[\] - Starting enumerator" | wc -l
> 263
> ➜ test123 cat jobmanager.log | grep "SourceCoordinator \[\] - Closing SourceCoordinator" | wc -l
> 264
> ➜ test123 cat jobmanager.log | grep "MySqlSourceEnumerator \[\] - Closing enumerator" | wc -l
> 195
> {code}
> We added the "Closing enumerator" log line to {{MySqlSourceEnumerator#close()}}
> and the "Starting enumerator" log line to {{MySqlSourceEnumerator#start()}}.
> The counts above show that the SourceCoordinator is restored and closed 264
> times, and the split enumerator is started 264 times (263 according to the
> {{MySqlSourceEnumerator}} log) but closed only 195 times. It seems that
> {{SourceCoordinator}} fails to close the enumerator when the job fails over
> frequently.
> I also went through the code of {{SourceCoordinator}} and found a suspicious
> point:
> The {{started}} flag and the {{enumerator}} field are assigned in the main
> thread, whereas {{SourceCoordinator#close()}} is executed asynchronously by
> {{DeferrableCoordinator#closeAsync}}. That means the close method reads the
> {{started}} and {{enumerator}} variables from a different thread. Is there a
> concurrency problem here that may lead to a stale read, so that the
> {{enumerator}} is never closed?
> I'm still not sure, because the problem is hard to reproduce locally and we
> can't deploy a custom Flink version to the production environment.
> [1]: https://github.com/ververica/flink-cdc-connectors
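
The suspicion described above (state written on one thread, read by an asynchronous close on another) can be illustrated with a minimal Java sketch. This is not Flink's code and not necessarily the fix that was applied for this issue. SketchCoordinator, CountingEnumerator and FailoverDemo are hypothetical names made up for the illustration. The sketch assumes that confining both the writes and the reads of the started flag and the enumerator field to a single coordinator thread is one way to rule out a stale read, so every started enumerator is also closed.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical stand-in for a split enumerator; it only counts lifecycle calls. */
class CountingEnumerator implements AutoCloseable {
    static final AtomicInteger STARTED = new AtomicInteger();
    static final AtomicInteger CLOSED = new AtomicInteger();

    void start() {
        STARTED.incrementAndGet();
    }

    @Override
    public void close() {
        CLOSED.incrementAndGet();
    }
}

/** Hypothetical coordinator: 'started' and 'enumerator' are touched on one thread only. */
class SketchCoordinator {
    private final ExecutorService coordinatorThread = Executors.newSingleThreadExecutor();
    private CountingEnumerator enumerator; // confined to coordinatorThread
    private boolean started;               // confined to coordinatorThread

    void start() {
        // The writes happen on the coordinator thread, not on the caller's thread.
        coordinatorThread.execute(() -> {
            enumerator = new CountingEnumerator();
            enumerator.start();
            started = true;
        });
    }

    void close() {
        // The reads happen on the same thread; the single-threaded executor runs
        // tasks in submission order, so this task sees the writes made by start().
        coordinatorThread.execute(() -> {
            if (started && enumerator != null) {
                enumerator.close();
            }
        });
        coordinatorThread.shutdown(); // already submitted tasks still run
        try {
            coordinatorThread.awaitTermination(10, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while closing", e);
        }
    }
}

class FailoverDemo {
    /** Simulates repeated failovers and returns {started, closed} counts. */
    static int[] run(int failovers) {
        CountingEnumerator.STARTED.set(0);
        CountingEnumerator.CLOSED.set(0);
        for (int i = 0; i < failovers; i++) {
            SketchCoordinator coordinator = new SketchCoordinator();
            coordinator.start();
            coordinator.close(); // close immediately, as in a rapid failover
        }
        return new int[] {CountingEnumerator.STARTED.get(), CountingEnumerator.CLOSED.get()};
    }

    public static void main(String[] args) {
        int[] counts = run(264);
        System.out.println("started=" + counts[0] + " closed=" + counts[1]);
    }
}
```

In this sketch the two counters always match, because the close task is queued behind the start task on the same thread. Whether the real SourceCoordinator has the visibility problem suspected above would still need to be confirmed against the Flink source.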