[ https://issues.apache.org/jira/browse/FLINK-37627?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
ASF GitHub Bot updated FLINK-37627: ----------------------------------- Labels: pull-request-available (was: ) > Restarting from a checkpoint/savepoint which coincides with shard split > causes data loss > ---------------------------------------------------------------------------------------- > > Key: FLINK-37627 > URL: https://issues.apache.org/jira/browse/FLINK-37627 > Project: Flink > Issue Type: Bug > Components: Connectors / Kinesis > Affects Versions: aws-connector-5.0.0 > Reporter: Keith Lee > Priority: Major > Labels: pull-request-available > > Similar to DDB stream connector's issue > https://issues.apache.org/jira/browse/FLINK-37416 > This is less likely to happen on Kinesis connector due to much lower > frequency of re-sharding / assigning new split but technically possible so > we'd like to fix this to avoid data > loss. > The scenario is as follow: > - A checkpoint started > - KinesisStreamsSourceEnumerator takes a checkpoint (shard was assigned here) > - KinesisStreamsSourceEnumerator sends checkpoint event to reader > - Before taking reader checkpoint, a SplitFinishedEvent came up in reader > - Reader takes checkpoint > - Now, just after checkpoint complete, job restarted > This can lead to a shard lineage getting lost because of a shard being in > ASSIGNED state in enumerator and not being part of any task manager state. > See DDB Connector issue's PR for reference fix: > https://issues.apache.org/jira/browse/FLINK-37416 -- This message was sent by Atlassian Jira (v8.20.10#820010)