Keith Lee created FLINK-37627: --------------------------------- Summary: Restarting from a checkpoint/savepoint which coincides with shard split causes data loss Key: FLINK-37627 URL: https://issues.apache.org/jira/browse/FLINK-37627 Project: Flink Issue Type: Bug Components: Connectors / Kinesis Affects Versions: aws-connector-5.0.0 Reporter: Keith Lee
Similar to DDB stream connector's issue https://issues.apache.org/jira/browse/FLINK-37416 This is less likely to happen on Kinesis connector due to much lower frequency of re-sharding / assigning new split but technically possible so we'd like to fix this to avoid data loss. The scenario is as follow: - A checkpoint started - KinesisStreamsSourceEnumerator takes a checkpoint (shard was assigned here) - KinesisStreamsSourceEnumerator sends checkpoint event to reader - Before taking reader checkpoint, a SplitFinishedEvent came up in reader - Reader takes checkpoint - Now, just after checkpoint complete, job restarted This can lead to a shard lineage getting lost because of a shard being in ASSIGNED state in enumerator and not being part of any task manager state. See DDB Connector issue's PR for reference fix: https://issues.apache.org/jira/browse/FLINK-37416 -- This message was sent by Atlassian Jira (v8.20.10#820010)