Hong Liang Teoh created FLINK-37417: ---------------------------------------
Summary: Implement better exception message to explain why KinesisSource doesn't support partial recovery Key: FLINK-37417 URL: https://issues.apache.org/jira/browse/FLINK-37417 Project: Flink Issue Type: Improvement Reporter: Hong Liang Teoh The KinesisSource doesn't currently support partial recovery because we want to prevent duplicate records being read into KinesisSource when restart + partial failure happens. Partial recovery is when Flink restarts a subset of subtasks to minimize downtime during a job failure. Flink determines which subtasks need to be restarted by checking all connected subtasks to the failed subtasks. However, this algorithm doesn't work for a KDS source, because the KinesisSource: # Has parent-child shard ordering within the KDS stream. # Does not ensure that all child shards are assigned to the same subtask (to prevent skew) These two mean that when we restart "selected" subtasks, we cannot ensure that the child shards have not been assigned to other subtasks "not connected" to the selected subtask. This is very much an edge case, where users will have to be doing the following: # NOT have a keyBy immediately after the KDS source. # Partial failure happens on an operator BEFORE the first keyBy in the job graph. This JIRA is to enrich the Exception message thrown -- This message was sent by Atlassian Jira (v8.20.10#820010)