Hong Liang Teoh created FLINK-37417:
---------------------------------------

             Summary: Implement better exception message to explain why 
KinesisSource doesn't support partial recovery
                 Key: FLINK-37417
                 URL: https://issues.apache.org/jira/browse/FLINK-37417
             Project: Flink
          Issue Type: Improvement
            Reporter: Hong Liang Teoh


The KinesisSource doesn't currently support partial recovery because we want to 
prevent duplicate records being read into KinesisSource when restart + partial 
failure happens.

 

Partial recovery is when Flink restarts a subset of subtasks to minimize 
downtime during a job failure. Flink determines which subtasks need to be 
restarted by checking all connected subtasks to the failed subtasks. 

However, this algorithm doesn't work for a KDS source, because the 
KinesisSource:
 # Has parent-child shard ordering within the KDS stream.
 # Does not ensure that all child shards are assigned to the same subtask (to 
prevent skew)

These two mean that when we restart "selected" subtasks, we cannot ensure that 
the child shards have not been assigned to other subtasks "not connected" to 
the selected subtask.

 

This is very much an edge case, where users will have to be doing the following:
 # NOT have a keyBy immediately after the KDS source.
 # Partial failure happens on an operator BEFORE the first keyBy in the job 
graph.

 

This JIRA is to enrich the Exception message thrown

 

 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

Reply via email to