[ 
https://issues.apache.org/jira/browse/FLINK-26657?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17617828#comment-17617828
 ] 

Danny Cranmer commented on FLINK-26657:
---------------------------------------

Resolving as "won't do" due to inactivity 

> Resilient Kinesis consumption
> -----------------------------
>
>                 Key: FLINK-26657
>                 URL: https://issues.apache.org/jira/browse/FLINK-26657
>             Project: Flink
>          Issue Type: Improvement
>          Components: Connectors / Kinesis
>            Reporter: John Karp
>            Assignee: John Karp
>            Priority: Major
>
> Currently, any sort of error reading from a Kinesis stream will quickly 
> result in a job-killing error. If the error is not 'recoverable', failure 
> will be instant, or if it is 'recoverable', there will be a fixed number of 
> retries before the job fails -- and for some operations such as GetRecords, 
> the retries can be exhausted in just a few seconds. Furthermore, 
> KinesisProxy.isRecoverableSdkClientException() and 
> KinesisProxy.isRecoverableException() only recognize very narrow categories 
> of errors as even being recoverable.
> So for example if a Flink job is aggregating Kinesis streams from multiple 
> regions, the Flink job will not be able to make any forward progress on 
> processing data from any region if there is a single-region outage, since the 
> job will likely fail before any checkpoint can be completed. For some use 
> cases, it is better to proceed with processing most of the data, than to wait 
> indefinitely for the problematic region to recover.
> One mitigation is to increase all of the ConsumerConfig timeouts to be very 
> high. However, this will only affect error handling for 'recoverable' 
> exceptions, and depending on the nature of the regional failure, the 
> resulting errors may not be classified as 'recoverable'.
> Proposed mitigation: add a 'soft failure' mode to the Kinesis consumer, where 
> most errors arising from Kinesis operations are considered recoverable, and 
> there are unlimited retries. (Except for perhaps EFO de-registration, which 
> I'm assuming needs to complete in a timely fashion. Also, it looks like 
> ExpiredIteratorException needs to bubble up to 
> PollingRecordPublisher.getRecords() without retries.)



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

Reply via email to