[ https://issues.apache.org/jira/browse/FLINK-26657?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17617828#comment-17617828 ]
Danny Cranmer commented on FLINK-26657:
---------------------------------------

Resolving as "won't do" due to inactivity

> Resilient Kinesis consumption
> -----------------------------
>
>                 Key: FLINK-26657
>                 URL: https://issues.apache.org/jira/browse/FLINK-26657
>             Project: Flink
>          Issue Type: Improvement
>          Components: Connectors / Kinesis
>            Reporter: John Karp
>            Assignee: John Karp
>            Priority: Major
>
> Currently, any error reading from a Kinesis stream quickly escalates into a job-killing failure. If the error is not 'recoverable', failure is immediate; if it is 'recoverable', there is a fixed number of retries before the job fails, and for some operations such as GetRecords the retries can be exhausted in just a few seconds. Furthermore, KinesisProxy.isRecoverableSdkClientException() and KinesisProxy.isRecoverableException() only recognize very narrow categories of errors as recoverable.
>
> For example, if a Flink job aggregates Kinesis streams from multiple regions, a single-region outage prevents the job from making forward progress on data from any region, since the job will likely fail before any checkpoint can complete. For some use cases it is better to keep processing most of the data than to wait indefinitely for the problematic region to recover.
>
> One mitigation is to increase all of the ConsumerConfig timeouts to very high values. However, this only affects error handling for 'recoverable' exceptions, and depending on the nature of the regional failure, the resulting errors may not be classified as 'recoverable'.
>
> Proposed mitigation: add a 'soft failure' mode to the Kinesis consumer, in which most errors arising from Kinesis operations are considered recoverable and retried without limit. (Except perhaps for EFO de-registration, which I'm assuming needs to complete in a timely fashion. Also, it looks like ExpiredIteratorException needs to bubble up to PollingRecordPublisher.getRecords() without retries.)

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
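As an illustration of the "increase the timeouts" mitigation mentioned in the description, below is a minimal sketch assuming the legacy flink-connector-kinesis FlinkKinesisConsumer and its ConsumerConfigConstants. The stream name, region, and the retry/backoff values are placeholders, not recommendations; the idea is simply to raise the GetRecords, GetShardIterator, and ListShards retry budgets and backoff ceilings so transient failures are retried for minutes rather than seconds.

{code:java}
import java.util.Properties;

import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.connectors.kinesis.FlinkKinesisConsumer;
import org.apache.flink.streaming.connectors.kinesis.config.AWSConfigConstants;
import org.apache.flink.streaming.connectors.kinesis.config.ConsumerConfigConstants;

public class LenientKinesisSource {

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        Properties consumerConfig = new Properties();
        consumerConfig.put(AWSConfigConstants.AWS_REGION, "us-east-1"); // placeholder region

        // Raise the GetRecords retry budget and backoff ceiling so transient
        // regional trouble is retried for a long time before the job fails.
        // (Values are illustrative only.)
        consumerConfig.put(ConsumerConfigConstants.SHARD_GETRECORDS_RETRIES, "100");
        consumerConfig.put(ConsumerConfigConstants.SHARD_GETRECORDS_BACKOFF_BASE, "1000");
        consumerConfig.put(ConsumerConfigConstants.SHARD_GETRECORDS_BACKOFF_MAX, "60000");

        // Same idea for shard iterator requests.
        consumerConfig.put(ConsumerConfigConstants.SHARD_GETITERATOR_RETRIES, "100");
        consumerConfig.put(ConsumerConfigConstants.SHARD_GETITERATOR_BACKOFF_BASE, "1000");
        consumerConfig.put(ConsumerConfigConstants.SHARD_GETITERATOR_BACKOFF_MAX, "60000");

        // And for shard discovery (ListShards).
        consumerConfig.put(ConsumerConfigConstants.LIST_SHARDS_RETRIES, "100");
        consumerConfig.put(ConsumerConfigConstants.LIST_SHARDS_BACKOFF_BASE, "1000");
        consumerConfig.put(ConsumerConfigConstants.LIST_SHARDS_BACKOFF_MAX, "60000");

        DataStream<String> records = env.addSource(
                new FlinkKinesisConsumer<>("my-stream", new SimpleStringSchema(), consumerConfig));

        records.print();
        env.execute("lenient-kinesis-read");
    }
}
{code}

Even with such settings, exceptions that fall outside KinesisProxy.isRecoverableException() / isRecoverableSdkClientException() still fail the job immediately, which is the gap the proposed "soft failure" mode was meant to address.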