[ https://issues.apache.org/jira/browse/FLINK-26657?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17507538#comment-17507538 ]
Danny Cranmer commented on FLINK-26657: --------------------------------------- There is a similar concept in the producer {{failOnError}}. We could introduce a similar configuration in the source to address this problem, which is enabled by default for backwards compatibility. [~jkarp] is this something you are looking to contribute? > Resilient Kinesis consumption > ----------------------------- > > Key: FLINK-26657 > URL: https://issues.apache.org/jira/browse/FLINK-26657 > Project: Flink > Issue Type: Improvement > Components: Connectors / Kinesis > Reporter: John Karp > Priority: Major > > Currently, any sort of error reading from a flink stream will quickly result > in a job-killing error. If the error is not 'recoverable', failure will be > instant, or if it is 'recoverable', there will be a fixed number of retries > before the job fails -- and for some operations such as GetRecords, the > retries can be exhausted in just a few seconds. Furthermore, > KinesisProxy.isRecoverableSdkClientException() and > KinesisProxy.isRecoverableException() only recognize very narrow categories > of errors as even being recoverable. > So for example if a Flink job is aggregating Kinesis streams from multiple > regions, the Flink job will not be able to make any forward progress on > processing data from any region if there is a single-region outage, since the > job will likely fail before any checkpoint can be completed. For some use > cases, it is better to proceed with processing most of the data, than to wait > indefinitely for the problematic region to recover. > One mitigation is to increase all of the ConsumerConfig timeouts to be very > high. However, this will only affect error handling for 'recoverable' > exceptions, and depending on the nature of the regional failure, the > resulting errors may not be classified as 'recoverable'. > Proposed mitigation: add a 'soft failure' mode to the Kinesis consumer, where > most errors arising from Kinesis operations are considered recoverable, and > there are unlimited retries. (Except for perhaps EFO de-registration, which > I'm assuming needs to complete in a timely fashion. Also, it looks like > ExpiredIteratorException needs to bubble up to > PollingRecordPublisher.getRecords() without retries.) -- This message was sent by Atlassian Jira (v8.20.1#820001)