[jira] [Commented] (FLINK-26657) Resilient Kinesis consumption

Danny Cranmer (Jira) Wed, 16 Mar 2022 04:56:06 -0700


    [ 
https://issues.apache.org/jira/browse/FLINK-26657?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17507538#comment-17507538
 ]


Danny Cranmer commented on FLINK-26657:
---------------------------------------

There is a similar concept in the producer {{failOnError}}. We could introduce 
a similar configuration in the source to address this problem, which is enabled 
by default for backwards compatibility. [~jkarp] is this something you are 
looking to contribute? 
  

> Resilient Kinesis consumption
> -----------------------------
>
>                 Key: FLINK-26657
>                 URL: https://issues.apache.org/jira/browse/FLINK-26657
>             Project: Flink
>          Issue Type: Improvement
>          Components: Connectors / Kinesis
>            Reporter: John Karp
>            Priority: Major
>
> Currently, any sort of error reading from a flink stream will quickly result 
> in a job-killing error. If the error is not 'recoverable', failure will be 
> instant, or if it is 'recoverable', there will be a fixed number of retries 
> before the job fails -- and for some operations such as GetRecords, the 
> retries can be exhausted in just a few seconds. Furthermore, 
> KinesisProxy.isRecoverableSdkClientException() and 
> KinesisProxy.isRecoverableException() only recognize very narrow categories 
> of errors as even being recoverable.
> So for example if a Flink job is aggregating Kinesis streams from multiple 
> regions, the Flink job will not be able to make any forward progress on 
> processing data from any region if there is a single-region outage, since the 
> job will likely fail before any checkpoint can be completed. For some use 
> cases, it is better to proceed with processing most of the data, than to wait 
> indefinitely for the problematic region to recover.
> One mitigation is to increase all of the ConsumerConfig timeouts to be very 
> high. However, this will only affect error handling for 'recoverable' 
> exceptions, and depending on the nature of the regional failure, the 
> resulting errors may not be classified as 'recoverable'.
> Proposed mitigation: add a 'soft failure' mode to the Kinesis consumer, where 
> most errors arising from Kinesis operations are considered recoverable, and 
> there are unlimited retries. (Except for perhaps EFO de-registration, which 
> I'm assuming needs to complete in a timely fashion. Also, it looks like 
> ExpiredIteratorException needs to bubble up to 
> PollingRecordPublisher.getRecords() without retries.)



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

[jira] [Commented] (FLINK-26657) Resilient Kinesis consumption

Reply via email to