[ https://issues.apache.org/jira/browse/FLINK-10020?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16569661#comment-16569661 ]
ASF GitHub Bot commented on FLINK-10020:
----------------------------------------

tweise commented on a change in pull request #6482: [FLINK-10020] [kinesis] Support recoverable exceptions in listShards.
URL: https://github.com/apache/flink/pull/6482#discussion_r207761087

 ##########
 File path: flink-connectors/flink-connector-kinesis/src/main/java/org/apache/flink/streaming/connectors/kinesis/proxy/KinesisProxy.java
 ##########
 @@ -433,6 +440,16 @@ private ListShardsResult listShards(String streamName, @Nullable String startSha
 			} catch (ExpiredNextTokenException expiredToken) {
 				LOG.warn("List Shards has an expired token. Reusing the previous state.");
 				break;
+			} catch (SdkClientException ex) {
+				if (isRecoverableSdkClientException(ex)) {
+					long backoffMillis = fullJitterBackoff(
+						listShardsBaseBackoffMillis, listShardsMaxBackoffMillis, listShardsExpConstant, attemptCount++);
+					LOG.warn("Got SdkClientException when listing shards from stream {}. Backing off for {} millis.",
+						streamName, backoffMillis);
+					Thread.sleep(backoffMillis);

 Review comment:
 Please see the JIRA for an example of such an exception. These are really the same type of exceptions that we don't want getRecords to fail on, and I believe we should be consistent with the backoff. Since listShards isn't latency sensitive, it won't hurt to err on the conservative side.
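For readers unfamiliar with the backoff referenced in the diff, here is a minimal, self-contained sketch of a "full jitter" exponential backoff. It is not the connector's actual code: the class name, the parameter names, and the sample values in main are assumptions chosen for illustration. Only the helper name fullJitterBackoff and its four-argument call shape come from the diff above.

```java
import java.util.concurrent.ThreadLocalRandom;

// Illustrative sketch only; not taken from KinesisProxy.
class FullJitterBackoffSketch {

    /**
     * Picks a random delay between 0 and an exponentially growing cap,
     * where cap = min(maxMillis, baseMillis * power^attempt).
     */
    public static long fullJitterBackoff(long baseMillis, long maxMillis, double power, int attempt) {
        long cap = (long) Math.min((double) maxMillis, baseMillis * Math.pow(power, attempt));
        // nextDouble() is in [0, 1), so the result never exceeds the cap.
        return (long) (ThreadLocalRandom.current().nextDouble() * cap);
    }

    public static void main(String[] args) {
        // Hypothetical settings: 1 s base, 5 s ceiling, growth factor 1.5.
        for (int attempt = 0; attempt < 6; attempt++) {
            long delay = fullJitterBackoff(1000L, 5000L, 1.5, attempt);
            System.out.printf("attempt %d: sleep %d ms%n", attempt, delay);
        }
    }
}
```

Randomizing over the whole interval, rather than adding a small jitter to a fixed delay, spreads out retries from many consumers that hit the same transient error at once.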
> Kinesis Consumer listShards should support more recoverable exceptions
> ----------------------------------------------------------------------
>
>                 Key: FLINK-10020
>                 URL: https://issues.apache.org/jira/browse/FLINK-10020
>             Project: Flink
>          Issue Type: Improvement
>          Components: Kinesis Connector
>            Reporter: Thomas Weise
>            Assignee: Thomas Weise
>            Priority: Major
>              Labels: pull-request-available
>
> Currently transient errors in listShards make the consumer fail and cause the
> entire job to reset. That is unnecessary for certain exceptions (like status
> 503 errors). It should be possible to control the exceptions that qualify for
> retry, similar to getRecords/isRecoverableSdkClientException.
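The idea in the issue description, retrying only those exceptions that a policy marks as recoverable, can be sketched roughly as follows. Everything here is hypothetical and independent of the AWS SDK: ServiceCallException, isRecoverable, and callWithRetry are stand-ins invented for this example, and treating HTTP 5xx as transient is an assumed policy, not a statement of what the connector does.

```java
import java.util.function.Predicate;
import java.util.function.Supplier;

// Illustrative sketch only; the types below are hypothetical stand-ins.
class RecoverableRetrySketch {

    /** Hypothetical exception carrying an HTTP status code from a service call. */
    public static class ServiceCallException extends RuntimeException {
        public final int statusCode;

        public ServiceCallException(String message, int statusCode) {
            super(message);
            this.statusCode = statusCode;
        }
    }

    /** Assumed policy: server-side errors (5xx, e.g. 503) are worth retrying. */
    public static boolean isRecoverable(RuntimeException ex) {
        return ex instanceof ServiceCallException
            && ((ServiceCallException) ex).statusCode >= 500;
    }

    /** Retries the call while the policy accepts the failure and attempts remain. */
    public static <T> T callWithRetry(
            Supplier<T> call, Predicate<RuntimeException> recoverable, int maxAttempts, long backoffMillis) {
        for (int attempt = 1; ; attempt++) {
            try {
                return call.get();
            } catch (RuntimeException ex) {
                if (!recoverable.test(ex) || attempt >= maxAttempts) {
                    throw ex; // non-recoverable, or out of attempts: fail as before
                }
                try {
                    Thread.sleep(backoffMillis);
                } catch (InterruptedException ie) {
                    Thread.currentThread().interrupt();
                    throw new IllegalStateException("Interrupted while backing off", ie);
                }
            }
        }
    }

    public static void main(String[] args) {
        int[] calls = {0};
        // Simulated call: fails twice with a 503, then succeeds.
        String result = callWithRetry(() -> {
            if (calls[0]++ < 2) {
                throw new ServiceCallException("Service Unavailable", 503);
            }
            return "shard-list";
        }, RecoverableRetrySketch::isRecoverable, 5, 10L);
        System.out.println(result + " after " + calls[0] + " calls");
    }
}
```

Passing the policy in as a predicate keeps the retry loop unchanged when the set of retryable exceptions needs to be widened, for example to include read timeouts.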