[ https://issues.apache.org/jira/browse/KAFKA-18469?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Yu-Lin Chen updated KAFKA-18469:
--------------------------------
    Description:
In AsyncConsumer, the ListOffsetRequest is only retried after a metadata update [1]. However, not every retriable error is followed by a metadata update. One example is the ReplicaNotAvailable error from remote storage. This error causes Consumer#offsetsForTimes to fail after the API timeout (60 seconds).

This issue does not occur with ClassicConsumer, which always triggers a metadata update before retrying [2].

This issue is the root cause of the flaky test KAFKA-18036, where Consumer#offsetsForTimes is called before the remote metadata cache is initialized.

[1] https://github.com/apache/kafka/blob/5684fc7a2ee1a4f29cb6d69d713233ed3c297882/clients/src/main/java/org/apache/kafka/clients/consumer/internals/OffsetsRequestManager.java#L529-L534
[2] https://github.com/apache/kafka/blob/5684fc7a2ee1a4f29cb6d69d713233ed3c297882/clients/src/main/java/org/apache/kafka/clients/consumer/internals/OffsetFetcher.java#L180

  was:
In AsyncConsumer, the ListOffsetRequest is only retried after a metadata update [1]. However, not every retriable error is followed by a metadata update. One example is the ReplicaNotAvailable error from remote storage. This error causes Consumer#offsetsForTimes to fail after the API timeout (60 seconds).

This issue does not happen in ClassicConsumer [2]. We should keep the behavior aligned.

This issue is the root cause of the flaky test KAFKA-18036, where Consumer#offsetsForTimes is called before the remote metadata cache is initialized.
[1] https://github.com/apache/kafka/blob/5684fc7a2ee1a4f29cb6d69d713233ed3c297882/clients/src/main/java/org/apache/kafka/clients/consumer/internals/OffsetsRequestManager.java#L529-L534
[2] https://github.com/apache/kafka/blob/5684fc7a2ee1a4f29cb6d69d713233ed3c297882/clients/src/main/java/org/apache/kafka/clients/consumer/internals/OffsetFetcher.java#L144-L153

> AsyncConsumer fails to retry ListOffsetRequest on ReplicaNotAvailable error without metadata update
> ---------------------------------------------------------------------------------------------------
>
>                 Key: KAFKA-18469
>                 URL: https://issues.apache.org/jira/browse/KAFKA-18469
>             Project: Kafka
>          Issue Type: Bug
>          Components: clients, consumer
>            Reporter: Yu-Lin Chen
>            Assignee: Yu-Lin Chen
>            Priority: Major
>

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
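The retry gating described in this issue can be illustrated with a small simulation. This is a toy model under stated assumptions, not the Kafka client code: the class ListOffsetsRetrySketch, the Outcome and RetryPolicy enums, and simulate() are hypothetical names. The model assumes that time advances in one-second ticks, that retry backoff is ignored, and that a ReplicaNotAvailable response never triggers a metadata update on its own. It also assumes that, apart from updates the client requests explicitly, metadata is refreshed only periodically (for instance every 5 minutes, the default of metadata.max.age.ms). The two policies mirror the AsyncConsumer and ClassicConsumer behavior as this issue reports it.

```java
import java.util.List;

/**
 * Toy model of the ListOffsets retry gating reported in KAFKA-18469.
 * All names are hypothetical; this is NOT the Kafka client implementation.
 */
public class ListOffsetsRetrySketch {

    /** Simplified result of one ListOffsets attempt. */
    enum Outcome { SUCCESS, REPLICA_NOT_AVAILABLE }

    enum RetryPolicy {
        /** Resend only after a metadata update is observed (reported AsyncConsumer behavior). */
        WAIT_FOR_METADATA_UPDATE,
        /** Request a metadata update before every retry (reported ClassicConsumer behavior). */
        REQUEST_UPDATE_BEFORE_RETRY
    }

    /**
     * Runs attempts in discrete ticks. Returns the attempt number of the first SUCCESS,
     * or -1 if timeoutTicks pass first (the analogue of the API timeout).
     * responses.get(i) is the outcome of attempt i; the last entry repeats after that,
     * so responses must be non-empty.
     * Assumption: apart from explicit requests, metadata refreshes every periodicRefreshTicks.
     */
    static int simulate(List<Outcome> responses, RetryPolicy policy,
                        int timeoutTicks, int periodicRefreshTicks) {
        int attempts = 0;
        boolean waitingForMetadata = false;
        for (int tick = 0; tick < timeoutTicks; tick++) {
            // A periodic refresh unblocks a request parked on a metadata update.
            if (tick > 0 && tick % periodicRefreshTicks == 0) {
                waitingForMetadata = false;
            }
            if (waitingForMetadata) {
                continue; // parked: nothing is sent on this tick
            }
            Outcome outcome = responses.get(Math.min(attempts, responses.size() - 1));
            attempts++;
            if (outcome == Outcome.SUCCESS) {
                return attempts;
            }
            // Retriable error. Under REQUEST_UPDATE_BEFORE_RETRY the explicitly requested
            // update is modeled as completing before the next tick, so the retry is not parked.
            // Under WAIT_FOR_METADATA_UPDATE the request waits for the next periodic refresh.
            waitingForMetadata = policy == RetryPolicy.WAIT_FOR_METADATA_UPDATE;
        }
        return -1;
    }

    public static void main(String[] args) {
        // Broker answers REPLICA_NOT_AVAILABLE twice (e.g. remote metadata cache not yet
        // initialized), then succeeds.
        List<Outcome> responses = List.of(
                Outcome.REPLICA_NOT_AVAILABLE, Outcome.REPLICA_NOT_AVAILABLE, Outcome.SUCCESS);
        // 60-tick timeout (60 s) vs. a periodic refresh every 300 ticks (5 min).
        System.out.println(simulate(responses, RetryPolicy.WAIT_FOR_METADATA_UPDATE, 60, 300));
        System.out.println(simulate(responses, RetryPolicy.REQUEST_UPDATE_BEFORE_RETRY, 60, 300));
    }
}
```

Running main prints -1 for the waiting policy, because the first ReplicaNotAvailable parks the request and the next refresh (tick 300) falls outside the 60-tick timeout, and 3 for the policy that requests an update before each retry. With a shorter hypothetical refresh interval, e.g. simulate(responses, RetryPolicy.WAIT_FOR_METADATA_UPDATE, 60, 10), the waiting policy also succeeds on the third attempt, only later.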