[ https://issues.apache.org/jira/browse/KAFKA-14372?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Jeff Kim updated KAFKA-14372:
-----------------------------
    Description:
The default replica selector chooses a replica based on whether the broker.rack matches the client.rack in the fetch request and whether the requested offset exists on the follower. If the follower is not in the ISR, we know it is lagging behind, which will in turn cause the consumer to lag. Consider two cases:
# The follower recovers and rejoins the ISR. The consumer will no longer lag.
# The follower continues to lag. After 5 minutes, the consumer refreshes the preferred read replica and is handed the same lagging follower again, since the offset the consumer fetches from is capped by the follower's high watermark (HWM). This can go on indefinitely.

If the replica selector chooses a broker in the ISR, then we can ensure that at least every 5 minutes the consumer will consume from an up-to-date replica.

  was:
The default replica selector chooses a replica solely on whether the broker.rack matches the client.rack in the fetch request (and whether the offset exists in the follower). Here is a scenario where the consumer would not be able to fetch:
# The cluster is undergoing a rolling upgrade and the follower is shutting down while the consumer is fetching from it.
# The connection shuts down gracefully and the consumer receives an error, but it still considers this follower the preferred read replica.
# At the next metadata.max.age.ms interval (5 minutes by default), the follower is no longer in the client's metadata, so the consumer reaches out to the leader.
# The leader redirects the fetch request to the follower, since the offline follower is still part of the replica set, and no progress is made.

Choosing a replica from the ISR will allow the consumer to make progress at step 4.
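The proposed change, restricting the candidates to ISR members before applying the rack match, can be illustrated with a minimal, language-neutral sketch. This is not Kafka's Java implementation: the `Replica` type, its fields, and `select_read_replica` are hypothetical names introduced only for illustration, and the tie-breaking rules (prefer the leader if it shares the client's rack, otherwise the most caught-up replica) are assumptions.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Replica:
    """Hypothetical view of one replica of a partition."""
    broker_id: int
    rack: Optional[str]
    log_end_offset: int
    in_isr: bool
    is_leader: bool = False


def select_read_replica(
    replicas: Sequence[Replica],
    client_rack: Optional[str],
    fetch_offset: int,
) -> Optional[Replica]:
    """Pick a preferred read replica, considering only ISR members."""
    leader = next((r for r in replicas if r.is_leader), None)

    # Without a client.rack there is nothing to match on: use the leader.
    if client_rack is None:
        return leader

    # Candidates must be in the ISR, share the client's rack,
    # and already hold the offset the consumer wants to fetch.
    candidates = [
        r for r in replicas
        if r.in_isr and r.rack == client_rack and r.log_end_offset >= fetch_offset
    ]

    # No in-sync replica in the client's rack: fall back to the leader
    # instead of redirecting the consumer to a lagging or offline follower.
    if not candidates:
        return leader

    # Assumed tie-break: leader first, otherwise the most caught-up replica.
    if leader is not None and leader in candidates:
        return leader
    return max(candidates, key=lambda r: r.log_end_offset)
```

With a leader in `rack-a`, an out-of-sync follower in `rack-b`, and an in-sync follower in `rack-b`, a consumer with `client.rack=rack-b` is directed to the in-sync follower. If the out-of-sync follower is the only replica in `rack-b`, the sketch falls back to the leader, which is what lets the consumer make progress in both scenarios described above.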
> RackAwareReplicaSelector should choose a replica from the isr
> -------------------------------------------------------------
>
>                 Key: KAFKA-14372
>                 URL: https://issues.apache.org/jira/browse/KAFKA-14372
>             Project: Kafka
>          Issue Type: Bug
>            Reporter: Jeff Kim
>            Assignee: Jeff Kim
>            Priority: Major
>
> The default replica selector chooses a replica based on whether the
> broker.rack matches the client.rack in the fetch request and whether the
> requested offset exists on the follower. If the follower is not in the ISR,
> we know it is lagging behind, which will in turn cause the consumer to lag.
> Consider two cases:
> # The follower recovers and rejoins the ISR. The consumer will no longer lag.
> # The follower continues to lag. After 5 minutes, the consumer refreshes the
> preferred read replica and is handed the same lagging follower again, since
> the offset the consumer fetches from is capped by the follower's high
> watermark (HWM). This can go on indefinitely.
> If the replica selector chooses a broker in the ISR, then we can ensure that
> at least every 5 minutes the consumer will consume from an up-to-date
> replica.

--
This message was sent by Atlassian Jira
(v8.20.10#820010)