[ 
https://issues.apache.org/jira/browse/KAFKA-9815?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jason Gustafson resolved KAFKA-9815.
------------------------------------
    Fix Version/s: 2.4.2
                   2.5.0
       Resolution: Fixed

> Consumer may never re-join if inconsistent metadata is received once
> --------------------------------------------------------------------
>
>                 Key: KAFKA-9815
>                 URL: https://issues.apache.org/jira/browse/KAFKA-9815
>             Project: Kafka
>          Issue Type: Bug
>          Components: consumer
>            Reporter: Rajini Sivaram
>            Assignee: Rajini Sivaram
>            Priority: Major
>             Fix For: 2.5.0, 2.4.2
>
>
> KAFKA-9797 is the result of an incorrect rolling upgrade test in which a new 
> listener is added to brokers and set as the inter-broker listener within the 
> same rolling upgrade. As a result, metadata is inconsistent across brokers 
> until the rolling upgrade completes, since inter-broker communication is 
> broken until all brokers have the new listener. The test fails due to 
> consumer timeouts, and sometimes this is because the upgrade takes longer 
> than the consumer timeout. But several logs show an issue with the consumer 
> when one metadata response received during the upgrade differs from the 
> consumer's cached `assignmentSnapshot`, triggering a rebalance.
> In 
> [https://github.com/apache/kafka/blob/7f640f13b4d486477035c3edb28466734f053beb/clients/src/main/java/org/apache/kafka/clients/consumer/internals/ConsumerCoordinator.java#L750],
>  we return true for `rejoinNeededOrPending()` if `assignmentSnapshot` is not 
> the same as the current `metadataSnapshot`. We don't set `rejoinNeeded` in 
> the instance, but we revoke partitions and send a `JoinGroup` request. If the 
> `JoinGroup` request fails and a subsequent metadata response contains the 
> same snapshot value as the previously cached `assignmentSnapshot`, we never 
> send `JoinGroup` again, since the snapshots match and `rejoinNeeded=false`. 
> No partitions are assigned to the consumer after this, and the test fails 
> because messages are not received.
> Even though this particular system test isn't a valid upgrade scenario, we 
> should fix the consumer, since temporary metadata differences between brokers 
> can result in this scenario.
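
The sequence described above can be reproduced with a small toy model. This is only a sketch: `RejoinSketch`, `poll`, `onMetadata`, `stuck` and the `rememberRejoin` switch are hypothetical names invented for illustration and do not appear in Kafka; only `rejoinNeeded`, `assignmentSnapshot`, `metadataSnapshot` and `rejoinNeededOrPending()` come from the description. Setting the flag when a snapshot mismatch is seen is one possible remedy, assumed here for contrast, and is not necessarily the change that was committed.

```java
import java.util.Objects;

// Hypothetical, heavily simplified model of the rejoin decision.
// This is NOT the real ConsumerCoordinator.
class RejoinSketch {
    boolean rejoinNeeded = false;       // explicit flag kept on the instance
    String assignmentSnapshot = "v1";   // metadata seen at the last assignment
    String metadataSnapshot = "v1";     // most recently received metadata
    boolean partitionsAssigned = true;
    final boolean rememberRejoin;       // assumption: toggles the possible remedy

    RejoinSketch(boolean rememberRejoin) {
        this.rememberRejoin = rememberRejoin;
    }

    boolean rejoinNeededOrPending() {
        // True if the flag is set OR the two snapshots differ.
        return rejoinNeeded || !Objects.equals(assignmentSnapshot, metadataSnapshot);
    }

    void onMetadata(String snapshot) {
        metadataSnapshot = snapshot;
    }

    // One poll iteration; joinSucceeds simulates the JoinGroup outcome.
    void poll(boolean joinSucceeds) {
        if (!rejoinNeededOrPending()) {
            return;                     // nothing to do, no JoinGroup is sent
        }
        if (rememberRejoin) {
            rejoinNeeded = true;        // remember the pending rejoin across failures
        }
        partitionsAssigned = false;     // partitions are revoked before joining
        if (joinSucceeds) {
            assignmentSnapshot = metadataSnapshot;
            partitionsAssigned = true;
            rejoinNeeded = false;
        }
    }

    // Replays the scenario and reports whether the consumer ends up
    // without partitions.
    static boolean stuck(boolean rememberRejoin) {
        RejoinSketch c = new RejoinSketch(rememberRejoin);
        c.onMetadata("v2");   // one inconsistent metadata response
        c.poll(false);        // partitions revoked, JoinGroup fails
        c.onMetadata("v1");   // metadata matches assignmentSnapshot again
        c.poll(true);         // a JoinGroup would succeed now, if one were sent
        return !c.partitionsAssigned;
    }

    public static void main(String[] args) {
        System.out.println("flag not set on mismatch, stuck: " + stuck(false));
        System.out.println("flag set on mismatch, stuck: " + stuck(true));
    }
}
```

In this model, `stuck(false)` leaves the consumer without partitions because the second `poll` sees matching snapshots and a cleared flag, while `stuck(true)` retries the join and recovers.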



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
