[ 
https://issues.apache.org/jira/browse/KAFKA-19181?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chirag Wadhwa resolved KAFKA-19181.
-----------------------------------
    Resolution: Fixed

The system test failures have been resolved

> Investigate system test failures
> --------------------------------
>
>                 Key: KAFKA-19181
>                 URL: https://issues.apache.org/jira/browse/KAFKA-19181
>             Project: Kafka
>          Issue Type: Sub-task
>            Reporter: Chirag Wadhwa
>            Assignee: Chirag Wadhwa
>            Priority: Major
>
> The nightly system test runs have picked up failures for the following 2 
> tests - 
> 1) test_share_multiple_partitions 
> 2) test_broker_failure
> Investigation analysis - 
> 1) For the first test, 3 brokers are run with 3 share consumers (all part of 
> the same group). A million messages are produced to a topic with 3 partitions. 
> Once the messages are produced and consumed, the assertions check that every 
> consumer has consumed some messages from every share partition. But with the 
> new SimpleAssignor algorithm in place, some consumers are not assigned some 
> partitions, so they don't consume from those share partitions, resulting in 
> assertion failures.
> Fix - change the test to first find the assignment of each consumer using the 
> kafka-share-groups.sh --describe command and only assert on the assigned 
> share partitions (a sketch of this change appears at the end of this 
> description).
>  
> 2) The bug appears when the coordinator is not active while an initialize 
> request is being processed. The consumer sends heartbeats to the broker, but 
> none of them succeed. Initially a few of them fail with a 
> {{COORDINATOR_NOT_AVAILABLE}} error. This is expected and should be fine 
> because it is a transient error. But during this time, the broker keeps 
> updating the memberEpoch for the member, while the response sent back to the 
> member carries a memberEpoch of 0. Since the client depends on the broker's 
> response to update its memberEpoch, the subsequent requests are also sent 
> with memberEpoch 0. This happens for a couple of requests (the broker keeps 
> increasing the memberEpoch but sends back a {{COORDINATOR_NOT_AVAILABLE}} 
> error with memberEpoch 0). Finally, when the coordinator becomes active, we 
> get a {{FENCED_MEMBER_EPOCH}} exception as expected. From then on, the member 
> keeps sending heartbeats with the wrong memberEpoch and the broker keeps 
> sending back {{FENCED_MEMBER_EPOCH}} (a toy reproduction of this sequence is 
> sketched at the end of this description).
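>
> A minimal sketch of the adjusted test logic for 1) (not the actual ducktape 
> test; the exact kafka-share-groups.sh invocation beyond --describe and the 
> column layout of its output are assumptions here):
> {code:python}
> import subprocess
> from collections import defaultdict
>
> def describe_share_group(bootstrap_server, group):
>     # Ask the broker for the group's current assignment. The output format
>     # (a header row plus one row per topic-partition) is an assumption.
>     out = subprocess.run(
>         ["kafka-share-groups.sh", "--bootstrap-server", bootstrap_server,
>          "--describe", "--group", group],
>         capture_output=True, text=True, check=True).stdout
>     assignment = defaultdict(set)  # member id -> {(topic, partition)}
>     for line in out.splitlines()[1:]:  # skip the (assumed) header row
>         cols = line.split()
>         if len(cols) >= 4:  # assumed columns: group, topic, partition, member
>             topic, partition, member = cols[1], int(cols[2]), cols[3]
>             assignment[member].add((topic, partition))
>     return assignment
>
> def assert_assigned_partitions_consumed(consumed_by_member, assignment):
>     # Assert only on the share partitions a member was actually assigned,
>     # instead of expecting every consumer to see every partition.
>     for member, partitions in assignment.items():
>         for tp in partitions:
>             assert consumed_by_member[member].get(tp, 0) > 0, \
>                 "%s was assigned %s but consumed nothing from it" % (member, tp)
> {code}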
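>
> A toy reproduction of the heartbeat sequence described in 2) (illustration 
> only, not Kafka code; the classes below are made up to show why the epochs 
> diverge):
> {code:python}
> class Broker:
>     def __init__(self):
>         self.member_epoch = 0
>         self.coordinator_active = False
>
>     def heartbeat(self, client_epoch):
>         if not self.coordinator_active:
>             self.member_epoch += 1                 # broker state keeps advancing...
>             return "COORDINATOR_NOT_AVAILABLE", 0  # ...but the response carries epoch 0
>         if client_epoch != self.member_epoch:
>             return "FENCED_MEMBER_EPOCH", 0        # client never learns the real epoch
>         return "NONE", self.member_epoch
>
> class Client:
>     def __init__(self):
>         self.member_epoch = 0
>
>     def heartbeat(self, broker):
>         error, epoch = broker.heartbeat(self.member_epoch)
>         self.member_epoch = epoch                  # client trusts the response epoch
>         return error
>
> broker, client = Broker(), Client()
> for _ in range(3):
>     print(client.heartbeat(broker))  # COORDINATOR_NOT_AVAILABLE (transient, expected)
> broker.coordinator_active = True
> for _ in range(3):
>     print(client.heartbeat(broker))  # FENCED_MEMBER_EPOCH on every retry: epochs diverged
> {code}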



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
