[ https://issues.apache.org/jira/browse/KAFKA-19181?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Chirag Wadhwa resolved KAFKA-19181.
-----------------------------------
    Resolution: Fixed

The system test failures have been resolved.

> Investigate system test failures
> --------------------------------
>
>                 Key: KAFKA-19181
>                 URL: https://issues.apache.org/jira/browse/KAFKA-19181
>             Project: Kafka
>          Issue Type: Sub-task
>            Reporter: Chirag Wadhwa
>            Assignee: Chirag Wadhwa
>            Priority: Major
>
> The nightly system test runs have picked up failures in the following 2
> tests -
> 1) test_share_multiple_partitions
> 2) test_broker_failure
>
> Investigation analysis -
> 1) The first test runs 3 brokers with 3 share consumers (all part of the
> same group). A million messages are produced to a topic with 3 partitions.
> Once the messages are produced and consumed, the assertions check that
> every consumer has consumed some messages from every share partition. But
> with the new SimpleAssignor algorithm in place, some consumers are not
> assigned some partitions, so they don't consume from those share
> partitions, resulting in assertion failures.
> Fix - change the test to first find the assignment of the consumers using
> the kafka-share-groups.sh --describe command and only include assertions
> for assigned share partitions.
>
> 2) The bug occurs when the coordinator is not active while an initialize
> request is in progress. The consumer sends heartbeats to the broker, but
> none of them succeed. Initially a few of them fail with a
> {{COORDINATOR_NOT_AVAILABLE}} error. This is expected and should be fine
> because it is a transient error. But during this time, the broker keeps
> updating the memberEpoch for the member, while the response sent back to
> the member has memberEpoch 0. As I understand it, the client depends on
> the broker's response to update its memberEpoch, so the subsequent
> requests are also sent with memberEpoch 0. This happens for a couple of
> requests (the broker keeps increasing the memberEpoch but sends back a
> {{COORDINATOR_NOT_AVAILABLE}} error with memberEpoch 0). Finally, when the
> coordinator becomes active, we get a {{FENCED_MEMBER_EPOCH}} error, as
> expected. From then on, the member keeps sending heartbeats with the wrong
> memberEpoch and the broker keeps sending back {{FENCED_MEMBER_EPOCH}}
> errors.
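
The heartbeat sequence described for test_broker_failure can be illustrated with a small toy model. This is only a sketch of the behaviour as reported in the issue, not Kafka's coordinator or client code: ToyBroker, simulate and the string constants are hypothetical names, and the model deliberately leaves out any rejoin handling a real client might perform.

```python
from dataclasses import dataclass

# Error names as they appear in the issue; plain strings in this toy model.
COORDINATOR_NOT_AVAILABLE = "COORDINATOR_NOT_AVAILABLE"
FENCED_MEMBER_EPOCH = "FENCED_MEMBER_EPOCH"
NONE = "NONE"


@dataclass
class ToyBroker:
    """Hypothetical stand-in for the broker side of the heartbeat exchange."""
    member_epoch: int = 0
    coordinator_active: bool = False

    def heartbeat(self, client_epoch):
        if not self.coordinator_active:
            # Reported behaviour: the broker-side epoch advances even though
            # the request fails and the response carries memberEpoch 0.
            self.member_epoch += 1
            return COORDINATOR_NOT_AVAILABLE, 0
        if client_epoch != self.member_epoch:
            return FENCED_MEMBER_EPOCH, 0
        return NONE, self.member_epoch


def simulate(unavailable_rounds, total_rounds):
    """Send heartbeats; the coordinator is inactive for the first rounds."""
    broker = ToyBroker()
    client_epoch = 0
    errors = []
    for i in range(total_rounds):
        broker.coordinator_active = i >= unavailable_rounds
        error, epoch = broker.heartbeat(client_epoch)
        if error == NONE:
            # The client only learns a new epoch from a successful response.
            client_epoch = epoch
        errors.append(error)
    return errors


if __name__ == "__main__":
    for error in simulate(unavailable_rounds=3, total_rounds=6):
        print(error)
```

With three unavailable rounds followed by three active ones, the model yields three COORDINATOR_NOT_AVAILABLE responses and then FENCED_MEMBER_EPOCH on every later heartbeat, because the client is still sending epoch 0 while the broker-side epoch has moved to 3.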