[ https://issues.apache.org/jira/browse/KAFKA-18040?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17903787#comment-17903787 ]
Kirk True commented on KAFKA-18040:
-----------------------------------

In this test there are two brokers: a leader and a follower. The purpose of this test is to verify that when the follower fails, things continue to work. To that end, the test _intentionally_ shuts down the follower.

For a little bit of background, when the consumer sends the {{FIND_COORDINATOR}} request to the cluster, the response will include the ID of the broker selected as the group coordinator. Given that this test only has two brokers, the group coordinator is either the leader or the follower. The broker that is selected for each test appears to be "random" in that it's sometimes the leader and sometimes the follower.

What I've found so far in my investigation: when the test runs and the broker selected as the group coordinator is the _leader_, the test passes. When the test runs and the broker selected is the _follower_, the test fails.

The consumer can detect when it is disconnected from the group coordinator, and it will then send another {{FIND_COORDINATOR}} request to one of the remaining nodes in the cluster to ask for the new coordinator. In my investigation, the client is able to successfully connect to another broker (the leader), but the broker repeatedly responds with the {{COORDINATOR_NOT_AVAILABLE}} error. I increased the timeout for the test from 15 seconds to 60 seconds, but this did not give the broker enough time to "fail over" to a new coordinator.
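To make the two outcomes above concrete, here is a toy model of the rediscovery loop. It is only a sketch and not Kafka client or broker code: {{ToyCluster}}, {{rediscover}}, and the broker IDs are hypothetical names, and the model simply assumes that the coordinator role never moves to another broker, which mirrors what was observed in the failing runs rather than explaining why it happens.

```python
from dataclasses import dataclass

# Error names mirror the ones mentioned in the comment; everything else is hypothetical.
NONE = "NONE"
COORDINATOR_NOT_AVAILABLE = "COORDINATOR_NOT_AVAILABLE"

LEADER, FOLLOWER = 0, 1  # assumed broker IDs for the two-broker test cluster


@dataclass
class ToyCluster:
    """Hypothetical stand-in for the two-broker cluster used by the test."""
    live_brokers: set
    coordinator: int  # ID of the broker hosting the group coordinator

    def shutdown(self, broker_id):
        """Simulate the test intentionally stopping a broker."""
        self.live_brokers.discard(broker_id)

    def find_coordinator(self):
        """Answer a FIND_COORDINATOR request sent to any live broker."""
        if self.coordinator in self.live_brokers:
            return NONE, self.coordinator
        # Assumption: no failover takes place, so the surviving broker
        # keeps reporting that the coordinator is unavailable.
        return COORDINATOR_NOT_AVAILABLE, None


def rediscover(cluster, max_attempts):
    """Retry FIND_COORDINATOR; return (coordinator_id_or_None, attempts_used)."""
    for attempt in range(1, max_attempts + 1):
        error, node = cluster.find_coordinator()
        if error == NONE:
            return node, attempt
    return None, max_attempts


if __name__ == "__main__":
    for chosen in (LEADER, FOLLOWER):
        cluster = ToyCluster(live_brokers={LEADER, FOLLOWER}, coordinator=chosen)
        cluster.shutdown(FOLLOWER)
        print(chosen, rediscover(cluster, max_attempts=5))
```

In this model, when the coordinator is the leader, the first retry succeeds. When it is the follower, every attempt fails no matter how large {{max_attempts}} is, which is the analogue of raising the test timeout from 15 to 60 seconds without any change in the result.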
> PlaintextProducerSendTest.testSendToPartitionWithFollowerShutdownShouldNotTimeout
> fails with CONSUMER group protocol
> --------------------------------------------------------------------------------------------------------------------
>
>                 Key: KAFKA-18040
>                 URL: https://issues.apache.org/jira/browse/KAFKA-18040
>             Project: Kafka
>          Issue Type: Bug
>          Components: clients, consumer
>    Affects Versions: 3.9.0
>            Reporter: Kirk True
>            Assignee: Kirk True
>            Priority: Major
>              Labels: integration-test, kip-848-client-support
>             Fix For: 4.0.0
>
>
> I am getting an error ({{Consumed 0 records before timeout instead of the
> expected 100}}) when running this test with the {{CONSUMER}} group protocol.
> When I run the test using the {{CLASSIC}} group protocol, I can run it 25
> times in a row without an error. So it doesn't appear to be flaky
> (KAFKA-16176). When I run it with the {{CONSUMER}} group protocol, it fails
> 16 out of the 25 runs :(
>

--
This message was sent by Atlassian Jira
(v8.20.10#820010)