[ 
https://issues.apache.org/jira/browse/KAFKA-18040?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17903787#comment-17903787
 ] 

Kirk True commented on KAFKA-18040:
-----------------------------------

In this test there are two brokers: a leader and a follower. The purpose of 
this test is to verify that when the follower fails, things continue to work. 
To that end, the test _intentionally_ shuts down the follower.

For background: when the consumer sends a {{FIND_COORDINATOR}} request to the 
cluster, the response includes the ID of the broker acting as the group 
coordinator. Because this test has only two brokers, the group coordinator is 
either the leader or the follower. Which broker is selected varies from run to 
run, seemingly at random: sometimes it's the leader and sometimes the follower.
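
One possible reason the selection looks random: as far as I know, the 
coordinator for a group is the broker that leads the {{__consumer_offsets}} 
partition the group ID hashes to, so it depends on how partition leadership 
was assigned in a given run. Below is a minimal sketch of that idea. The class 
and method names and the leader map are hypothetical and only for 
illustration; this is not the broker's actual code.

```java
import java.util.HashMap;
import java.util.Map;

public class CoordinatorLookupSketch {

    // Maps a group ID to a partition of the internal offsets topic.
    // The guard is needed because Math.abs(Integer.MIN_VALUE) is still negative.
    static int partitionFor(String groupId, int numPartitions) {
        int h = groupId.hashCode();
        int nonNegative = (h == Integer.MIN_VALUE) ? 0 : Math.abs(h);
        return nonNegative % numPartitions;
    }

    // Returns the ID of the broker leading that partition, or -1 if the
    // partition currently has no known leader.
    static int coordinatorFor(String groupId, int numPartitions,
                              Map<Integer, Integer> leaders) {
        return leaders.getOrDefault(partitionFor(groupId, numPartitions), -1);
    }

    public static void main(String[] args) {
        int numPartitions = 50; // assumed partition count for the offsets topic

        // Made-up leadership: partitions alternate between broker 0 and broker 1.
        Map<Integer, Integer> leaders = new HashMap<>();
        for (int p = 0; p < numPartitions; p++) {
            leaders.put(p, p % 2);
        }

        String groupId = "my-test-group"; // hypothetical group ID
        System.out.println("partition   = " + partitionFor(groupId, numPartitions));
        System.out.println("coordinator = " + coordinatorFor(groupId, numPartitions, leaders));
    }
}
```

With only two brokers, any change in which broker leads that one partition 
flips the coordinator between the leader and the follower of the test's topic.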

What I've found so far: when the broker selected as the group coordinator is 
the _leader_, the test passes. When the broker selected is the _follower_ 
(the one the test shuts down), the test fails.

The consumer can detect when it is disconnected from the group coordinator, and 
it then sends another {{FIND_COORDINATOR}} request to one of the remaining 
nodes in the cluster to ask for the new coordinator. What I observe is that the 
client successfully connects to the other broker (the leader), but that broker 
repeatedly responds with the {{COORDINATOR_NOT_AVAILABLE}} error. I increased 
the test timeout from 15 seconds to 60 seconds, but even that was not enough 
time for the broker to "fail over" to a new coordinator.
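
The retry behavior described above can be sketched roughly as follows. This is 
a toy model under stated assumptions, not the consumer's real implementation: 
the {{Response}} type, the {{findCoordinator}} helper, and the attempt cap 
(standing in for the test timeout) are all hypothetical.

```java
import java.util.Optional;
import java.util.function.Supplier;

public class FindCoordinatorRetrySketch {

    enum Error { NONE, COORDINATOR_NOT_AVAILABLE }

    // Hypothetical stand-in for a FIND_COORDINATOR response.
    static final class Response {
        final Error error;
        final int coordinatorId;

        private Response(Error error, int coordinatorId) {
            this.error = error;
            this.coordinatorId = coordinatorId;
        }

        static Response ok(int coordinatorId) {
            return new Response(Error.NONE, coordinatorId);
        }

        static Response notAvailable() {
            return new Response(Error.COORDINATOR_NOT_AVAILABLE, -1);
        }
    }

    // Asks the remaining broker for the coordinator until it answers without
    // an error or the attempt budget runs out.
    static Optional<Integer> findCoordinator(Supplier<Response> lookup, int maxAttempts) {
        for (int attempt = 0; attempt < maxAttempts; attempt++) {
            Response response = lookup.get();
            if (response.error == Error.NONE) {
                return Optional.of(response.coordinatorId);
            }
            // COORDINATOR_NOT_AVAILABLE is treated as retriable: ask again.
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        // Failing case: the broker never elects a new coordinator.
        System.out.println(findCoordinator(Response::notAvailable, 10));

        // Passing case: the broker fails over on the third request.
        int[] calls = {0};
        System.out.println(findCoordinator(
            () -> ++calls[0] < 3 ? Response.notAvailable() : Response.ok(0), 10));
    }
}
```

In the failing runs the test behaves like the first case: no matter how large 
the budget is, the lookup never returns a coordinator.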

> PlaintextProducerSendTest.testSendToPartitionWithFollowerShutdownShouldNotTimeout
>  fails with CONSUMER group protocol
> --------------------------------------------------------------------------------------------------------------------
>
>                 Key: KAFKA-18040
>                 URL: https://issues.apache.org/jira/browse/KAFKA-18040
>             Project: Kafka
>          Issue Type: Bug
>          Components: clients, consumer
>    Affects Versions: 3.9.0
>            Reporter: Kirk True
>            Assignee: Kirk True
>            Priority: Major
>              Labels: integration-test, kip-848-client-support
>             Fix For: 4.0.0
>
>
> I am getting an error ({{Consumed 0 records before timeout instead of the 
> expected 100}}) when running this test with the {{CONSUMER}} group protocol.
> When I run the test using the {{CLASSIC}} group protocol, I can run it 25 
> times in a row without an error. So it doesn't appear to be flaky 
> (KAFKA-16176). When I run it with the {{CONSUMER}} group protocol, it fails 
> 16 out of the 25 runs :(
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
