Lianet Magrans created KAFKA-18569: -------------------------------------- Summary: New consumer close may wait on unneeded FindCoordinator Key: KAFKA-18569 URL: https://issues.apache.org/jira/browse/KAFKA-18569 Project: Kafka Issue Type: Bug Components: consumer Reporter: Lianet Magrans Fix For: 4.0.0
A flaky test revealed that the new consumer close may wait for a FindCoordinator unsent request to go out when closing the consumer, even after the commit/leaveGroup stages of close are done. This could happen because the CoordinatorRequestManager poll continues to attempt FindCoordinator if the coordinator is unknown , even if this happens during consumer close, after the consumer has completed the commit/leave attempts (which are the only steps in close that require a coordinator), and before the network shutdown that stops polling managers here [https://github.com/apache/kafka/blob/5c20aa187aa8f51af4270d7d1b0db4963b0cd10b/clients/src/main/java/org/apache/kafka/clients/consumer/internals/AsyncKafkaConsumer.java#L1343] If the unneeded FindCoordinator is generated and the brokers are down (like could happen in the flaky test), the consumer would wait for that request unnecessarily here [https://github.com/apache/kafka/blob/5c20aa187aa8f51af4270d7d1b0db4963b0cd10b/clients/src/main/java/org/apache/kafka/clients/consumer/internals/ConsumerNetworkThread.java#L327] I expect we shouldn't block the close on a FindCoordinator request if the consumer already completed the commit/leave attempts. An option could be to consider "signal close" to the CoordinatorRequestManager after the consumer.close completes commit/leave, so that it does not generate any more requests on poll (similar to what is already done for the CommitRequestManager with the CommitOnCloseEvent [https://github.com/apache/kafka/blob/5c20aa187aa8f51af4270d7d1b0db4963b0cd10b/clients/src/main/java/org/apache/kafka/clients/consumer/internals/AsyncKafkaConsumer.java#L1335] This fix should allow to enable this test for the new consumer reliably. https://github.com/apache/kafka/blob/5c20aa187aa8f51af4270d7d1b0db4963b0cd10b/core/src/test/scala/integration/kafka/api/ConsumerBounceTest.scala#L404 Without the fix, the test is flaky (fails locally after a few repeated runs, fails in CI). -- This message was sent by Atlassian Jira (v8.20.10#820010)