kirktrue commented on code in PR #19980:
URL: https://github.com/apache/kafka/pull/19980#discussion_r2298969509
##########
clients/clients-integration-tests/src/test/java/org/apache/kafka/clients/consumer/PlaintextConsumerTest.java:
##########
@@ -1588,6 +1593,103 @@ private void sendCompressedMessages(int numRecords, TopicPartition tp) {
         }
     }
+    // Override the default test timeout of 60 seconds since the core logic of the test allows up to 60 seconds to
+    // pass to detect the issue.
+    @Timeout(75)
+    @ClusterTest
+    public void testClassicConsumerStallBetweenPoll() throws Exception {
+        testStallBetweenPoll(GroupProtocol.CLASSIC);
+    }
+
+    // Override the default test timeout of 60 seconds since the core logic of the test allows up to 60 seconds to
+    // pass to detect the issue.
+    @Timeout(75)
+    @ClusterTest
+    public void testAsyncConsumerStallBetweenPoll() throws Exception {
+        testStallBetweenPoll(GroupProtocol.CONSUMER);
+    }
+
+    /**
+     * This test is to prove that the intermittent stalling that has been experienced when using the asynchronous
+     * consumer, as filed under KAFKA-19259, have been fixed.
+     *
+     * <p/>
+     *
+     * The basic idea is to have one thread that produces a record every 500 ms. and the main thread that consumes
+     * records without pausing between polls for much more than the produce delay. In the test case filed in
+     * KAFKA-19259, the consumer sometimes pauses for up to 5-10 seconds despite records being produced every
+     * quarter second.
+     */
+    private void testStallBetweenPoll(GroupProtocol groupProtocol) throws Exception {
+        var testTopic = "stutter-test-topic";
+        var numPartitions = 6;
+        cluster.createTopic(testTopic, numPartitions, (short) BROKER_COUNT);
+
+        // Give the test one minute to detect a stall.
+        var testTimeout = 60000;
+
+        // The producer must produce slowly to tickle the scenario.
+        var produceWait = 500;
+
+        // Assign a tolerance for how much time is allowed to pass between Consumer.poll() calls given that there
+        // should be *at least* one record to read every second.
+        var delayTolerance = produceWait * 2;
+
+        try (var producer = cluster.producer()) {
+            // Start a thread running that produces records at a relative trickle.
+            var producerThread = new Thread(() -> {
+                while (true) {
+                    try {
+                        Utils.sleep(produceWait);
+                        producer.send(new ProducerRecord<>(testTopic, TestUtils.randomBytes(64))).get();
+                    } catch (InterruptedException e) {
+                        break;
+                    } catch (Exception e) {
+                        throw new RuntimeException(e);
+                    }
+                }
+            });
+            producerThread.start();
+
+            Map<String, Object> consumerConfig = Map.of(GROUP_PROTOCOL_CONFIG, groupProtocol.name().toLowerCase(Locale.ROOT));
+
+            try (Consumer<byte[], byte[]> consumer = cluster.consumer(consumerConfig)) {
+                consumer.subscribe(List.of(testTopic));
+
+                // This is just to wait until the group membership and assignment is in place.
+                awaitNonEmptyRecords(consumer, new TopicPartition(testTopic, 0));
+
+                var testTimer = Time.SYSTEM.timer(testTimeout);
+
+                // Keep track of the last time the poll is invoked to ensure the deltas between invocations don't
+                // exceed the delay threshold defined above.
+                var lastPoll = System.currentTimeMillis();
+
+                while (testTimer.notExpired()) {

Review Comment:
   I refactored it to use just two calls to `Consumer.poll()`. I temporarily removed the call to `FetchBuffer.wakeup()` from `AbstractFetch` and ran the test 100x, and it fails consistently, as expected. I put the call to `wakeup()` back in and ran it another 100x, and it passes consistently.

   Thanks for the suggestion!