[jira] [Created] (KAFKA-17661) Fix flaky BufferPoolTest.testBlockTimeout
Yu-Lin Chen created KAFKA-17661:
---
Summary: Fix flaky BufferPoolTest.testBlockTimeout
Key: KAFKA-17661
URL: https://issues.apache.org/jira/browse/KAFKA-17661
Project: Kafka
Issue Type: Bug
Components: clients, unit tests
Reporter: Yu-Lin Chen
Assignee: Yu-Lin Chen
Attachments: 0001-reproduce-racing-issue-by-adding-delay-to-test-thread.patch

4 flaky builds out of 221 trunk builds in the past 28 days. (github) ([Report Link|https://ge.apache.org/scans/tests?search.rootProjectNames=kafka&search.startTimeMax=1727681219558&search.startTimeMin=172520640&search.tags=github&search.timeZoneId=Asia%2FTaipei&tests.container=org.apache.kafka.clients.producer.internals.BufferPoolTest&tests.test=testBlockTimeout()])

([Sep 27 2024 at 04:54:00|https://ge.apache.org/s/nh44u7tsm2lri/tests/task/:clients:test/details/org.apache.kafka.clients.producer.internals.BufferPoolTest/testBlockTimeout()?expanded-stacktrace=WyIwIl0&top-execution=1])
{code:java}
org.opentest4j.AssertionFailedError: The buffer allocated more memory than its maximum value 10
    at org.junit.jupiter.api.AssertionUtils.fail(AssertionUtils.java:38)
    at org.junit.jupiter.api.Assertions.fail(Assertions.java:138)
    at org.apache.kafka.clients.producer.internals.BufferPoolTest.testBlockTimeout(BufferPoolTest.java:184)
    at java.lang.reflect.Method.invoke(Method.java:566)
    at java.util.ArrayList.forEach(ArrayList.java:1541)
    at java.util.ArrayList.forEach(ArrayList.java:1541)
{code}
Root cause:
# The test relies on 3 asynchronous threads being triggered in parallel with the test thread [1]. However, parallelism is not guaranteed in the test environment. The issue happens if the test thread does not get CPU time within 25 ms. It can be reproduced by adding a 30 ms delay to the test thread; please check the attached patch.
# Since a 25 ms delay is clearly unreliable in the test environment, we could consider rewriting the test or increasing the delay.
(The maxBlockTimeMs was reduced from 2000ms to 10 ms in KAFKA-9852) [1] [https://github.com/apache/kafka/blob/40360819bb97d6b05dfef6451888b4d908fc3bf4/clients/src/test/java/org/apache/kafka/clients/producer/internals/BufferPoolTest.java#L175-L179] -- This message was sent by Atlassian Jira (v8.20.10#820010)
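For illustration only: one way to stop depending on a 25 ms scheduling window is to have the waiting thread block on an explicit completion signal. The sketch below is not the BufferPoolTest code; the class name LatchCoordinationSketch, the method runAndAwait and the 10-second bound are assumptions made for this example.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch, not the BufferPoolTest code: coordinate with helper
// threads through a latch instead of assuming they run within a time window.
final class LatchCoordinationSketch {

    /** Starts `workers` helper threads and returns how many finished their work. */
    static int runAndAwait(int workers) {
        CountDownLatch finished = new CountDownLatch(workers);
        AtomicInteger completed = new AtomicInteger();
        for (int i = 0; i < workers; i++) {
            new Thread(() -> {
                completed.incrementAndGet(); // stand-in for the helper's real work
                finished.countDown();        // publish "done" to the waiting thread
            }, "helper-" + i).start();
        }
        try {
            // Generous upper bound: it only matters when something is really stuck.
            if (!finished.await(10, TimeUnit.SECONDS)) {
                throw new IllegalStateException("helper threads did not finish in time");
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for helpers", e);
        }
        return completed.get();
    }

    public static void main(String[] args) {
        System.out.println(runAndAwait(3)); // prints 3
    }
}
```

With this shape the result no longer depends on when the scheduler gives the waiting thread CPU time, only on the helpers having signalled completion.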
[jira] [Resolved] (KAFKA-17697) Fix flaky DefaultStateUpdaterTest.shouldRestoreSingleActiveStatefulTask
[ https://issues.apache.org/jira/browse/KAFKA-17697?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yu-Lin Chen resolved KAFKA-17697. - Resolution: Duplicate Close it as it was fixed by another PR: https://github.com/apache/kafka/pull/16562 > Fix flaky DefaultStateUpdaterTest.shouldRestoreSingleActiveStatefulTask > --- > > Key: KAFKA-17697 > URL: https://issues.apache.org/jira/browse/KAFKA-17697 > Project: Kafka > Issue Type: Bug > Components: streams, unit tests >Reporter: Yu-Lin Chen >Assignee: Yu-Lin Chen >Priority: Major > Attachments: 0001-reproduce-the-racing-issue.patch > > > 28 flaky out of 253 trunk build in the past 28 days. (github) ([Report > Link|https://ge.apache.org/scans/tests?search.rootProjectNames=kafka&search.startTimeMax=1727973081200&search.startTimeMin=172555200&search.tags=trunk,github&search.timeZoneId=Asia%2FTaipei&tests.container=org.apache.kafka.streams.processor.internals.DefaultStateUpdaterTest&tests.test=shouldRestoreSingleActiveStatefulTask()]) > The issue can be reproduced in my local env within 20 loops. Can also be > reproduced by the attached patch: [^0001-reproduce-the-racing-issue.patch] > ([Oct 2 2024 at 05:39:43 > CST|https://ge.apache.org/s/5gsvq5esvbouc/tests/task/:streams:test/details/org.apache.kafka.streams.processor.internals.DefaultStateUpdaterTest/shouldRestoreSingleActiveStatefulTask()?expanded-stacktrace=WyIwIl0&top-execution=1]) > {code:java} > org.opentest4j.AssertionFailedError: Condition not met within timeout 3. > Did not get all restored active task within the given timeout! 
==> expected: <true> but was: <false>
> at org.junit.jupiter.api.AssertionFailureBuilder.build(AssertionFailureBuilder.java:151)
> at org.junit.jupiter.api.AssertionFailureBuilder.buildAndThrow(AssertionFailureBuilder.java:132)
> at org.junit.jupiter.api.AssertTrue.failNotTrue(AssertTrue.java:63)
> at org.junit.jupiter.api.AssertTrue.assertTrue(AssertTrue.java:36)
> at org.junit.jupiter.api.Assertions.assertTrue(Assertions.java:214)
> at org.apache.kafka.test.TestUtils.lambda$waitForCondition$3(TestUtils.java:396)
> at org.apache.kafka.test.TestUtils.retryOnExceptionWithTimeout(TestUtils.java:444)
> at org.apache.kafka.test.TestUtils.waitForCondition(TestUtils.java:393)
> at org.apache.kafka.test.TestUtils.waitForCondition(TestUtils.java:377)
> at org.apache.kafka.test.TestUtils.waitForCondition(TestUtils.java:367)
> at org.apache.kafka.streams.processor.internals.DefaultStateUpdaterTest.verifyRestoredActiveTasks(DefaultStateUpdaterTest.java:1715)
> at org.apache.kafka.streams.processor.internals.DefaultStateUpdaterTest.shouldRestoreSingleActiveStatefulTask(DefaultStateUpdaterTest.java:340)
> at java.lang.reflect.Method.invoke(Method.java:566)
> at java.util.ArrayList.forEach(ArrayList.java:1541)
> at java.util.ArrayList.forEach(ArrayList.java:1541)
> {code}
> {*}Root Cause{*}: A race between the following two threads:
> 1. stateUpdater.add(task) in the test thread [1]
> 2. runOnce() in the StateUpdaterThread loop [2]
> In the scenario below, the StateUpdaterThread hangs even though there are updatingTasks.
> {*}Flaky scenario{*}: If stateUpdater.add(task) runs after the first runOnce() loop, the second loop hangs [3][4]. allWorkDone() in the second loop of runOnce() is true [5], even though updatingTasks.isEmpty() is false [6].
> Below is the flow of the flaky scenario:
> # runOnce() loop 1: completedChangelogs() returns an empty set.
> # runOnce() loop 1: allChangelogsCompleted() returns false, updatingTasks is empty, allWorkDone() is false. {color:#de350b}Called tasksAndActionsCondition.await(){color}. (Will be notified by stateUpdater.add(task) [1][7])
> # The test thread calls stateUpdater.add(task).
> # runOnce() loop 1: allChangelogsCompleted() returns false again before quitting the while loop. allWorkDone() is false because tasksAndActions is not empty. [8]
> # runOnce() loop 2: completedChangelogs() returns 1 topic partition.
> # runOnce() loop 2: allChangelogsCompleted() returns true, allWorkDone() is true, {color:#de350b}calls tasksAndActionsCondition.await() again{color} and is never notified.
>
> The happy path is: (stateUpdater.add(task) ran before the end of the first runOnce() loop)
> # runOnce() loop 1: completedChangelogs() returns an empty set.
> # runOnce() loop 1: allChangelogsCompleted() returns false, updatingTasks is not empty, allWorkDone() is false.
> # runOnce() loop 2: completedChangelogs() returns 1 topic partition.
> # runOnce() loop 2: allChangelogsCompleted() returns false, updatingTasks is not empty, allWorkDone() is false.
> # runOnce() loop 3: completedChangelogs() returns 2 topic partitions,
>
[jira] [Created] (KAFKA-17769) Fix flaky PlaintextConsumerSubscriptionTest.testSubscribeInvalidTopicCanUnsubscribe
Yu-Lin Chen created KAFKA-17769:
---
Summary: Fix flaky PlaintextConsumerSubscriptionTest.testSubscribeInvalidTopicCanUnsubscribe
Key: KAFKA-17769
URL: https://issues.apache.org/jira/browse/KAFKA-17769
Project: Kafka
Issue Type: Bug
Components: clients, consumer
Reporter: Yu-Lin Chen
Assignee: Yu-Lin Chen

4 flaky builds out of 110 trunk builds in the past 2 weeks. ([Report Link|https://ge.apache.org/scans/tests?search.rootProjectNames=kafka&search.startTimeMax=1728584869905&search.startTimeMin=172615680&search.tags=trunk&search.timeZoneId=Asia%2FTaipei&tests.container=kafka.api.PlaintextConsumerSubscriptionTest&tests.test=testSubscribeInvalidTopicCanUnsubscribe(String%2C%20String)%5B3%5D])

This issue can be reproduced locally within 50 loops. ([Oct 4 2024 at 10:35:49 CST|https://ge.apache.org/s/o4ir4xtitsu52/tests/task/:core:test/details/kafka.api.PlaintextConsumerSubscriptionTest/testSubscribeInvalidTopicCanUnsubscribe(String%2C%20String)%5B3%5D?top-execution=1]):
{code:java} org.apache.kafka.common.KafkaException: Failed to close kafka consumer at org.apache.kafka.clients.consumer.internals.AsyncKafkaConsumer.close(AsyncKafkaConsumer.java:1249) at org.apache.kafka.clients.consumer.internals.AsyncKafkaConsumer.close(AsyncKafkaConsumer.java:1204) at org.apache.kafka.clients.consumer.KafkaConsumer.close(KafkaConsumer.java:1718) at kafka.api.IntegrationTestHarness.$anonfun$tearDown$3(IntegrationTestHarness.scala:249) at kafka.api.IntegrationTestHarness.$anonfun$tearDown$3$adapted(IntegrationTestHarness.scala:249) at scala.collection.IterableOnceOps.foreach(IterableOnce.scala:619) at scala.collection.IterableOnceOps.foreach$(IterableOnce.scala:617) at scala.collection.AbstractIterable.foreach(Iterable.scala:935) at kafka.api.IntegrationTestHarness.tearDown(IntegrationTestHarness.scala:249) at java.lang.reflect.Method.invoke(Method.java:566) at java.util.stream.ForEachOps$ForEachOp$OfRef.accept(ForEachOps.java:183) at 
java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:195) at java.util.stream.ReferencePipeline$2$1.accept(ReferencePipeline.java:177) at java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:195) at java.util.stream.ForEachOps$ForEachOp$OfRef.accept(ForEachOps.java:183) at java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:195) at java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:195) at java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:195) at java.util.stream.ForEachOps$ForEachOp$OfRef.accept(ForEachOps.java:183) at java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:195) at java.util.Spliterators$ArraySpliterator.forEachRemaining(Spliterators.java:948) at java.util.stream.ReferencePipeline$Head.forEach(ReferencePipeline.java:658) at java.util.stream.ReferencePipeline$7$1.accept(ReferencePipeline.java:274) at java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:195) at java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:195) at java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:195) at java.util.Spliterators$ArraySpliterator.forEachRemaining(Spliterators.java:948) at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:484) at java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:474) at java.util.stream.ForEachOps$ForEachOp.evaluateSequential(ForEachOps.java:150) at java.util.stream.ForEachOps$ForEachOp$OfRef.evaluateSequential(ForEachOps.java:173) at java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234) at java.util.stream.ReferencePipeline.forEach(ReferencePipeline.java:497) at java.util.stream.ReferencePipeline$7$1.accept(ReferencePipeline.java:274) at java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:195) at java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:195) at 
java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:195) at java.util.ArrayList$ArrayListSpliterator.forEachRemaining(ArrayList.java:1655) at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:484) at java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:474) at java.util.stream.ForEachOps$ForEachOp.evaluateSequential(ForEachOps.java:150) at java.util.stream.ForEachOps$ForEachOp$OfRef.evaluateSequential(ForEachOps.java:173) at java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234) at java.util.stream.ReferencePipeline.forEach(ReferencePipeline.java:497) at java.util.stream.ReferencePipeline$7$1.accept(ReferencePipeline.java:274) at java.util.Arra
[jira] [Created] (KAFKA-17697) Fix flaky DefaultStateUpdaterTest.shouldRestoreSingleActiveStatefulTask
Yu-Lin Chen created KAFKA-17697:
---
Summary: Fix flaky DefaultStateUpdaterTest.shouldRestoreSingleActiveStatefulTask
Key: KAFKA-17697
URL: https://issues.apache.org/jira/browse/KAFKA-17697
Project: Kafka
Issue Type: Bug
Reporter: Yu-Lin Chen
Assignee: Yu-Lin Chen
Attachments: 0001-reproduce-the-racing-issue.patch

28 flaky builds out of 253 trunk builds in the past 28 days. (github) ([Report Link|https://ge.apache.org/scans/tests?search.rootProjectNames=kafka&search.startTimeMax=1727973081200&search.startTimeMin=172555200&search.tags=trunk,github&search.timeZoneId=Asia%2FTaipei&tests.container=org.apache.kafka.streams.processor.internals.DefaultStateUpdaterTest&tests.test=shouldRestoreSingleActiveStatefulTask()])

The issue can be reproduced in my local environment within 20 loops. It can also be reproduced with the attached patch: ([Oct 2 2024 at 05:39:43 CST|https://ge.apache.org/s/5gsvq5esvbouc/tests/task/:streams:test/details/org.apache.kafka.streams.processor.internals.DefaultStateUpdaterTest/shouldRestoreSingleActiveStatefulTask()?expanded-stacktrace=WyIwIl0&top-execution=1])
{code:java}
org.opentest4j.AssertionFailedError: Condition not met within timeout 3. Did not get all restored active task within the given timeout! ==> expected: <true> but was: <false>
    at org.junit.jupiter.api.AssertionFailureBuilder.build(AssertionFailureBuilder.java:151)
    at org.junit.jupiter.api.AssertionFailureBuilder.buildAndThrow(AssertionFailureBuilder.java:132)
    at org.junit.jupiter.api.AssertTrue.failNotTrue(AssertTrue.java:63)
    at org.junit.jupiter.api.AssertTrue.assertTrue(AssertTrue.java:36)
    at org.junit.jupiter.api.Assertions.assertTrue(Assertions.java:214)
    at org.apache.kafka.test.TestUtils.lambda$waitForCondition$3(TestUtils.java:396)
    at org.apache.kafka.test.TestUtils.retryOnExceptionWithTimeout(TestUtils.java:444)
    at org.apache.kafka.test.TestUtils.waitForCondition(TestUtils.java:393)
    at org.apache.kafka.test.TestUtils.waitForCondition(TestUtils.java:377)
    at org.apache.kafka.test.TestUtils.waitForCondition(TestUtils.java:367)
    at org.apache.kafka.streams.processor.internals.DefaultStateUpdaterTest.verifyRestoredActiveTasks(DefaultStateUpdaterTest.java:1715)
    at org.apache.kafka.streams.processor.internals.DefaultStateUpdaterTest.shouldRestoreSingleActiveStatefulTask(DefaultStateUpdaterTest.java:340)
    at java.lang.reflect.Method.invoke(Method.java:566)
    at java.util.ArrayList.forEach(ArrayList.java:1541)
    at java.util.ArrayList.forEach(ArrayList.java:1541)
{code}
{*}Root Cause{*}: A race between the following two threads:
1. stateUpdater.add(task) in the test thread [1]
2. runOnce() in the StateUpdaterThread loop [2]

In the scenario below, the StateUpdaterThread hangs even though there are updatingTasks.

{*}Flaky scenario{*}: If stateUpdater.add(task) runs after the first runOnce() loop, the second loop hangs [3][4]. allWorkDone() will always be true [5], even though updatingTasks.isEmpty() is false [6].

Below is the flow of the flaky scenario:
# runOnce() loop 1: completedChangelogs() returns an empty set.
# runOnce() loop 1: allChangelogsCompleted() returns false, updatingTasks is empty, allWorkDone() is false. Called tasksAndActionsCondition.await(). (Will be notified by stateUpdater.add(task) [1][7])
# runOnce() loop 1: allChangelogsCompleted() returns false again before quitting the while loop. [8]
# runOnce() loop 2: completedChangelogs() returns 1 topic partition.
# runOnce() loop 2: allChangelogsCompleted() returns true, calls tasksAndActionsCondition.await() again and is never notified.

The happy path is: (stateUpdater.add(task) ran before the end of the first runOnce() loop)
# runOnce() loop 1: completedChangelogs() returns an empty set.
# runOnce() loop 1: allChangelogsCompleted() returns false, updatingTasks is not empty, allWorkDone() is false.
# runOnce() loop 2: completedChangelogs() returns 1 topic partition.
# runOnce() loop 2: allChangelogsCompleted() returns false, updatingTasks is not empty, allWorkDone() is false.
# runOnce() loop 3: completedChangelogs() returns 2 topic partitions, +move the task to restoredActiveTasks+ [10]
# runOnce() loop 3: allChangelogsCompleted() returns true (doesn't matter)

[1] [https://github.com/apache/kafka/blob/0edf5dbd204df9eb62bfea1b56993e95737df5a3/streams/src/test/java/org/apache/kafka/streams/processor/internals/DefaultStateUpdaterTest.java#L338]
[2] [https://github.com/apache/kafka/blob/0edf5dbd204df9eb62bfea1b56993e95737df5a3/streams/src/main/java/org/apache/kafka/streams/processor/internals/DefaultStateUpdater.java#L177-L198]
[3] [https://github.com/apache/kafka/blob/0edf5dbd204df9eb62bfea1b56993e95737df5a3/streams/src/main/java/org/apache/kafka/streams/processor/internals/DefaultStateUpdater.java#L191]
[4] [https://github.com/apache/kafka/blob/0edf5dbd204df9eb62bfea1b56993e95
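For illustration only: the hang described in this issue has the general shape of a condition wait whose predicate does not cover every kind of outstanding work. The sketch below is not the DefaultStateUpdater code. GuardedWaitSketch, pendingAdds (a loose analogue of tasksAndActions) and updating (a loose analogue of updatingTasks) are hypothetical names. It shows a wait that re-checks, under the lock, a predicate covering both queued additions and tasks already being worked on.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.Condition;
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical sketch, not the DefaultStateUpdater code.
final class GuardedWaitSketch {
    private final ReentrantLock lock = new ReentrantLock();
    private final Condition workAvailable = lock.newCondition();
    private final Deque<String> pendingAdds = new ArrayDeque<>(); // queued by other threads
    private final List<String> updating = new ArrayList<>();      // work already in progress

    /** Called by a producer thread, e.g. a test thread adding a task. */
    void add(String task) {
        lock.lock();
        try {
            pendingAdds.add(task);
            workAvailable.signalAll();
        } finally {
            lock.unlock();
        }
    }

    /** Blocks until there is any work; returns false if none shows up within the timeout. */
    boolean awaitWork(long timeoutMs) {
        lock.lock();
        try {
            long remainingNanos = TimeUnit.MILLISECONDS.toNanos(timeoutMs);
            // The predicate covers BOTH collections and is re-checked after every
            // wakeup, so an add() that happened before the wait is never lost and
            // in-progress work never lets the thread go to sleep.
            while (pendingAdds.isEmpty() && updating.isEmpty()) {
                if (remainingNanos <= 0) {
                    return false;
                }
                remainingNanos = workAvailable.awaitNanos(remainingNanos);
            }
            while (!pendingAdds.isEmpty()) {
                updating.add(pendingAdds.poll()); // promote queued tasks to active work
            }
            return true;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        } finally {
            lock.unlock();
        }
    }
}
```

The point of the sketch is the loop condition: whichever order add() and the wait happen in, the waiting thread only sleeps when there is truly nothing to do.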
[jira] [Resolved] (KAFKA-17515) Fix flaky RestoreIntegrationTest.shouldInvokeUserDefinedGlobalStateRestoreListener
[ https://issues.apache.org/jira/browse/KAFKA-17515?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yu-Lin Chen resolved KAFKA-17515. - Resolution: Fixed Since the Jenkins build on the trunk has been disabled today, close the second PR([pull/17272|https://github.com/apache/kafka/pull/]), which increased the timeout. I will keep monitoring the build results. > Fix flaky > RestoreIntegrationTest.shouldInvokeUserDefinedGlobalStateRestoreListener > -- > > Key: KAFKA-17515 > URL: https://issues.apache.org/jira/browse/KAFKA-17515 > Project: Kafka > Issue Type: Bug > Components: streams, unit tests >Reporter: Chia-Ping Tsai >Assignee: Yu-Lin Chen >Priority: Major > Fix For: 4.0.0 > > Attachments: Reproduced screenshoot in my env (Loop 7).png > > > {code:java} > Stacktrace > java.nio.file.DirectoryNotEmptyException: > /tmp/shouldInvokeUserDefinedGlobalStateRestoreListenerH0u0n9foRY_peZu4FqeGHQ1045955704739924-ks1/shouldInvokeUserDefinedGlobalStateRestoreListenerH0u0n9foRY_peZu4FqeGHQ/0_0 > at > java.base/sun.nio.fs.UnixFileSystemProvider.implDelete(UnixFileSystemProvider.java:289) > at > java.base/sun.nio.fs.AbstractFileSystemProvider.deleteIfExists(AbstractFileSystemProvider.java:109) > at java.base/java.nio.file.Files.deleteIfExists(Files.java:1191) > at > org.apache.kafka.common.utils.Utils$1.postVisitDirectory(Utils.java:898) > at > org.apache.kafka.common.utils.Utils$1.postVisitDirectory(Utils.java:870) > at java.base/java.nio.file.Files.walkFileTree(Files.java:2803) > at java.base/java.nio.file.Files.walkFileTree(Files.java:2857) > at org.apache.kafka.common.utils.Utils.delete(Utils.java:870) > at > org.apache.kafka.streams.integration.utils.IntegrationTestUtils.purgeLocalStreamsState(IntegrationTestUtils.java:266) > at > org.apache.kafka.streams.integration.utils.IntegrationTestUtils.purgeLocalStreamsState(IntegrationTestUtils.java:278) > at > 
org.apache.kafka.streams.integration.RestoreIntegrationTest.shouldInvokeUserDefinedGlobalStateRestoreListener(RestoreIntegrationTest.java:583) > at java.base/java.lang.reflect.Method.invoke(Method.java:580) > at java.base/java.util.ArrayList.forEach(ArrayList.java:1596) > at java.base/java.util.ArrayList.forEach(ArrayList.java:1596) > {code}
[jira] [Created] (KAFKA-17646) Fix flaky KafkaStreamsTest.testStateGlobalThreadClose
Yu-Lin Chen created KAFKA-17646: --- Summary: Fix flaky KafkaStreamsTest.testStateGlobalThreadClose Key: KAFKA-17646 URL: https://issues.apache.org/jira/browse/KAFKA-17646 Project: Kafka Issue Type: Bug Components: streams, unit tests Reporter: Yu-Lin Chen Assignee: Yu-Lin Chen 22 flaky builds out of 584 in the past 28 days. (13 from jenkins, 9 from github) ([Report Link|https://ge.apache.org/scans/tests?search.rootProjectNames=kafka&search.startTimeMax=1727517159423&search.startTimeMin=172503360&search.tags=trunk&search.timeZoneId=Asia%2FTaipei&tests.container=org.apache.kafka.streams.KafkaStreamsTest&tests.test=testStateGlobalThreadClose()]) Two types of error messages: 1. Unexpected state transition from ERROR to PENDING_SHUTDOWN ([Sep 24 2024 at 15:45:33 CST|https://ge.apache.org/s/ewblfqjcre6gu/tests/task/:streams:test/details/org.apache.kafka.streams.KafkaStreamsTest/testStateGlobalThreadClose()?expanded-stacktrace=WyIwIl0&top-execution=2]) {code:java} java.lang.IllegalStateException: Stream-client test-client: Unexpected state transition from ERROR to PENDING_SHUTDOWN at org.apache.kafka.streams.KafkaStreams.setState(KafkaStreams.java:343) at org.apache.kafka.streams.KafkaStreams.close(KafkaStreams.java:1566) at org.apache.kafka.streams.KafkaStreams.close(KafkaStreams.java:1459) at org.apache.kafka.streams.KafkaStreamsTest.testStateGlobalThreadClose(KafkaStreamsTest.java:550) at java.lang.reflect.Method.invoke(Method.java:569) at java.util.ArrayList.forEach(ArrayList.java:1511) at java.util.ArrayList.forEach(ArrayList.java:1511) {code} 2. org.opentest4j.AssertionFailedError: Condition not met within timeout 15000. Thread never stopped. 
==> expected: <true> but was: <false> ([Sep 26 2024 at 09:36:22 CST|https://ge.apache.org/s/hetqkmasks5ve/tests/task/:streams:test/details/org.apache.kafka.streams.KafkaStreamsTest/testStateGlobalThreadClose()?expanded-stacktrace=WyIwIl0&top-execution=1])
{code:java}
    at org.junit.jupiter.api.AssertionFailureBuilder.build(AssertionFailureBuilder.java:151)
    at org.junit.jupiter.api.AssertionFailureBuilder.buildAndThrow(AssertionFailureBuilder.java:132)
    at org.junit.jupiter.api.AssertTrue.failNotTrue(AssertTrue.java:63)
    at org.junit.jupiter.api.AssertTrue.assertTrue(AssertTrue.java:36)
    at org.junit.jupiter.api.Assertions.assertTrue(Assertions.java:214)
    at org.apache.kafka.test.TestUtils.lambda$waitForCondition$3(TestUtils.java:396)
    at org.apache.kafka.test.TestUtils.retryOnExceptionWithTimeout(TestUtils.java:444)
    at org.apache.kafka.test.TestUtils.waitForCondition(TestUtils.java:393)
    at org.apache.kafka.test.TestUtils.waitForCondition(TestUtils.java:377)
    at org.apache.kafka.test.TestUtils.waitForCondition(TestUtils.java:350)
    at org.apache.kafka.streams.KafkaStreamsTest.testStateGlobalThreadClose(KafkaStreamsTest.java:546)
    at java.lang.reflect.Method.invoke(Method.java:569)
    at java.util.ArrayList.forEach(ArrayList.java:1511)
    at java.util.ArrayList.forEach(ArrayList.java:1511)
{code}
[jira] [Reopened] (KAFKA-17515) Fix flaky RestoreIntegrationTest.shouldInvokeUserDefinedGlobalStateRestoreListener
[ https://issues.apache.org/jira/browse/KAFKA-17515?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yu-Lin Chen reopened KAFKA-17515: - > Fix flaky > RestoreIntegrationTest.shouldInvokeUserDefinedGlobalStateRestoreListener > -- > > Key: KAFKA-17515 > URL: https://issues.apache.org/jira/browse/KAFKA-17515 > Project: Kafka > Issue Type: Bug > Components: streams, unit tests >Reporter: Chia-Ping Tsai >Assignee: Yu-Lin Chen >Priority: Major > Fix For: 4.0.0 > > Attachments: Reproduced screenshoot in my env (Loop 7).png > > > {code:java} > Stacktrace > java.nio.file.DirectoryNotEmptyException: > /tmp/shouldInvokeUserDefinedGlobalStateRestoreListenerH0u0n9foRY_peZu4FqeGHQ1045955704739924-ks1/shouldInvokeUserDefinedGlobalStateRestoreListenerH0u0n9foRY_peZu4FqeGHQ/0_0 > at > java.base/sun.nio.fs.UnixFileSystemProvider.implDelete(UnixFileSystemProvider.java:289) > at > java.base/sun.nio.fs.AbstractFileSystemProvider.deleteIfExists(AbstractFileSystemProvider.java:109) > at java.base/java.nio.file.Files.deleteIfExists(Files.java:1191) > at > org.apache.kafka.common.utils.Utils$1.postVisitDirectory(Utils.java:898) > at > org.apache.kafka.common.utils.Utils$1.postVisitDirectory(Utils.java:870) > at java.base/java.nio.file.Files.walkFileTree(Files.java:2803) > at java.base/java.nio.file.Files.walkFileTree(Files.java:2857) > at org.apache.kafka.common.utils.Utils.delete(Utils.java:870) > at > org.apache.kafka.streams.integration.utils.IntegrationTestUtils.purgeLocalStreamsState(IntegrationTestUtils.java:266) > at > org.apache.kafka.streams.integration.utils.IntegrationTestUtils.purgeLocalStreamsState(IntegrationTestUtils.java:278) > at > org.apache.kafka.streams.integration.RestoreIntegrationTest.shouldInvokeUserDefinedGlobalStateRestoreListener(RestoreIntegrationTest.java:583) > at java.base/java.lang.reflect.Method.invoke(Method.java:580) > at java.base/java.util.ArrayList.forEach(ArrayList.java:1596) > at 
java.base/java.util.ArrayList.forEach(ArrayList.java:1596) > {code}
[jira] [Reopened] (KAFKA-18036) TransactionsWithTieredStoreTest testReadCommittedConsumerShouldNotSeeUndecidedData is flaky
[ https://issues.apache.org/jira/browse/KAFKA-18036?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yu-Lin Chen reopened KAFKA-18036: - Reopen this Jira as the issue still exists in trunk and can be reproduced locally within 20 loops. I'm working on troubleshooting it. > TransactionsWithTieredStoreTest > testReadCommittedConsumerShouldNotSeeUndecidedData is flaky > --- > > Key: KAFKA-18036 > URL: https://issues.apache.org/jira/browse/KAFKA-18036 > Project: Kafka > Issue Type: Test >Reporter: David Arthur >Assignee: Chia-Ping Tsai >Priority: Major > Labels: flaky-test > > https://ge.apache.org/scans/tests?search.names=CI%20workflow%2CGit%20repository&search.rootProjectNames=kafka&search.tags=github%2Ctrunk&search.tasks=test&search.timeZoneId=America%2FNew_York&search.values=CI%2Chttps:%2F%2Fgithub.com%2Fapache%2Fkafka&tests.container=org.apache.kafka.tiered.storage.integration.TransactionsWithTieredStoreTest&tests.sortField=FLAKY&tests.test=testReadCommittedConsumerShouldNotSeeUndecidedData(String%2C%20String)%5B2%5D
[jira] [Created] (KAFKA-18770) Fix flaky initializationError in ReplicationQuotasTest/RequestQuotaTest (thread leak)
Yu-Lin Chen created KAFKA-18770:
---
Summary: Fix flaky initializationError in ReplicationQuotasTest/RequestQuotaTest (thread leak)
Key: KAFKA-18770
URL: https://issues.apache.org/jira/browse/KAFKA-18770
Project: Kafka
Issue Type: Bug
Components: core
Reporter: Yu-Lin Chen

An unexpected thread was detected in recent CI runs.
* ReplicationQuotasTest#initializationError
* RequestQuotaTest#initializationError
{code:java}
org.opentest4j.AssertionFailedError: Found 1 unexpected threads during @BeforeAll: executor-ShareFetch ==> expected: but was:
{code}
Failed CI:
* [https://github.com/apache/kafka/actions/runs/13203924058]
* [https://github.com/apache/kafka/actions/runs/13237077533/job/36944376672]
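For illustration only: a check of this kind can be approximated by listing the live threads whose names start with a given prefix. This is a sketch using only the JDK, not Kafka's actual thread-leak detection; ThreadLeakCheckSketch and liveThreadsWithPrefix are hypothetical names.

```java
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical sketch, not Kafka's test utility: find live threads by name prefix.
final class ThreadLeakCheckSketch {

    /** Returns the sorted names of all live threads whose name starts with `prefix`. */
    static List<String> liveThreadsWithPrefix(String prefix) {
        return Thread.getAllStackTraces().keySet().stream()
                .filter(Thread::isAlive)
                .map(Thread::getName)
                .filter(name -> name.startsWith(prefix))
                .sorted()
                .collect(Collectors.toList());
    }
}
```

A test setup hook could call liveThreadsWithPrefix("executor-ShareFetch") and fail when the list is not empty, which would point at a thread left running by an earlier test in the same JVM.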
[jira] [Created] (KAFKA-18469) AsyncConsumer fails to retry ListOffsetRequest on ReplicaNotAvailable error without metadata update
Yu-Lin Chen created KAFKA-18469:
---
Summary: AsyncConsumer fails to retry ListOffsetRequest on ReplicaNotAvailable error without metadata update
Key: KAFKA-18469
URL: https://issues.apache.org/jira/browse/KAFKA-18469
Project: Kafka
Issue Type: Bug
Components: clients, consumer
Reporter: Yu-Lin Chen
Assignee: Yu-Lin Chen

In AsyncConsumer, the ListOffsetRequest is only retried after a metadata update [1]. However, not every retriable error is followed by a metadata update; one example is the ReplicaNotAvailable error from remote storage. Such an error causes Consumer#offsetsForTimes to fail after the API timeout (60 seconds).

This issue does not happen in ClassicConsumer [2]. We should keep the behavior aligned.

This issue is the root cause of the flaky test KAFKA-18036, where consumer#offsetsForTimes is called before the remote metadata cache is initialized.

[1] https://github.com/apache/kafka/blob/5684fc7a2ee1a4f29cb6d69d713233ed3c297882/clients/src/main/java/org/apache/kafka/clients/consumer/internals/OffsetsRequestManager.java#L529-L534
[2] [https://github.com/apache/kafka/blob/5684fc7a2ee1a4f29cb6d69d713233ed3c297882/clients/src/main/java/org/apache/kafka/clients/consumer/internals/OffsetFetcher.java#L144-L153]
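For illustration only: the behavior asked for here amounts to retrying on any retriable error until the overall timeout expires, without waiting for an unrelated event such as a metadata update. The sketch below is not the OffsetsRequestManager or OffsetFetcher code. RetryUntilDeadlineSketch, its nested RetriableException and the callWithRetry parameters are hypothetical and do not use Kafka classes.

```java
import java.util.concurrent.TimeUnit;
import java.util.function.LongSupplier;

// Hypothetical sketch, not Kafka client code: retry on every retriable error
// with a fixed backoff until an overall deadline is reached.
final class RetryUntilDeadlineSketch {

    /** Hypothetical marker for errors that are worth retrying. */
    static final class RetriableException extends RuntimeException {
        RetriableException(String message) {
            super(message);
        }
    }

    static long callWithRetry(LongSupplier request, long timeoutMs, long backoffMs) {
        long deadlineNanos = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(timeoutMs);
        while (true) {
            try {
                return request.getAsLong();
            } catch (RetriableException e) {
                long remainingNanos = deadlineNanos - System.nanoTime();
                // Give up only when another backoff would overrun the deadline.
                if (remainingNanos <= TimeUnit.MILLISECONDS.toNanos(backoffMs)) {
                    throw new IllegalStateException("timed out after retriable error", e);
                }
                try {
                    Thread.sleep(backoffMs);
                } catch (InterruptedException ie) {
                    Thread.currentThread().interrupt();
                    throw new IllegalStateException("interrupted during backoff", ie);
                }
            }
        }
    }
}
```

In this shape the retry is driven by the error itself and the backoff timer, so a transient error that is never followed by a metadata update still gets another attempt before the timeout.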
[jira] [Resolved] (KAFKA-18298) Fix flaky PlaintextAdminIntegrationTest#testConsumerGroupsDeprecatedConsumerGroupState
[ https://issues.apache.org/jira/browse/KAFKA-18298?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yu-Lin Chen resolved KAFKA-18298. - Resolution: Fixed Locally validated that the issue was fixed after this PR. (No error in 200 loops.) * [https://github.com/apache/kafka/commit/48b522fe86984db32556e48b799e8797ae2236ba] > Fix flaky > PlaintextAdminIntegrationTest#testConsumerGroupsDeprecatedConsumerGroupState > -- > > Key: KAFKA-18298 > URL: https://issues.apache.org/jira/browse/KAFKA-18298 > Project: Kafka > Issue Type: Bug > Components: clients, consumer >Reporter: Chia-Ping Tsai >Assignee: Chia-Ping Tsai >Priority: Blocker > Labels: flaky-test, integration-test, kip-848-client-support > Fix For: 4.0.0 > > > org.opentest4j.AssertionFailedError: Expected the offset for partition 0 to > eventually become 1. at > app//org.junit.jupiter.api.AssertionUtils.fail(AssertionUtils.java:38) at > app//org.junit.jupiter.api.Assertions.fail(Assertions.java:138) at > app//kafka.api.PlaintextAdminIntegrationTest.testConsumerGroupsDeprecatedConsumerGroupState(PlaintextAdminIntegrationTest.scala:2360) > at java.base@23.0.1/java.lang.reflect.Method.invoke(Method.java:580)
[jira] [Resolved] (KAFKA-18036) TransactionsWithTieredStoreTest testReadCommittedConsumerShouldNotSeeUndecidedData is flaky
[ https://issues.apache.org/jira/browse/KAFKA-18036?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yu-Lin Chen resolved KAFKA-18036. - Resolution: Fixed The fix for KAFKA-18469 has been merged, so this issue should be resolved. > TransactionsWithTieredStoreTest > testReadCommittedConsumerShouldNotSeeUndecidedData is flaky > --- > > Key: KAFKA-18036 > URL: https://issues.apache.org/jira/browse/KAFKA-18036 > Project: Kafka > Issue Type: Test >Reporter: David Arthur >Assignee: Yu-Lin Chen >Priority: Major > Labels: flaky-test > Attachments: 0001-Reproduce-Kafka-18036.patch > > > https://ge.apache.org/scans/tests?search.names=CI%20workflow%2CGit%20repository&search.rootProjectNames=kafka&search.tags=github%2Ctrunk&search.tasks=test&search.timeZoneId=America%2FNew_York&search.values=CI%2Chttps:%2F%2Fgithub.com%2Fapache%2Fkafka&tests.container=org.apache.kafka.tiered.storage.integration.TransactionsWithTieredStoreTest&tests.sortField=FLAKY&tests.test=testReadCommittedConsumerShouldNotSeeUndecidedData(String%2C%20String)%5B2%5D
[jira] [Reopened] (KAFKA-18298) Fix flaky PlaintextAdminIntegrationTest#testConsumerGroupsDeprecatedConsumerGroupState
[ https://issues.apache.org/jira/browse/KAFKA-18298?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yu-Lin Chen reopened KAFKA-18298:
-
Reopening this Jira because other flaky failures were found in CI.
# org.opentest4j.AssertionFailedError: expected: <2> but was: <3> ([Report|https://ge.apache.org/s/ndoj6s2stb446/tests/task/:core:test/details/kafka.api.PlaintextAdminIntegrationTest/testConsumerGroupsDeprecatedConsumerGroupState(String%2C%20String)%5B1%5D?expanded-stacktrace=WyIxIl0&top-execution=2])
# org.opentest4j.AssertionFailedError: expected: but was: ([Report|https://ge.apache.org/s/kh3jze2tc5qeu/tests/task/:core:test/details/kafka.api.PlaintextAdminIntegrationTest/testConsumerGroupsDeprecatedConsumerGroupState(String%2C%20String)%5B1%5D?top-execution=1])

The flakiness can't be reproduced in my local environment, but it can be simulated with the attached patch (uncomment the 1-second sleep).

The root cause is that the member rejoins the group after it was removed from the group. The same issue occurs in KAFKA-18297.

> Fix flaky PlaintextAdminIntegrationTest#testConsumerGroupsDeprecatedConsumerGroupState
> --
>
> Key: KAFKA-18298
> URL: https://issues.apache.org/jira/browse/KAFKA-18298
> Project: Kafka
> Issue Type: Bug
> Components: clients, consumer
> Reporter: Chia-Ping Tsai
> Assignee: Chia-Ping Tsai
> Priority: Blocker
> Labels: flaky-test, integration-test, kip-848-client-support
> Fix For: 4.0.0
>
> org.opentest4j.AssertionFailedError: Expected the offset for partition 0 to eventually become 1.
> at app//org.junit.jupiter.api.AssertionUtils.fail(AssertionUtils.java:38)
> at app//org.junit.jupiter.api.Assertions.fail(Assertions.java:138)
> at app//kafka.api.PlaintextAdminIntegrationTest.testConsumerGroupsDeprecatedConsumerGroupState(PlaintextAdminIntegrationTest.scala:2360)
> at java.base@23.0.1/java.lang.reflect.Method.invoke(Method.java:580)
[jira] [Created] (KAFKA-18501) Fix flaky ClientQuotasRequestTest.testAlterClientQuotasInvalidRequests
Yu-Lin Chen created KAFKA-18501: --- Summary: Fix flaky ClientQuotasRequestTest.testAlterClientQuotasInvalidRequests Key: KAFKA-18501 URL: https://issues.apache.org/jira/browse/KAFKA-18501 Project: Kafka Issue Type: Bug Components: unit tests Reporter: Yu-Lin Chen Thread leak detected in this PR: * https://github.com/apache/kafka/actions/runs/12753159081/job/35544671788#step:11:1451 {code:java} org.opentest4j.AssertionFailedError: Thread leak detected: controller-0-ThrottledChannelReaper-Produce ==> expected: but was: at app//org.apache.kafka.common.test.api.ClusterTestExtensions.afterEach(ClusterTestExtensions.java:158) at java.base@23.0.1/java.util.stream.ForEachOps$ForEachOp$OfRef.accept(ForEachOps.java:184) at java.base@23.0.1/java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:215) at java.base@23.0.1/java.util.stream.ReferencePipeline$2$1.accept(ReferencePipeline.java:197) at java.base@23.0.1/java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:215) at java.base@23.0.1/java.util.ArrayList$ArrayListSpliterator.forEachRemaining(ArrayList.java:1709) at java.base@23.0.1/java.util.stream.ReferencePipeline$Head.forEach(ReferencePipeline.java:807) at java.base@23.0.1/java.util.stream.ReferencePipeline$7$1FlatMap.accept(ReferencePipeline.java:294) at java.base@23.0.1/java.util.ArrayList$ArrayListSpliterator.forEachRemaining(ArrayList.java:1709) at java.base@23.0.1/java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:570) at java.base@23.0.1/java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:560) at java.base@23.0.1/java.util.stream.ForEachOps$ForEachOp.evaluateSequential(ForEachOps.java:151) at java.base@23.0.1/java.util.stream.ForEachOps$ForEachOp$OfRef.evaluateSequential(ForEachOps.java:174) at java.base@23.0.1/java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:265) at java.base@23.0.1/java.util.stream.ReferencePipeline.forEach(ReferencePipeline.java:636) at 
java.base@23.0.1/java.util.ArrayList.forEach(ArrayList.java:1597) at java.base@23.0.1/java.util.ArrayList.forEach(ArrayList.java:1597) {code}