[jira] [Created] (KAFKA-17661) Fix flaky BufferPoolTest.testBlockTimeout

2024-09-30 Thread Yu-Lin Chen (Jira)
Yu-Lin Chen created KAFKA-17661:
---

 Summary: Fix flaky BufferPoolTest.testBlockTimeout
 Key: KAFKA-17661
 URL: https://issues.apache.org/jira/browse/KAFKA-17661
 Project: Kafka
  Issue Type: Bug
  Components: clients, unit tests
Reporter: Yu-Lin Chen
Assignee: Yu-Lin Chen
 Attachments: 
0001-reproduce-racing-issue-by-adding-delay-to-test-thread.patch

4 flaky runs out of 221 trunk builds in the past 28 days. (github) ([Report 
Link|https://ge.apache.org/scans/tests?search.rootProjectNames=kafka&search.startTimeMax=1727681219558&search.startTimeMin=172520640&search.tags=github&search.timeZoneId=Asia%2FTaipei&tests.container=org.apache.kafka.clients.producer.internals.BufferPoolTest&tests.test=testBlockTimeout()])

([Sep 27 2024 at 
04:54:00|https://ge.apache.org/s/nh44u7tsm2lri/tests/task/:clients:test/details/org.apache.kafka.clients.producer.internals.BufferPoolTest/testBlockTimeout()?expanded-stacktrace=WyIwIl0&top-execution=1])
{code:java}
org.opentest4j.AssertionFailedError: The buffer allocated more memory than its 
maximum value 10 

at org.junit.jupiter.api.AssertionUtils.fail(AssertionUtils.java:38)
at org.junit.jupiter.api.Assertions.fail(Assertions.java:138)   
at 
org.apache.kafka.clients.producer.internals.BufferPoolTest.testBlockTimeout(BufferPoolTest.java:184)
 
at java.lang.reflect.Method.invoke(Method.java:566) 
at java.util.ArrayList.forEach(ArrayList.java:1541) 
at java.util.ArrayList.forEach(ArrayList.java:1541)
{code}
Root cause:
 # The test relies on 3 asynchronous threads running in parallel with the test 
thread [1]. However, parallelism is not guaranteed in the test environment. The 
failure occurs if the test thread does not get CPU time within 25 ms. We can 
reproduce it by adding a 30 ms delay to the test thread; please see the 
attached patch.
 # Since a 25 ms window is clearly unreliable in the test environment, we could 
consider rewriting the test or increasing the delay. (maxBlockTimeMs was 
reduced from 2000 ms to 10 ms in KAFKA-9852.)

[1] 
[https://github.com/apache/kafka/blob/40360819bb97d6b05dfef6451888b4d908fc3bf4/clients/src/test/java/org/apache/kafka/clients/producer/internals/BufferPoolTest.java#L175-L179]
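
A minimal sketch of one way a test could avoid depending on a fixed time window like the one described above. It assumes the background threads can be held behind an explicit gate until the test thread has finished the step that must come first. The class LatchGateSketch, the method runGated and the counter are hypothetical and are not taken from BufferPoolTest.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical illustration only; none of these names come from BufferPoolTest.
final class LatchGateSketch {

    // Returns {value observed before the gate opened, value observed after all workers finished}.
    static int[] runGated(int workers) {
        CountDownLatch gate = new CountDownLatch(1);
        CountDownLatch done = new CountDownLatch(workers);
        AtomicInteger released = new AtomicInteger();

        for (int i = 0; i < workers; i++) {
            Thread t = new Thread(() -> {
                try {
                    gate.await();               // wait for the test thread, not for a timer
                    released.incrementAndGet(); // stands in for "give memory back to the pool"
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                } finally {
                    done.countDown();
                }
            }, "gated-worker-" + i);
            t.setDaemon(true);
            t.start();
        }

        // No worker can have passed the gate yet, however slowly this thread is scheduled.
        int before = released.get();
        gate.countDown();
        try {
            done.await(10, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return new int[] {before, released.get()};
    }

    public static void main(String[] args) {
        int[] observed = runGated(3);
        System.out.println("before gate: " + observed[0] + ", after workers: " + observed[1]);
    }
}
```

With a gate like this the ordering no longer depends on how quickly the test thread is scheduled. Whether the real test can be restructured this way is an open question.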



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (KAFKA-17697) Fix flaky DefaultStateUpdaterTest.shouldRestoreSingleActiveStatefulTask

2024-10-16 Thread Yu-Lin Chen (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-17697?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yu-Lin Chen resolved KAFKA-17697.
-
Resolution: Duplicate

Closing this as it was fixed by another PR: 
https://github.com/apache/kafka/pull/16562

> Fix flaky DefaultStateUpdaterTest.shouldRestoreSingleActiveStatefulTask
> ---
>
> Key: KAFKA-17697
> URL: https://issues.apache.org/jira/browse/KAFKA-17697
> Project: Kafka
>  Issue Type: Bug
>  Components: streams, unit tests
>Reporter: Yu-Lin Chen
>Assignee: Yu-Lin Chen
>Priority: Major
> Attachments: 0001-reproduce-the-racing-issue.patch
>
>
> 28 flaky runs out of 253 trunk builds in the past 28 days. (github) ([Report 
> Link|https://ge.apache.org/scans/tests?search.rootProjectNames=kafka&search.startTimeMax=1727973081200&search.startTimeMin=172555200&search.tags=trunk,github&search.timeZoneId=Asia%2FTaipei&tests.container=org.apache.kafka.streams.processor.internals.DefaultStateUpdaterTest&tests.test=shouldRestoreSingleActiveStatefulTask()])
> The issue can be reproduced in my local env within 20 loops. Can also be 
> reproduced by the attached patch: [^0001-reproduce-the-racing-issue.patch]
>  ([Oct 2 2024 at 05:39:43 
> CST|https://ge.apache.org/s/5gsvq5esvbouc/tests/task/:streams:test/details/org.apache.kafka.streams.processor.internals.DefaultStateUpdaterTest/shouldRestoreSingleActiveStatefulTask()?expanded-stacktrace=WyIwIl0&top-execution=1])
> {code:java}
> org.opentest4j.AssertionFailedError: Condition not met within timeout 3. 
> Did not get all restored active task within the given timeout! ==> expected: 
> <true> but was: <false> 
> at 
> org.junit.jupiter.api.AssertionFailureBuilder.build(AssertionFailureBuilder.java:151)
>   
> at 
> org.junit.jupiter.api.AssertionFailureBuilder.buildAndThrow(AssertionFailureBuilder.java:132)
>   
> at org.junit.jupiter.api.AssertTrue.failNotTrue(AssertTrue.java:63)   
> at org.junit.jupiter.api.AssertTrue.assertTrue(AssertTrue.java:36)
> at org.junit.jupiter.api.Assertions.assertTrue(Assertions.java:214)   
> at 
> org.apache.kafka.test.TestUtils.lambda$waitForCondition$3(TestUtils.java:396) 
>  
> at 
> org.apache.kafka.test.TestUtils.retryOnExceptionWithTimeout(TestUtils.java:444)
> 
> at org.apache.kafka.test.TestUtils.waitForCondition(TestUtils.java:393)   
> at org.apache.kafka.test.TestUtils.waitForCondition(TestUtils.java:377)   
> at org.apache.kafka.test.TestUtils.waitForCondition(TestUtils.java:367)   
> at 
> org.apache.kafka.streams.processor.internals.DefaultStateUpdaterTest.verifyRestoredActiveTasks(DefaultStateUpdaterTest.java:1715)
>   
> at 
> org.apache.kafka.streams.processor.internals.DefaultStateUpdaterTest.shouldRestoreSingleActiveStatefulTask(DefaultStateUpdaterTest.java:340)
>
> at java.lang.reflect.Method.invoke(Method.java:566)   
> at java.util.ArrayList.forEach(ArrayList.java:1541)   
> at java.util.ArrayList.forEach(ArrayList.java:1541)
> {code}
> {*}Root Cause{*}: A race between the following two threads:
> 1. stateUpdater.add(task) in the test thread [1]
> 2. runOnce() in the StateUpdaterThread loop [2]
> In the scenario below, the StateUpdaterThread hangs even though there are 
> updatingTasks.
> {*}Flaky scenario{*}: If stateUpdater.add(task) runs late, after the first 
> runOnce() loop is already under way, the second loop will hang. [3][4] The 
> allWorkDone() in the second loop of runOnce() will be true [5], even if 
> updatingTasks.isEmpty() is false. [6]
> Below is the flow of the flaky scenario:
>  # runOnce() loop 1: completedChangelogs() returns an empty set.
>  # runOnce() loop 1: allChangelogsCompleted() returns false, updatingTasks is 
> empty, allWorkDone() is false. {color:#de350b}Calls 
> tasksAndActionsCondition.await(){color}. (It will be notified by 
> stateUpdater.add(task) [1][7].)
>  # The test thread calls stateUpdater.add(task).
>  # runOnce() loop 1: allChangelogsCompleted() returns false again before 
> exiting the while loop. allWorkDone() is false because tasksAndActions is not 
> empty. [8]
>  # runOnce() loop 2: completedChangelogs() returns 1 topic partition.
>  # runOnce() loop 2: allChangelogsCompleted() returns true, allWorkDone() is 
> true, {color:#de350b}calls tasksAndActionsCondition.await() again{color} and 
> is never notified.
>  
> The happy path is: (stateUpdater.add(task) runs before the end of the first 
> runOnce() loop)
>  # runOnce() loop 1: completedChangelogs() returns an empty set.
>  # runOnce() loop 1: allChangelogsCompleted() returns false, updatingTasks is 
> not empty, allWorkDone() is false.
>  # runOnce() loop 2: completedChangelogs() returns 1 topic partition.
>  # runOnce() loop 2: allChangelogsCompleted() returns false, updatingTasks is 
> not empty, allWorkDone() is false.
>  # runOnce() loop 3: completedChangelogs() returns 2 topic partitions, 
> 

[jira] [Created] (KAFKA-17769) Fix flaky PlaintextConsumerSubscriptionTest.testSubscribeInvalidTopicCanUnsubscribe

2024-10-10 Thread Yu-Lin Chen (Jira)
Yu-Lin Chen created KAFKA-17769:
---

 Summary: Fix flaky 
PlaintextConsumerSubscriptionTest.testSubscribeInvalidTopicCanUnsubscribe
 Key: KAFKA-17769
 URL: https://issues.apache.org/jira/browse/KAFKA-17769
 Project: Kafka
  Issue Type: Bug
  Components: clients, consumer
Reporter: Yu-Lin Chen
Assignee: Yu-Lin Chen


4 flaky runs out of 110 trunk builds in the past 2 weeks. ([Report 
Link|https://ge.apache.org/scans/tests?search.rootProjectNames=kafka&search.startTimeMax=1728584869905&search.startTimeMin=172615680&search.tags=trunk&search.timeZoneId=Asia%2FTaipei&tests.container=kafka.api.PlaintextConsumerSubscriptionTest&tests.test=testSubscribeInvalidTopicCanUnsubscribe(String%2C%20String)%5B3%5D])

This issue can be reproduced in my local environment within 50 loops.
 
([Oct 4 2024 at 10:35:49 
CST|https://ge.apache.org/s/o4ir4xtitsu52/tests/task/:core:test/details/kafka.api.PlaintextConsumerSubscriptionTest/testSubscribeInvalidTopicCanUnsubscribe(String%2C%20String)%5B3%5D?top-execution=1]):
{code:java}
org.apache.kafka.common.KafkaException: Failed to close kafka consumer  

at 
org.apache.kafka.clients.consumer.internals.AsyncKafkaConsumer.close(AsyncKafkaConsumer.java:1249)
   
at 
org.apache.kafka.clients.consumer.internals.AsyncKafkaConsumer.close(AsyncKafkaConsumer.java:1204)
   
at 
org.apache.kafka.clients.consumer.KafkaConsumer.close(KafkaConsumer.java:1718)  
 
at 
kafka.api.IntegrationTestHarness.$anonfun$tearDown$3(IntegrationTestHarness.scala:249)
   
at 
kafka.api.IntegrationTestHarness.$anonfun$tearDown$3$adapted(IntegrationTestHarness.scala:249)
   
at scala.collection.IterableOnceOps.foreach(IterableOnce.scala:619) 
at scala.collection.IterableOnceOps.foreach$(IterableOnce.scala:617)
at scala.collection.AbstractIterable.foreach(Iterable.scala:935)
at kafka.api.IntegrationTestHarness.tearDown(IntegrationTestHarness.scala:249)  
at java.lang.reflect.Method.invoke(Method.java:566) 
at java.util.stream.ForEachOps$ForEachOp$OfRef.accept(ForEachOps.java:183)  
at java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:195)
at java.util.stream.ReferencePipeline$2$1.accept(ReferencePipeline.java:177)
at java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:195)
at java.util.stream.ForEachOps$ForEachOp$OfRef.accept(ForEachOps.java:183)  
at java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:195)
at java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:195)
at java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:195)
at java.util.stream.ForEachOps$ForEachOp$OfRef.accept(ForEachOps.java:183)  
at java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:195)
at 
java.util.Spliterators$ArraySpliterator.forEachRemaining(Spliterators.java:948) 
 
at java.util.stream.ReferencePipeline$Head.forEach(ReferencePipeline.java:658)  
at java.util.stream.ReferencePipeline$7$1.accept(ReferencePipeline.java:274)
at java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:195)
at java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:195)
at java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:195)
at 
java.util.Spliterators$ArraySpliterator.forEachRemaining(Spliterators.java:948) 
 
at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:484)
at java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:474) 
at 
java.util.stream.ForEachOps$ForEachOp.evaluateSequential(ForEachOps.java:150)   
 
at 
java.util.stream.ForEachOps$ForEachOp$OfRef.evaluateSequential(ForEachOps.java:173)
  
at java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234)
at java.util.stream.ReferencePipeline.forEach(ReferencePipeline.java:497)   
at java.util.stream.ReferencePipeline$7$1.accept(ReferencePipeline.java:274)
at java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:195)
at java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:195)
at java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:195)
at 
java.util.ArrayList$ArrayListSpliterator.forEachRemaining(ArrayList.java:1655)  
 
at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:484)
at java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:474) 
at 
java.util.stream.ForEachOps$ForEachOp.evaluateSequential(ForEachOps.java:150)   
 
at 
java.util.stream.ForEachOps$ForEachOp$OfRef.evaluateSequential(ForEachOps.java:173)
  
at java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234)
at java.util.stream.ReferencePipeline.forEach(ReferencePipeline.java:497)   
at java.util.stream.ReferencePipeline$7$1.accept(ReferencePipeline.java:274)
at 
java.util.Arra

[jira] [Created] (KAFKA-17697) Fix flaky DefaultStateUpdaterTest.shouldRestoreSingleActiveStatefulTask

2024-10-03 Thread Yu-Lin Chen (Jira)
Yu-Lin Chen created KAFKA-17697:
---

 Summary: Fix flaky 
DefaultStateUpdaterTest.shouldRestoreSingleActiveStatefulTask
 Key: KAFKA-17697
 URL: https://issues.apache.org/jira/browse/KAFKA-17697
 Project: Kafka
  Issue Type: Bug
Reporter: Yu-Lin Chen
Assignee: Yu-Lin Chen
 Attachments: 0001-reproduce-the-racing-issue.patch

28 flaky runs out of 253 trunk builds in the past 28 days. (github) ([Report 
Link|https://ge.apache.org/scans/tests?search.rootProjectNames=kafka&search.startTimeMax=1727973081200&search.startTimeMin=172555200&search.tags=trunk,github&search.timeZoneId=Asia%2FTaipei&tests.container=org.apache.kafka.streams.processor.internals.DefaultStateUpdaterTest&tests.test=shouldRestoreSingleActiveStatefulTask()])

The issue can be reproduced in my local env within 20 loops. It can also be 
reproduced with the attached patch: [^0001-reproduce-the-racing-issue.patch]

 ([Oct 2 2024 at 05:39:43 
CST|https://ge.apache.org/s/5gsvq5esvbouc/tests/task/:streams:test/details/org.apache.kafka.streams.processor.internals.DefaultStateUpdaterTest/shouldRestoreSingleActiveStatefulTask()?expanded-stacktrace=WyIwIl0&top-execution=1])
{code:java}
org.opentest4j.AssertionFailedError: Condition not met within timeout 3. 
Did not get all restored active task within the given timeout! ==> expected: 
<true> but was: <false>

at 
org.junit.jupiter.api.AssertionFailureBuilder.build(AssertionFailureBuilder.java:151)

at 
org.junit.jupiter.api.AssertionFailureBuilder.buildAndThrow(AssertionFailureBuilder.java:132)

at org.junit.jupiter.api.AssertTrue.failNotTrue(AssertTrue.java:63) 
at org.junit.jupiter.api.AssertTrue.assertTrue(AssertTrue.java:36)  
at org.junit.jupiter.api.Assertions.assertTrue(Assertions.java:214) 
at 
org.apache.kafka.test.TestUtils.lambda$waitForCondition$3(TestUtils.java:396)   
 
at 
org.apache.kafka.test.TestUtils.retryOnExceptionWithTimeout(TestUtils.java:444) 
 
at org.apache.kafka.test.TestUtils.waitForCondition(TestUtils.java:393) 
at org.apache.kafka.test.TestUtils.waitForCondition(TestUtils.java:377) 
at org.apache.kafka.test.TestUtils.waitForCondition(TestUtils.java:367) 
at 
org.apache.kafka.streams.processor.internals.DefaultStateUpdaterTest.verifyRestoredActiveTasks(DefaultStateUpdaterTest.java:1715)

at 
org.apache.kafka.streams.processor.internals.DefaultStateUpdaterTest.shouldRestoreSingleActiveStatefulTask(DefaultStateUpdaterTest.java:340)
 
at java.lang.reflect.Method.invoke(Method.java:566) 
at java.util.ArrayList.forEach(ArrayList.java:1541) 
at java.util.ArrayList.forEach(ArrayList.java:1541)
{code}
{*}Root Cause{*}: A race between the following two threads:
1. stateUpdater.add(task) in the test thread [1]
2. runOnce() in the StateUpdaterThread loop [2]

In the scenario below, the StateUpdaterThread hangs even though there are updatingTasks.

{*}Flaky scenario{*}: If stateUpdater.add(task) runs late, after the first 
runOnce() loop is already under way, the second loop will hang. [3][4] The 
allWorkDone() will always be true [5], even if updatingTasks.isEmpty() is 
false. [6]

Below is the flow of the flaky scenario:
 # runOnce() loop 1: completedChangelogs() returns an empty set.
 # runOnce() loop 1: allChangelogsCompleted() returns false, updatingTasks is 
empty, allWorkDone() is false. Calls tasksAndActionsCondition.await(). (It will 
be notified by stateUpdater.add(task) [1][7].)
 # runOnce() loop 1: allChangelogsCompleted() returns false again before 
exiting the while loop. [8]
 # runOnce() loop 2: completedChangelogs() returns 1 topic partition.
 # runOnce() loop 2: allChangelogsCompleted() returns true, calls 
tasksAndActionsCondition.await() again and is never notified.

 
The happy path is: (stateUpdater.add(task) runs before the end of the first 
runOnce() loop)
 # runOnce() loop 1: completedChangelogs() returns an empty set.
 # runOnce() loop 1: allChangelogsCompleted() returns false, updatingTasks is 
not empty, allWorkDone() is false.
 # runOnce() loop 2: completedChangelogs() returns 1 topic partition.
 # runOnce() loop 2: allChangelogsCompleted() returns false, updatingTasks is 
not empty, allWorkDone() is false.
 # runOnce() loop 3: completedChangelogs() returns 2 topic partitions, +moves 
the task to restoredActiveTasks+ [10]
 # runOnce() loop 3: allChangelogsCompleted() returns true (doesn't matter).
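
The flows above depend on whether a thread goes to sleep on a condition while work is still outstanding. As a general illustration (not the DefaultStateUpdater code and not the fix from any PR), below is a minimal sketch of the guarded-wait pattern: the waiting thread re-checks its predicate in a loop under the same lock the producer holds when it adds work, so the result does not depend on which side runs first. GuardedAwaitSketch, add, take and demo are hypothetical names.

```java
import java.util.ArrayDeque;
import java.util.Queue;
import java.util.concurrent.atomic.AtomicReference;
import java.util.concurrent.locks.Condition;
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical illustration of a guarded wait; not taken from DefaultStateUpdater.
final class GuardedAwaitSketch {
    private final ReentrantLock lock = new ReentrantLock();
    private final Condition workAvailable = lock.newCondition();
    private final Queue<String> pending = new ArrayDeque<>();

    // Producer side: mutate the state and signal while holding the lock.
    void add(String task) {
        lock.lock();
        try {
            pending.add(task);
            workAvailable.signalAll();
        } finally {
            lock.unlock();
        }
    }

    // Consumer side: only sleep while the predicate says there is really nothing to do.
    String take() throws InterruptedException {
        lock.lock();
        try {
            while (pending.isEmpty()) {
                workAvailable.await();
            }
            return pending.remove();
        } finally {
            lock.unlock();
        }
    }

    // Runs one exchange; addFirst chooses whether the task is added before the consumer starts.
    static String demo(boolean addFirst) {
        GuardedAwaitSketch queue = new GuardedAwaitSketch();
        AtomicReference<String> result = new AtomicReference<>();
        Thread consumer = new Thread(() -> {
            try {
                result.set(queue.take());
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }, "guarded-consumer");
        consumer.setDaemon(true);
        if (addFirst) {
            queue.add("task-1");
            consumer.start();
        } else {
            consumer.start();
            queue.add("task-1");
        }
        try {
            consumer.join(10_000);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return result.get();
    }

    public static void main(String[] args) {
        System.out.println(demo(true) + " / " + demo(false));
    }
}
```

In this sketch the predicate covers every kind of outstanding work. The hang described above corresponds to a predicate that reports "all work done" while some work is still pending.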

 

[1] 
[https://github.com/apache/kafka/blob/0edf5dbd204df9eb62bfea1b56993e95737df5a3/streams/src/test/java/org/apache/kafka/streams/processor/internals/DefaultStateUpdaterTest.java#L338]
[2] 
[https://github.com/apache/kafka/blob/0edf5dbd204df9eb62bfea1b56993e95737df5a3/streams/src/main/java/org/apache/kafka/streams/processor/internals/DefaultStateUpdater.java#L177-L198]
[3] 
[https://github.com/apache/kafka/blob/0edf5dbd204df9eb62bfea1b56993e95737df5a3/streams/src/main/java/org/apache/kafka/streams/processor/internals/DefaultStateUpdater.java#L191]
[4] 
[https://github.com/apache/kafka/blob/0edf5dbd204df9eb62bfea1b56993e95

[jira] [Resolved] (KAFKA-17515) Fix flaky RestoreIntegrationTest.shouldInvokeUserDefinedGlobalStateRestoreListener

2024-09-25 Thread Yu-Lin Chen (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-17515?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yu-Lin Chen resolved KAFKA-17515.
-
Resolution: Fixed

Since the Jenkins build on trunk was disabled today, I am closing the second 
PR ([pull/17272|https://github.com/apache/kafka/pull/]), which increased the 
timeout. I will keep monitoring the build results.

> Fix flaky 
> RestoreIntegrationTest.shouldInvokeUserDefinedGlobalStateRestoreListener
> --
>
> Key: KAFKA-17515
> URL: https://issues.apache.org/jira/browse/KAFKA-17515
> Project: Kafka
>  Issue Type: Bug
>  Components: streams, unit tests
>Reporter: Chia-Ping Tsai
>Assignee: Yu-Lin Chen
>Priority: Major
> Fix For: 4.0.0
>
> Attachments: Reproduced screenshoot in my env (Loop 7).png
>
>
> {code:java}
> Stacktrace
> java.nio.file.DirectoryNotEmptyException: 
> /tmp/shouldInvokeUserDefinedGlobalStateRestoreListenerH0u0n9foRY_peZu4FqeGHQ1045955704739924-ks1/shouldInvokeUserDefinedGlobalStateRestoreListenerH0u0n9foRY_peZu4FqeGHQ/0_0
>   at 
> java.base/sun.nio.fs.UnixFileSystemProvider.implDelete(UnixFileSystemProvider.java:289)
>   at 
> java.base/sun.nio.fs.AbstractFileSystemProvider.deleteIfExists(AbstractFileSystemProvider.java:109)
>   at java.base/java.nio.file.Files.deleteIfExists(Files.java:1191)
>   at 
> org.apache.kafka.common.utils.Utils$1.postVisitDirectory(Utils.java:898)
>   at 
> org.apache.kafka.common.utils.Utils$1.postVisitDirectory(Utils.java:870)
>   at java.base/java.nio.file.Files.walkFileTree(Files.java:2803)
>   at java.base/java.nio.file.Files.walkFileTree(Files.java:2857)
>   at org.apache.kafka.common.utils.Utils.delete(Utils.java:870)
>   at 
> org.apache.kafka.streams.integration.utils.IntegrationTestUtils.purgeLocalStreamsState(IntegrationTestUtils.java:266)
>   at 
> org.apache.kafka.streams.integration.utils.IntegrationTestUtils.purgeLocalStreamsState(IntegrationTestUtils.java:278)
>   at 
> org.apache.kafka.streams.integration.RestoreIntegrationTest.shouldInvokeUserDefinedGlobalStateRestoreListener(RestoreIntegrationTest.java:583)
>   at java.base/java.lang.reflect.Method.invoke(Method.java:580)
>   at java.base/java.util.ArrayList.forEach(ArrayList.java:1596)
>   at java.base/java.util.ArrayList.forEach(ArrayList.java:1596)
> {code}





[jira] [Created] (KAFKA-17646) Fix flaky KafkaStreamsTest.testStateGlobalThreadClose

2024-09-28 Thread Yu-Lin Chen (Jira)
Yu-Lin Chen created KAFKA-17646:
---

 Summary: Fix flaky KafkaStreamsTest.testStateGlobalThreadClose
 Key: KAFKA-17646
 URL: https://issues.apache.org/jira/browse/KAFKA-17646
 Project: Kafka
  Issue Type: Bug
  Components: streams, unit tests
Reporter: Yu-Lin Chen
Assignee: Yu-Lin Chen


22 flaky builds out of 584 in the past 28 days. (13 from jenkins, 9 from 
github) ([Report 
Link|https://ge.apache.org/scans/tests?search.rootProjectNames=kafka&search.startTimeMax=1727517159423&search.startTimeMin=172503360&search.tags=trunk&search.timeZoneId=Asia%2FTaipei&tests.container=org.apache.kafka.streams.KafkaStreamsTest&tests.test=testStateGlobalThreadClose()])

Two types of error messages:
1. Unexpected state transition from ERROR to PENDING_SHUTDOWN ([Sep 24 2024 at 
15:45:33 
CST|https://ge.apache.org/s/ewblfqjcre6gu/tests/task/:streams:test/details/org.apache.kafka.streams.KafkaStreamsTest/testStateGlobalThreadClose()?expanded-stacktrace=WyIwIl0&top-execution=2])
{code:java}
java.lang.IllegalStateException: Stream-client test-client: Unexpected state 
transition from ERROR to PENDING_SHUTDOWN  
at org.apache.kafka.streams.KafkaStreams.setState(KafkaStreams.java:343)
at org.apache.kafka.streams.KafkaStreams.close(KafkaStreams.java:1566)  
at org.apache.kafka.streams.KafkaStreams.close(KafkaStreams.java:1459)  
at 
org.apache.kafka.streams.KafkaStreamsTest.testStateGlobalThreadClose(KafkaStreamsTest.java:550)
  
at java.lang.reflect.Method.invoke(Method.java:569) 
at java.util.ArrayList.forEach(ArrayList.java:1511) 
at java.util.ArrayList.forEach(ArrayList.java:1511)
{code}
2. org.opentest4j.AssertionFailedError: Condition not met within timeout 15000. 
Thread never stopped. ==> expected: <true> but was: <false> ([Sep 26 2024 at 
09:36:22 
CST|https://ge.apache.org/s/hetqkmasks5ve/tests/task/:streams:test/details/org.apache.kafka.streams.KafkaStreamsTest/testStateGlobalThreadClose()?expanded-stacktrace=WyIwIl0&top-execution=1])
{code:java}
at 
org.junit.jupiter.api.AssertionFailureBuilder.build(AssertionFailureBuilder.java:151)

at 
org.junit.jupiter.api.AssertionFailureBuilder.buildAndThrow(AssertionFailureBuilder.java:132)

at org.junit.jupiter.api.AssertTrue.failNotTrue(AssertTrue.java:63) 
at org.junit.jupiter.api.AssertTrue.assertTrue(AssertTrue.java:36)  
at org.junit.jupiter.api.Assertions.assertTrue(Assertions.java:214) 
at 
org.apache.kafka.test.TestUtils.lambda$waitForCondition$3(TestUtils.java:396)   
 
at 
org.apache.kafka.test.TestUtils.retryOnExceptionWithTimeout(TestUtils.java:444) 
 
at org.apache.kafka.test.TestUtils.waitForCondition(TestUtils.java:393) 
at org.apache.kafka.test.TestUtils.waitForCondition(TestUtils.java:377) 
at org.apache.kafka.test.TestUtils.waitForCondition(TestUtils.java:350) 
at 
org.apache.kafka.streams.KafkaStreamsTest.testStateGlobalThreadClose(KafkaStreamsTest.java:546)
  
at java.lang.reflect.Method.invoke(Method.java:569) 
at java.util.ArrayList.forEach(ArrayList.java:1511) 
at java.util.ArrayList.forEach(ArrayList.java:1511)
{code}
 





[jira] [Reopened] (KAFKA-17515) Fix flaky RestoreIntegrationTest.shouldInvokeUserDefinedGlobalStateRestoreListener

2024-09-25 Thread Yu-Lin Chen (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-17515?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yu-Lin Chen reopened KAFKA-17515:
-

> Fix flaky 
> RestoreIntegrationTest.shouldInvokeUserDefinedGlobalStateRestoreListener
> --
>
> Key: KAFKA-17515
> URL: https://issues.apache.org/jira/browse/KAFKA-17515
> Project: Kafka
>  Issue Type: Bug
>  Components: streams, unit tests
>Reporter: Chia-Ping Tsai
>Assignee: Yu-Lin Chen
>Priority: Major
> Fix For: 4.0.0
>
> Attachments: Reproduced screenshoot in my env (Loop 7).png
>
>
> {code:java}
> Stacktrace
> java.nio.file.DirectoryNotEmptyException: 
> /tmp/shouldInvokeUserDefinedGlobalStateRestoreListenerH0u0n9foRY_peZu4FqeGHQ1045955704739924-ks1/shouldInvokeUserDefinedGlobalStateRestoreListenerH0u0n9foRY_peZu4FqeGHQ/0_0
>   at 
> java.base/sun.nio.fs.UnixFileSystemProvider.implDelete(UnixFileSystemProvider.java:289)
>   at 
> java.base/sun.nio.fs.AbstractFileSystemProvider.deleteIfExists(AbstractFileSystemProvider.java:109)
>   at java.base/java.nio.file.Files.deleteIfExists(Files.java:1191)
>   at 
> org.apache.kafka.common.utils.Utils$1.postVisitDirectory(Utils.java:898)
>   at 
> org.apache.kafka.common.utils.Utils$1.postVisitDirectory(Utils.java:870)
>   at java.base/java.nio.file.Files.walkFileTree(Files.java:2803)
>   at java.base/java.nio.file.Files.walkFileTree(Files.java:2857)
>   at org.apache.kafka.common.utils.Utils.delete(Utils.java:870)
>   at 
> org.apache.kafka.streams.integration.utils.IntegrationTestUtils.purgeLocalStreamsState(IntegrationTestUtils.java:266)
>   at 
> org.apache.kafka.streams.integration.utils.IntegrationTestUtils.purgeLocalStreamsState(IntegrationTestUtils.java:278)
>   at 
> org.apache.kafka.streams.integration.RestoreIntegrationTest.shouldInvokeUserDefinedGlobalStateRestoreListener(RestoreIntegrationTest.java:583)
>   at java.base/java.lang.reflect.Method.invoke(Method.java:580)
>   at java.base/java.util.ArrayList.forEach(ArrayList.java:1596)
>   at java.base/java.util.ArrayList.forEach(ArrayList.java:1596)
> {code}





[jira] [Reopened] (KAFKA-18036) TransactionsWithTieredStoreTest testReadCommittedConsumerShouldNotSeeUndecidedData is flaky

2025-01-06 Thread Yu-Lin Chen (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-18036?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yu-Lin Chen reopened KAFKA-18036:
-

Reopen this Jira as the issue still exists in trunk and can be reproduced 
locally within 20 loops. I'm working on troubleshooting it.

> TransactionsWithTieredStoreTest 
> testReadCommittedConsumerShouldNotSeeUndecidedData is flaky
> ---
>
> Key: KAFKA-18036
> URL: https://issues.apache.org/jira/browse/KAFKA-18036
> Project: Kafka
>  Issue Type: Test
>Reporter: David Arthur
>Assignee: Chia-Ping Tsai
>Priority: Major
>  Labels: flaky-test
>
> https://ge.apache.org/scans/tests?search.names=CI%20workflow%2CGit%20repository&search.rootProjectNames=kafka&search.tags=github%2Ctrunk&search.tasks=test&search.timeZoneId=America%2FNew_York&search.values=CI%2Chttps:%2F%2Fgithub.com%2Fapache%2Fkafka&tests.container=org.apache.kafka.tiered.storage.integration.TransactionsWithTieredStoreTest&tests.sortField=FLAKY&tests.test=testReadCommittedConsumerShouldNotSeeUndecidedData(String%2C%20String)%5B2%5D





[jira] [Created] (KAFKA-18770) Fix flaky initializationError in ReplicationQuotasTest/RequestQuotaTest (thread leak)

2025-02-10 Thread Yu-Lin Chen (Jira)
Yu-Lin Chen created KAFKA-18770:
---

 Summary: Fix flaky initializationError in 
ReplicationQuotasTest/RequestQuotaTest (thread leak)
 Key: KAFKA-18770
 URL: https://issues.apache.org/jira/browse/KAFKA-18770
 Project: Kafka
  Issue Type: Bug
  Components: core
Reporter: Yu-Lin Chen


An unexpected thread was detected in recent CI runs.
 * ReplicationQuotasTest#initializationError
 * RequestQuotaTest#initializationError

{code:java}
org.opentest4j.AssertionFailedError: Found 1 unexpected threads during 
@BeforeAll: executor-ShareFetch ==> expected:  but was:  {code}
Failed CI:
 * [https://github.com/apache/kafka/actions/runs/13203924058]
 * [https://github.com/apache/kafka/actions/runs/13237077533/job/36944376672]
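
For context, a minimal sketch of how a check for unexpected threads can be written with the standard Thread.getAllStackTraces() API. This is not the actual check used by the Kafka test framework. The class ThreadLeakCheckSketch, its methods and the "leak-sketch-executor-" prefix are hypothetical.

```java
import java.util.Set;
import java.util.concurrent.CountDownLatch;
import java.util.stream.Collectors;

// Hypothetical illustration; not the check used by the Kafka test extensions.
final class ThreadLeakCheckSketch {

    // Names of all live threads whose name starts with the given prefix.
    static Set<String> liveThreadNamesWithPrefix(String prefix) {
        return Thread.getAllStackTraces().keySet().stream()
                .filter(Thread::isAlive)
                .map(Thread::getName)
                .filter(name -> name.startsWith(prefix))
                .collect(Collectors.toSet());
    }

    // Returns {leak reported while the thread is running, leak reported after it was joined}.
    static boolean[] demo() {
        String prefix = "leak-sketch-executor-";
        CountDownLatch stop = new CountDownLatch(1);
        Thread lingering = new Thread(() -> {
            try {
                stop.await(); // simulates a thread that was not shut down yet
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }, prefix + "0");
        lingering.setDaemon(true);
        lingering.start();

        boolean whileRunning = !liveThreadNamesWithPrefix(prefix).isEmpty();

        stop.countDown();
        try {
            lingering.join(10_000);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        boolean afterJoin = !liveThreadNamesWithPrefix(prefix).isEmpty();
        return new boolean[] {whileRunning, afterJoin};
    }

    public static void main(String[] args) {
        boolean[] r = demo();
        System.out.println("while running: " + r[0] + ", after join: " + r[1]);
    }
}
```

A check like this reports a thread from an earlier test if that thread is still alive when the check runs.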

 





[jira] [Created] (KAFKA-18469) AsyncConsumer fails to retry ListOffsetRequest on ReplicaNotAvailable error without metadata update

2025-01-10 Thread Yu-Lin Chen (Jira)
Yu-Lin Chen created KAFKA-18469:
---

 Summary: AsyncConsumer fails to retry ListOffsetRequest on 
ReplicaNotAvailable error without metadata update
 Key: KAFKA-18469
 URL: https://issues.apache.org/jira/browse/KAFKA-18469
 Project: Kafka
  Issue Type: Bug
  Components: clients, consumer
Reporter: Yu-Lin Chen
Assignee: Yu-Lin Chen


In AsyncConsumer, the ListOffsetRequest is retried only after a metadata 
update [1]. However, not every retriable error is followed by a metadata 
update; one example is the ReplicaNotAvailable error from remote storage. Such 
an error causes Consumer#offsetsForTimes to fail after the API timeout (60 
seconds). 

This issue does not happen in ClassicConsumer [2]. We should keep the behavior 
aligned.

 

This issue is the root cause of the flaky test KAFKA-18036, where 
consumer#offsetsForTimes is called before the remote metadata cache is 
initialized.

 

[1] 
https://github.com/apache/kafka/blob/5684fc7a2ee1a4f29cb6d69d713233ed3c297882/clients/src/main/java/org/apache/kafka/clients/consumer/internals/OffsetsRequestManager.java#L529-L534

[2] 
[https://github.com/apache/kafka/blob/5684fc7a2ee1a4f29cb6d69d713233ed3c297882/clients/src/main/java/org/apache/kafka/clients/consumer/internals/OffsetFetcher.java#L144-L153]
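
To illustrate the behavior difference in general terms, below is a minimal sketch of retrying a request on a retriable error with a fixed backoff until a deadline, without waiting for any other event such as a metadata update. It is not the OffsetsRequestManager or OffsetFetcher code. RetryOnRetriableSketch, RetriableError, callWithRetry and the backoff values are hypothetical.

```java
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Supplier;

// Hypothetical illustration; not the consumer's actual retry logic.
final class RetryOnRetriableSketch {

    // Stand-in for a retriable error such as ReplicaNotAvailable.
    static final class RetriableError extends RuntimeException {
        RetriableError(String message) {
            super(message);
        }
    }

    // Retries the request after each retriable error until it succeeds or the timeout is reached.
    static <T> T callWithRetry(Supplier<T> request, long timeoutMs, long backoffMs) {
        long deadlineNanos = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(timeoutMs);
        while (true) {
            try {
                return request.get();
            } catch (RetriableError e) {
                long remainingNanos = deadlineNanos - System.nanoTime();
                if (remainingNanos <= TimeUnit.MILLISECONDS.toNanos(backoffMs)) {
                    throw e; // not enough time left for another attempt
                }
                try {
                    Thread.sleep(backoffMs);
                } catch (InterruptedException interrupted) {
                    Thread.currentThread().interrupt();
                    throw e;
                }
            }
        }
    }

    // Simulated request that fails twice and then succeeds; returns {result, attempts}.
    static long[] demo() {
        AtomicInteger attempts = new AtomicInteger();
        Long offset = RetryOnRetriableSketch.<Long>callWithRetry(() -> {
            if (attempts.incrementAndGet() < 3) {
                throw new RetriableError("replica not available");
            }
            return 42L;
        }, 5_000, 10);
        return new long[] {offset, attempts.get()};
    }

    public static void main(String[] args) {
        long[] r = demo();
        System.out.println("result=" + r[0] + ", attempts=" + r[1]);
    }
}
```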





[jira] [Resolved] (KAFKA-18298) Fix flaky PlaintextAdminIntegrationTest#testConsumerGroupsDeprecatedConsumerGroupState

2025-01-12 Thread Yu-Lin Chen (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-18298?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yu-Lin Chen resolved KAFKA-18298.
-
Resolution: Fixed

Locally validated that the issue was fixed after this PR. (No error in 200 
loops.)
 * 
[https://github.com/apache/kafka/commit/48b522fe86984db32556e48b799e8797ae2236ba]

> Fix flaky 
> PlaintextAdminIntegrationTest#testConsumerGroupsDeprecatedConsumerGroupState
> --
>
> Key: KAFKA-18298
> URL: https://issues.apache.org/jira/browse/KAFKA-18298
> Project: Kafka
>  Issue Type: Bug
>  Components: clients, consumer
>Reporter: Chia-Ping Tsai
>Assignee: Chia-Ping Tsai
>Priority: Blocker
>  Labels: flaky-test, integration-test, kip-848-client-support
> Fix For: 4.0.0
>
>
> org.opentest4j.AssertionFailedError: Expected the offset for partition 0 to 
> eventually become 1. at 
> app//org.junit.jupiter.api.AssertionUtils.fail(AssertionUtils.java:38) at 
> app//org.junit.jupiter.api.Assertions.fail(Assertions.java:138) at 
> app//kafka.api.PlaintextAdminIntegrationTest.testConsumerGroupsDeprecatedConsumerGroupState(PlaintextAdminIntegrationTest.scala:2360)
>  at java.base@23.0.1/java.lang.reflect.Method.invoke(Method.java:580)





[jira] [Resolved] (KAFKA-18036) TransactionsWithTieredStoreTest testReadCommittedConsumerShouldNotSeeUndecidedData is flaky

2025-01-13 Thread Yu-Lin Chen (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-18036?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yu-Lin Chen resolved KAFKA-18036.
-
Resolution: Fixed

The fix for KAFKA-18469 has been merged, so this issue should be resolved.

> TransactionsWithTieredStoreTest 
> testReadCommittedConsumerShouldNotSeeUndecidedData is flaky
> ---
>
> Key: KAFKA-18036
> URL: https://issues.apache.org/jira/browse/KAFKA-18036
> Project: Kafka
>  Issue Type: Test
>Reporter: David Arthur
>Assignee: Yu-Lin Chen
>Priority: Major
>  Labels: flaky-test
> Attachments: 0001-Reproduce-Kafka-18036.patch
>
>
> https://ge.apache.org/scans/tests?search.names=CI%20workflow%2CGit%20repository&search.rootProjectNames=kafka&search.tags=github%2Ctrunk&search.tasks=test&search.timeZoneId=America%2FNew_York&search.values=CI%2Chttps:%2F%2Fgithub.com%2Fapache%2Fkafka&tests.container=org.apache.kafka.tiered.storage.integration.TransactionsWithTieredStoreTest&tests.sortField=FLAKY&tests.test=testReadCommittedConsumerShouldNotSeeUndecidedData(String%2C%20String)%5B2%5D





[jira] [Reopened] (KAFKA-18298) Fix flaky PlaintextAdminIntegrationTest#testConsumerGroupsDeprecatedConsumerGroupState

2025-01-13 Thread Yu-Lin Chen (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-18298?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yu-Lin Chen reopened KAFKA-18298:
-

Reopening this Jira because other flaky failures were found in CI.
 # org.opentest4j.AssertionFailedError: expected: <2> but was: <3> 
([Report|https://ge.apache.org/s/ndoj6s2stb446/tests/task/:core:test/details/kafka.api.PlaintextAdminIntegrationTest/testConsumerGroupsDeprecatedConsumerGroupState(String%2C%20String)%5B1%5D?expanded-stacktrace=WyIxIl0&top-execution=2])
 # org.opentest4j.AssertionFailedError: expected:  but was:  
([Report|https://ge.apache.org/s/kh3jze2tc5qeu/tests/task/:core:test/details/kafka.api.PlaintextAdminIntegrationTest/testConsumerGroupsDeprecatedConsumerGroupState(String%2C%20String)%5B1%5D?top-execution=1])

The flakiness can't be reproduced in my local environment, but it can be 
simulated with the attached patch. (Uncomment the 1-second sleep.)

The root cause is that the member rejoins the group after it was removed from 
the group. The same issue occurs in KAFKA-18297 too.

> Fix flaky 
> PlaintextAdminIntegrationTest#testConsumerGroupsDeprecatedConsumerGroupState
> --
>
> Key: KAFKA-18298
> URL: https://issues.apache.org/jira/browse/KAFKA-18298
> Project: Kafka
>  Issue Type: Bug
>  Components: clients, consumer
>Reporter: Chia-Ping Tsai
>Assignee: Chia-Ping Tsai
>Priority: Blocker
>  Labels: flaky-test, integration-test, kip-848-client-support
> Fix For: 4.0.0
>
>
> org.opentest4j.AssertionFailedError: Expected the offset for partition 0 to 
> eventually become 1. at 
> app//org.junit.jupiter.api.AssertionUtils.fail(AssertionUtils.java:38) at 
> app//org.junit.jupiter.api.Assertions.fail(Assertions.java:138) at 
> app//kafka.api.PlaintextAdminIntegrationTest.testConsumerGroupsDeprecatedConsumerGroupState(PlaintextAdminIntegrationTest.scala:2360)
>  at java.base@23.0.1/java.lang.reflect.Method.invoke(Method.java:580)





[jira] [Created] (KAFKA-18501) Fix flaky ClientQuotasRequestTest.testAlterClientQuotasInvalidRequests

2025-01-13 Thread Yu-Lin Chen (Jira)
Yu-Lin Chen created KAFKA-18501:
---

 Summary: Fix flaky 
ClientQuotasRequestTest.testAlterClientQuotasInvalidRequests
 Key: KAFKA-18501
 URL: https://issues.apache.org/jira/browse/KAFKA-18501
 Project: Kafka
  Issue Type: Bug
  Components: unit tests
Reporter: Yu-Lin Chen


Thread leak detected in this PR:
 * 
https://github.com/apache/kafka/actions/runs/12753159081/job/35544671788#step:11:1451

 
{code:java}
org.opentest4j.AssertionFailedError: Thread leak detected: 
controller-0-ThrottledChannelReaper-Produce ==> expected:  but was: 

at 
app//org.apache.kafka.common.test.api.ClusterTestExtensions.afterEach(ClusterTestExtensions.java:158)
at 
java.base@23.0.1/java.util.stream.ForEachOps$ForEachOp$OfRef.accept(ForEachOps.java:184)
at 
java.base@23.0.1/java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:215)
at 
java.base@23.0.1/java.util.stream.ReferencePipeline$2$1.accept(ReferencePipeline.java:197)
at 
java.base@23.0.1/java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:215)
at 
java.base@23.0.1/java.util.ArrayList$ArrayListSpliterator.forEachRemaining(ArrayList.java:1709)
at 
java.base@23.0.1/java.util.stream.ReferencePipeline$Head.forEach(ReferencePipeline.java:807)
at 
java.base@23.0.1/java.util.stream.ReferencePipeline$7$1FlatMap.accept(ReferencePipeline.java:294)
at 
java.base@23.0.1/java.util.ArrayList$ArrayListSpliterator.forEachRemaining(ArrayList.java:1709)
at 
java.base@23.0.1/java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:570)
at 
java.base@23.0.1/java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:560)
at 
java.base@23.0.1/java.util.stream.ForEachOps$ForEachOp.evaluateSequential(ForEachOps.java:151)
at 
java.base@23.0.1/java.util.stream.ForEachOps$ForEachOp$OfRef.evaluateSequential(ForEachOps.java:174)
at 
java.base@23.0.1/java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:265)
at 
java.base@23.0.1/java.util.stream.ReferencePipeline.forEach(ReferencePipeline.java:636)
at java.base@23.0.1/java.util.ArrayList.forEach(ArrayList.java:1597)
at java.base@23.0.1/java.util.ArrayList.forEach(ArrayList.java:1597) 
{code}
 

 

 


