[ 
https://issues.apache.org/jira/browse/KAFKA-14533?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17679729#comment-17679729
 ] 

A. Sophie Blee-Goldman commented on KAFKA-14533:
------------------------------------------------

Not sure why this is happening, but the cause seems to be that the listOffsets 
request we make during the rebalance is failing:
{code:java}
[2023-01-20 23:34:28,256] WARN The listOffsets request failed. (org.apache.kafka.streams.processor.internals.ClientUtils:154)
java.util.concurrent.ExecutionException: org.apache.kafka.common.errors.UnknownTopicOrPartitionException: This server does not host this topic-partition.
        at java.base/java.util.concurrent.CompletableFuture.reportGet(CompletableFuture.java:395)
        at java.base/java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1999)
        at org.apache.kafka.common.internals.KafkaFutureImpl.get(KafkaFutureImpl.java:165)
        at org.apache.kafka.streams.processor.internals.ClientUtils.getEndOffsets(ClientUtils.java:152)
        at org.apache.kafka.streams.processor.internals.StreamsPartitionAssignor.populateClientStatesMap(StreamsPartitionAssignor.java:690)
 {code}
I see that at the beginning of the logs in every failed run so far and in none 
of the passing ones. Unfortunately I haven't been able to reproduce it locally, 
so I have to go by the Jenkins build logs, which truncate everything in the 
middle and don't show what happens next. But presumably this call is failing 
repeatedly, or potentially other calls are failing as well; whatever is wrong 
keeps happening for 10 minutes until the global timeout is triggered.
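
As a side note on reading the trace above: the frames show KafkaFutureImpl.get 
delegating to CompletableFuture.get, which is why the broker error arrives 
wrapped in an ExecutionException. Below is a minimal sketch of that wrapping 
using only the JDK. It is not the Kafka code itself; the names 
ListOffsetsFailureSketch, UnknownTopicOrPartitionStandIn and causeOfFailure are 
hypothetical and exist only for illustration.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;

public class ListOffsetsFailureSketch {

    // Hypothetical stand-in for
    // org.apache.kafka.common.errors.UnknownTopicOrPartitionException,
    // so the sketch compiles without the Kafka client on the classpath.
    static class UnknownTopicOrPartitionStandIn extends RuntimeException {
        UnknownTopicOrPartitionStandIn(String message) {
            super(message);
        }
    }

    // Hypothetical helper: blocks on the future and returns the underlying
    // cause of a failure, or null if the future completed normally.
    static Throwable causeOfFailure(CompletableFuture<?> future)
            throws InterruptedException {
        try {
            future.get();
            return null;
        } catch (ExecutionException e) {
            // get() wraps the original exception; the real error is the cause.
            return e.getCause();
        }
    }

    public static void main(String[] args) throws InterruptedException {
        // Simulate a request whose result future was failed with a
        // "topic-partition not hosted here" style error.
        CompletableFuture<Long> endOffset = new CompletableFuture<>();
        endOffset.completeExceptionally(new UnknownTopicOrPartitionStandIn(
                "This server does not host this topic-partition."));

        Throwable cause = causeOfFailure(endOffset);
        System.out.println(cause.getClass().getSimpleName() + ": " + cause.getMessage());
    }
}
```

So the interesting part of the log line is the cause, not the 
ExecutionException wrapper.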

I also noticed that it is almost always the stateUpdaterEnabled = true builds 
that fail, a parameter that was added just a few weeks before this ticket was 
filed. I doubt it's actually the state updater causing this, but I'm going to 
disable the `false` parameter temporarily to see whether it's actually the 
`true` run that's broken or whether it has to do with the parametrization. I'm 
also doing this because I need to get a clean build for the 3.4 release and 
this test is failing at least once in almost every single build :/

> Flaky Test SmokeTestDriverIntegrationTest.shouldWorkWithRebalance
> -----------------------------------------------------------------
>
>                 Key: KAFKA-14533
>                 URL: https://issues.apache.org/jira/browse/KAFKA-14533
>             Project: Kafka
>          Issue Type: Test
>          Components: streams, unit tests
>            Reporter: Greg Harris
>            Priority: Major
>              Labels: flaky-test
>
> The SmokeTestDriverIntegrationTest appears to be flaky, failing in recent 
> runs:
> ```
>     https://ci-builds.apache.org/blue/organizations/jenkins/Kafka%2Fkafka/detail/trunk/1444/tests/
>         java.util.concurrent.TimeoutException: shouldWorkWithRebalance(boolean) timed out after 600 seconds
>     https://ci-builds.apache.org/blue/organizations/jenkins/Kafka%2Fkafka/detail/trunk/1443/tests/
>         java.util.concurrent.TimeoutException: shouldWorkWithRebalance(boolean) timed out after 600 seconds
>     https://ci-builds.apache.org/blue/organizations/jenkins/Kafka%2Fkafka/detail/trunk/1441/tests/
>         java.util.concurrent.TimeoutException: shouldWorkWithRebalance(boolean) timed out after 600 seconds
>     https://ci-builds.apache.org/blue/organizations/jenkins/Kafka%2Fkafka/detail/trunk/1440/tests/
>         java.util.concurrent.TimeoutException: shouldWorkWithRebalance(boolean) timed out after 600 seconds
>     https://ci-builds.apache.org/blue/organizations/jenkins/Kafka%2Fkafka/detail/trunk/1438/tests/
>         java.util.concurrent.TimeoutException: shouldWorkWithRebalance(boolean) timed out after 600 seconds
>     https://ci-builds.apache.org/blue/organizations/jenkins/Kafka%2Fkafka/detail/trunk/1434/tests/
>         java.util.concurrent.TimeoutException: shouldWorkWithRebalance(boolean) timed out after 600 seconds
> ```
> The stacktrace appears to be:
> ```
> java.util.concurrent.TimeoutException: shouldWorkWithRebalance(boolean) timed out after 600 seconds
>  at org.junit.jupiter.engine.extension.TimeoutExceptionFactory.create(TimeoutExceptionFactory.java:29)
>  at org.junit.jupiter.engine.extension.SameThreadTimeoutInvocation.proceed(SameThreadTimeoutInvocation.java:58)
>  at org.junit.jupiter.engine.extension.TimeoutExtension.intercept(TimeoutExtension.java:156)
> ...
>  Suppressed: java.lang.InterruptedException: sleep interrupted
>  at java.lang.Thread.sleep(Native Method)
>  at org.apache.kafka.streams.integration.SmokeTestDriverIntegrationTest.shouldWorkWithRebalance(SmokeTestDriverIntegrationTest.java:151)
>  at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>  at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>  at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>  at java.lang.reflect.Method.invoke(Method.java:498)
>  at org.junit.platform.commons.util.ReflectionUtils.invokeMethod(ReflectionUtils.java:727)
>  at org.junit.jupiter.engine.execution.MethodInvocation.proceed(MethodInvocation.java:60)
>  at org.junit.jupiter.engine.execution.InvocationInterceptorChain$ValidatingInvocation.proceed(InvocationInterceptorChain.java:131)
>  at org.junit.jupiter.engine.extension.SameThreadTimeoutInvocation.proceed(SameThreadTimeoutInvocation.java:45)
>  ... 134 more
> ```
> The test appears to be timing out while waiting for the SmokeTestClient to 
> complete its asynchronous close, taking significantly longer (600s instead 
> of 60s) than a typical local test execution.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
