[ 
https://issues.apache.org/jira/browse/KAFKA-18981?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17937328#comment-17937328
 ] 

PoAn Yang commented on KAFKA-18981:
-----------------------------------

The root cause of this flaky test: if no heartbeat for broker 1 arrives 
promptly and it is fenced after the topic is created, broker 1 cannot be in 
the ISR. The session timeout is 300ms. The following logs are from 
[https://develocity.apache.org/s/tjs4dzxiphmwc/tests/task/:metadata:test/details/org.apache.kafka.controller.QuorumControllerTest/testMinIsrUpdateWithElr()/1/output]:

{noformat}
...
[2025-03-18 07:14:45,121] DEBUG [QuorumController id=0] Processed 
processBrokerHeartbeat(1474681863) in 140186 us 
(org.apache.kafka.controller.QuorumController:542) <-- heartbeat for broker 1
...
[2025-03-18 07:14:45,147] DEBUG [QuorumController id=0] Processed 
processBrokerHeartbeat(399642596) in 10723 us 
(org.apache.kafka.controller.QuorumController:542) <-- heartbeat for broker 2
...
[2025-03-18 07:14:45,172] DEBUG [QuorumController id=0] Processed 
processBrokerHeartbeat(168471181) in 13236 us 
(org.apache.kafka.controller.QuorumController:542) <-- heartbeat for broker 3
...
[2025-03-18 07:14:45,288] INFO [QuorumController id=0] CreateTopics result(s): 
CreatableTopic(name='foo', ...) 
(org.apache.kafka.metalog.LocalLogManager$SharedLogData:258) <-- topic creation
...
[2025-03-18 07:14:45,455] INFO [QuorumController id=0] Fencing broker 1 at 
epoch 6 because its session has timed out. 
(org.apache.kafka.controller.ReplicationControlManager:1693) <-- broker 1 
session timeout
{noformat}

At 07:14:45,121, a heartbeat for broker 1 is processed and the broker is 
active. However, no further heartbeat arrives before 07:14:45,421 (the last 
heartbeat plus the 300ms session timeout), so it is fenced at 07:14:45,455.
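
The timing above can be checked with a few lines of Java. This is only a 
sketch of the arithmetic: the class and method names are made up for 
illustration, and it treats the log timestamp of the processed heartbeat as 
the time of the last heartbeat.

```java
import java.time.Duration;
import java.time.LocalTime;

public class SessionDeadline {
    // Hypothetical helper: the time after which a broker that has sent no
    // further heartbeat is eligible for fencing.
    static LocalTime deadline(LocalTime lastHeartbeat, Duration sessionTimeout) {
        return lastHeartbeat.plus(sessionTimeout);
    }

    public static void main(String[] args) {
        // Timestamps taken from the log excerpt above.
        LocalTime lastHeartbeat = LocalTime.parse("07:14:45.121");
        LocalTime fencedAt = LocalTime.parse("07:14:45.455");
        Duration sessionTimeout = Duration.ofMillis(300);

        LocalTime deadline = deadline(lastHeartbeat, sessionTimeout);
        System.out.println("deadline = " + deadline);                   // 07:14:45.421
        System.out.println("fenced after deadline = " + fencedAt.isAfter(deadline)); // true
    }
}
```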

We can reproduce this by adding Thread.sleep(300) just after 
active.createTopics [0]. IMO, to fix the root cause, we can use a separate 
thread to send heartbeat requests, so broker 1 has no chance to get fenced.

[0] 
https://github.com/apache/kafka/blob/1ded681684e771b16aa98ae751f39b9816345a83/metadata/src/test/java/org/apache/kafka/controller/QuorumControllerTest.java#L663-L665
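
A rough sketch of the separate-thread idea, independent of the Kafka test 
code. BackgroundHeartbeat and the sendHeartbeat callback are hypothetical 
names; in the real test the callback would wrap whatever helper the test 
already uses to send heartbeats, and the 50ms period is just an assumed value 
well below the 300ms session timeout.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

public class BackgroundHeartbeat implements AutoCloseable {
    private final ScheduledExecutorService scheduler =
        Executors.newSingleThreadScheduledExecutor(r -> {
            Thread t = new Thread(r, "test-heartbeat");
            t.setDaemon(true); // never keep the JVM alive because of the test helper
            return t;
        });

    // sendHeartbeat is a hypothetical callback supplied by the test.
    public BackgroundHeartbeat(Runnable sendHeartbeat, long periodMs) {
        scheduler.scheduleAtFixedRate(() -> {
            try {
                sendHeartbeat.run();
            } catch (RuntimeException e) {
                // An uncaught exception would cancel all later runs of the
                // periodic task, so it is swallowed in this sketch.
            }
        }, 0, periodMs, TimeUnit.MILLISECONDS);
    }

    @Override
    public void close() {
        scheduler.shutdownNow(); // stop sending heartbeats
    }

    public static void main(String[] args) throws Exception {
        AtomicInteger sent = new AtomicInteger();
        try (BackgroundHeartbeat hb = new BackgroundHeartbeat(sent::incrementAndGet, 50)) {
            // Stands in for a slow step on the test thread, such as the
            // Thread.sleep(300) used to reproduce the failure.
            Thread.sleep(300);
        }
        System.out.println("heartbeats sent during the pause: " + sent.get());
    }
}
```

With this shape, the main test thread can block for longer than the session 
timeout while heartbeats keep flowing from the scheduler thread.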

> Fix flaky QuorumControllerTest.testMinIsrUpdateWithElr
> ------------------------------------------------------
>
>                 Key: KAFKA-18981
>                 URL: https://issues.apache.org/jira/browse/KAFKA-18981
>             Project: Kafka
>          Issue Type: Improvement
>            Reporter: Chia-Ping Tsai
>            Assignee: Chia-Ping Tsai
>            Priority: Major
>
> {code:java}
> org.opentest4j.AssertionFailedError: PartitionRegistration(replicas=[3, 1, 
> 2], directories=[vFsJEjZDRlONBTr8543h6A, gQ7fQdQaTFmyno0jsZ2LVA, 
> 4OWYkhmOTO2eaTFzTyiMEg], isr=[], removingReplicas=[], addingReplicas=[], 
> elr=[3], lastKnownElr=[3], leader=-1, leaderRecoveryState=RECOVERED, 
> leaderEpoch=1, partitionEpoch=3) ==> array lengths differ, expected: <1> but 
> was: <0>
>       at 
> app//org.junit.jupiter.api.AssertionFailureBuilder.build(AssertionFailureBuilder.java:151)
>       at 
> app//org.junit.jupiter.api.AssertionFailureBuilder.buildAndThrow(AssertionFailureBuilder.java:132)
>       at 
> app//org.junit.jupiter.api.AssertArrayEquals.assertArraysHaveSameLength(AssertArrayEquals.java:428)
>       at 
> app//org.junit.jupiter.api.AssertArrayEquals.assertArrayEquals(AssertArrayEquals.java:237)
>       at 
> app//org.junit.jupiter.api.AssertArrayEquals.assertArrayEquals(AssertArrayEquals.java:87)
>       at 
> app//org.junit.jupiter.api.Assertions.assertArrayEquals(Assertions.java:1290)
>       at 
> app//org.apache.kafka.controller.QuorumControllerTest.testMinIsrUpdateWithElr(QuorumControllerTest.java:699)
>       at java.base@17.0.14/java.lang.reflect.Method.invoke(Method.java:569)
>       at java.base@17.0.14/java.util.ArrayList.forEach(ArrayList.java:1511)
>       at java.base@17.0.14/java.util.ArrayList.forEach(ArrayList.java:1511)
> {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)