[ https://issues.apache.org/jira/browse/KAFKA-18981?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17937328#comment-17937328 ]
PoAn Yang commented on KAFKA-18981:
-----------------------------------

The root cause of this flaky test: if broker 1 does not get a heartbeat processed promptly and is fenced after the topic creation, it cannot be in the ISR. The session timeout is 300ms. The following logs are from [https://develocity.apache.org/s/tjs4dzxiphmwc/tests/task/:metadata:test/details/org.apache.kafka.controller.QuorumControllerTest/testMinIsrUpdateWithElr()/1/output]:

{noformat}
...
[2025-03-18 07:14:45,121] DEBUG [QuorumController id=0] Processed processBrokerHeartbeat(1474681863) in 140186 us (org.apache.kafka.controller.QuorumController:542) <-- heartbeat for broker 1
...
[2025-03-18 07:14:45,147] DEBUG [QuorumController id=0] Processed processBrokerHeartbeat(399642596) in 10723 us (org.apache.kafka.controller.QuorumController:542) <-- heartbeat for broker 2
...
[2025-03-18 07:14:45,172] DEBUG [QuorumController id=0] Processed processBrokerHeartbeat(168471181) in 13236 us (org.apache.kafka.controller.QuorumController:542) <-- heartbeat for broker 3
...
[2025-03-18 07:14:45,288] INFO [QuorumController id=0] CreateTopics result(s): CreatableTopic(name='foo', ...) (org.apache.kafka.metalog.LocalLogManager$SharedLogData:258) <-- topic creation
...
[2025-03-18 07:14:45,455] INFO [QuorumController id=0] Fencing broker 1 at epoch 6 because its session has timed out. (org.apache.kafka.controller.ReplicationControlManager:1693) <-- broker 1 session timeout
{noformat}

At 07:14:45,121, broker 1 gets a heartbeat and is active. However, it does not get another heartbeat before 07:14:45,421 (300ms later), so it is fenced at 07:14:45,455. We can reproduce this by adding Thread.sleep(300) just after active.createTopics [0].

IMO, to solve the root cause, we can use another thread to send the heartbeat requests, so broker 1 has no chance of being fenced.
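As a rough illustration of the idea above (this is not the actual patch): the sketch below models a session that expires when no heartbeat arrives within the timeout, plus a dedicated thread that keeps sending heartbeats. HeartbeatSketch, Session and startHeartbeats are hypothetical names made up for this example and are not Kafka APIs. The expiry rule (gap strictly greater than the timeout) is also an assumption. With the timestamps from the log, the gap between 07:14:45,121 and 07:14:45,455 is 334ms, which exceeds the 300ms timeout.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicLong;

// Illustration only: these classes are hypothetical and are not part of Kafka.
final class HeartbeatSketch {

    // Minimal stand-in for a broker session that expires without heartbeats.
    static final class Session {
        private final long timeoutNs;
        private final AtomicLong lastHeartbeatNs;

        Session(long timeoutMs, long nowNs) {
            this.timeoutNs = TimeUnit.MILLISECONDS.toNanos(timeoutMs);
            this.lastHeartbeatNs = new AtomicLong(nowNs);
        }

        // Records the time of the most recent heartbeat.
        void heartbeat(long nowNs) {
            lastHeartbeatNs.set(nowNs);
        }

        // Assumption: the session counts as expired once the gap exceeds the timeout.
        boolean isExpired(long nowNs) {
            return nowNs - lastHeartbeatNs.get() > timeoutNs;
        }
    }

    // Sends heartbeats from a dedicated daemon thread, independent of the test thread.
    static ScheduledExecutorService startHeartbeats(Session session, long periodMs) {
        ScheduledExecutorService executor = Executors.newSingleThreadScheduledExecutor(r -> {
            Thread t = new Thread(r, "heartbeat-sketch");
            t.setDaemon(true);
            return t;
        });
        executor.scheduleAtFixedRate(
            () -> session.heartbeat(System.nanoTime()), 0, periodMs, TimeUnit.MILLISECONDS);
        return executor;
    }
}
```

In a test, something like startHeartbeats(session, 50) would be called before the slow step (here, the topic creation), and the returned executor would be stopped with shutdownNow() in a finally block. The period only needs to be comfortably shorter than the session timeout.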
[0] https://github.com/apache/kafka/blob/1ded681684e771b16aa98ae751f39b9816345a83/metadata/src/test/java/org/apache/kafka/controller/QuorumControllerTest.java#L663-L665

> Fix flaky QuorumControllerTest.testMinIsrUpdateWithElr
> ------------------------------------------------------
>
>                 Key: KAFKA-18981
>                 URL: https://issues.apache.org/jira/browse/KAFKA-18981
>             Project: Kafka
>          Issue Type: Improvement
>            Reporter: Chia-Ping Tsai
>            Assignee: Chia-Ping Tsai
>            Priority: Major
>
> {code:java}
> org.opentest4j.AssertionFailedError: PartitionRegistration(replicas=[3, 1, 2], directories=[vFsJEjZDRlONBTr8543h6A, gQ7fQdQaTFmyno0jsZ2LVA, 4OWYkhmOTO2eaTFzTyiMEg], isr=[], removingReplicas=[], addingReplicas=[], elr=[3], lastKnownElr=[3], leader=-1, leaderRecoveryState=RECOVERED, leaderEpoch=1, partitionEpoch=3) ==> array lengths differ, expected: <1> but was: <0>
>     at app//org.junit.jupiter.api.AssertionFailureBuilder.build(AssertionFailureBuilder.java:151)
>     at app//org.junit.jupiter.api.AssertionFailureBuilder.buildAndThrow(AssertionFailureBuilder.java:132)
>     at app//org.junit.jupiter.api.AssertArrayEquals.assertArraysHaveSameLength(AssertArrayEquals.java:428)
>     at app//org.junit.jupiter.api.AssertArrayEquals.assertArrayEquals(AssertArrayEquals.java:237)
>     at app//org.junit.jupiter.api.AssertArrayEquals.assertArrayEquals(AssertArrayEquals.java:87)
>     at app//org.junit.jupiter.api.Assertions.assertArrayEquals(Assertions.java:1290)
>     at app//org.apache.kafka.controller.QuorumControllerTest.testMinIsrUpdateWithElr(QuorumControllerTest.java:699)
>     at java.base@17.0.14/java.lang.reflect.Method.invoke(Method.java:569)
>     at java.base@17.0.14/java.util.ArrayList.forEach(ArrayList.java:1511)
>     at java.base@17.0.14/java.util.ArrayList.forEach(ArrayList.java:1511)
> {code}

--
This message was sent by Atlassian Jira
(v8.20.10#820010)