Dimitar Dimitrov created KAFKA-15052:
----------------------------------------

             Summary: Fix flaky test 
QuorumControllerTest.testBalancePartitionLeaders()
                 Key: KAFKA-15052
                 URL: https://issues.apache.org/jira/browse/KAFKA-15052
             Project: Kafka
          Issue Type: Test
            Reporter: Dimitar Dimitrov
            Assignee: Dimitar Dimitrov


Test failed at 
[https://ci-builds.apache.org/blue/organizations/jenkins/Kafka%2Fkafka/detail/trunk/1892/tests/]
 as well as in various local runs.

The test creates a topic, fences a broker, notes partition imbalance due to 
another broker taking over the partition the fenced broker lost, re-registers 
and unfences the fenced broker, sends `AlterPartition` for the lost partition 
adding the now unfenced broker back to its ISR, then waits for the partition 
imbalance to disappear.

The local failures seem to happen when the brokers (including the ones that 
never get fenced by the test) accidentally get fenced by losing their session 
due to reaching the (aggressively low for test purposes) session timeout.
The Cloudbees failure quoted above also seems to indicate that this happened:
{code:java}
...[truncated 738209 chars]...
23. (org.apache.kafka.controller.QuorumController:768)
[2023-06-02 18:17:22,202] DEBUG [QuorumController id=0] Scheduling write event 
for maybeBalancePartitionLeaders because scheduled (DEFERRED), checkIntervalNs 
(OptionalLong[1000000000]) and isImbalanced (true) 
(org.apache.kafka.controller.QuorumController:1401)
[2023-06-02 18:17:22,202] INFO [QuorumController id=0] Fencing broker 2 because 
its session has timed out. 
(org.apache.kafka.controller.ReplicationControlManager:1459)
[2023-06-02 18:17:22,203] DEBUG [QuorumController id=0] handleBrokerFenced: 
changing partition(s): foo-0, foo-1, foo-2 
(org.apache.kafka.controller.ReplicationControlManager:1750)
[2023-06-02 18:17:22,203] DEBUG [QuorumController id=0] partition change for 
foo-0 with topic ID 033_QSX7TfitL4SDzoeR4w: leader: 2 -> -1, leaderEpoch: 2 -> 
3, partitionEpoch: 2 -> 3 
(org.apache.kafka.controller.ReplicationControlManager:157)
[2023-06-02 18:17:22,204] DEBUG [QuorumController id=0] partition change for 
foo-1 with topic ID 033_QSX7TfitL4SDzoeR4w: isr: [2, 3] -> [3], leaderEpoch: 3 
-> 4, partitionEpoch: 4 -> 5 
(org.apache.kafka.controller.ReplicationControlManager:157)
[2023-06-02 18:17:22,204] DEBUG [QuorumController id=0] partition change for 
foo-2 with topic ID 033_QSX7TfitL4SDzoeR4w: leader: 2 -> -1, leaderEpoch: 2 -> 
3, partitionEpoch: 2 -> 3 
(org.apache.kafka.controller.ReplicationControlManager:157)
[2023-06-02 18:17:22,205] DEBUG append(batch=LocalRecordBatch(leaderEpoch=1, 
appendTimestamp=240, 
records=[ApiMessageAndVersion(PartitionChangeRecord(partitionId=0, 
topicId=033_QSX7TfitL4SDzoeR4w, isr=null, leader=-1, replicas=null, 
removingReplicas=null, addingReplicas=null, leaderRecoveryState=-1) at version 
0), ApiMessageAndVersion(PartitionChangeRecord(partitionId=1, 
topicId=033_QSX7TfitL4SDzoeR4w, isr=[3], leader=3, replicas=null, 
removingReplicas=null, addingReplicas=null, leaderRecoveryState=-1) at version 
0), ApiMessageAndVersion(PartitionChangeRecord(partitionId=2, 
topicId=033_QSX7TfitL4SDzoeR4w, isr=null, leader=-1, replicas=null, 
removingReplicas=null, addingReplicas=null, leaderRecoveryState=-1) at version 
0), ApiMessageAndVersion(BrokerRegistrationChangeRecord(brokerId=2, 
brokerEpoch=3, fenced=1, inControlledShutdown=0) at version 0)]), 
prevOffset=27) (org.apache.kafka.metalog.LocalLogManager$SharedLogData:253)
[2023-06-02 18:17:22,205] DEBUG [QuorumController id=0] Creating in-memory 
snapshot 27 (org.apache.kafka.timeline.SnapshotRegistry:197)
[2023-06-02 18:17:22,205] DEBUG [LocalLogManager 0] Node 0: running log check. 
(org.apache.kafka.metalog.LocalLogManager:512)
[2023-06-02 18:17:22,205] DEBUG [QuorumController id=0] Read-write operation 
maybeFenceReplicas(451616131) will be completed when the log reaches offset 27. 
(org.apache.kafka.controller.QuorumController:768)
[2023-06-02 18:17:22,208] INFO [QuorumController id=0] Fencing broker 3 because 
its session has timed out. 
(org.apache.kafka.controller.ReplicationControlManager:1459)
[2023-06-02 18:17:22,209] DEBUG [QuorumController id=0] handleBrokerFenced: 
changing partition(s): foo-1 
(org.apache.kafka.controller.ReplicationControlManager:1750)
[2023-06-02 18:17:22,209] DEBUG [QuorumController id=0] partition change for 
foo-1 with topic ID 033_QSX7TfitL4SDzoeR4w: leader: 3 -> -1, leaderEpoch: 4 -> 
5, partitionEpoch: 5 -> 6 
(org.apache.kafka.controller.ReplicationControlManager:157){code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

Reply via email to