mumrah opened a new pull request, #12269:
URL: https://github.com/apache/kafka/pull/12269

   This patch does two things to tighten up races during bootstrap. One is to 
simply move the volatile write until after the "bootstrapMetadata" event has 
been enqueued. The other thing is to prepend "bootstrapMetadata" to the queue.
   
   From the JIRA, this test was flaky when we see the "registerBroker" event 
precede the "boostrapMetadata" event
   
   ```
   handleLeaderChange() start
   appendWriteEvent(registerBroker)
   appendWriteEvent(bootstrapMetadata)
   handleLeaderChange() finish
   registerBroker() -> writes broker registration to log
   bootstrapMetadata() -> writes bootstrap metadata to log
   ```
   
   The test harness in QuorumControllerTest is polling the controllers to see 
when their current leader epoch matches the expected leader epoch
   
   ```java
           QuorumController activeController() throws InterruptedException {
           AtomicReference<QuorumController> value = new 
AtomicReference<>(null);
           TestUtils.retryOnExceptionWithTimeout(20000, 3, () -> {
               LeaderAndEpoch leader = logEnv.leaderAndEpoch();
               for (QuorumController controller : controllers) {
                   if 
(OptionalInt.of(controller.nodeId()).equals(leader.leaderId()) &&
                 --->  controller.curClaimEpoch() == leader.epoch()) { <---
                       value.set(controller);
                       break;
                   }
               }
   
               if (value.get() == null) {
                   throw new RuntimeException(String.format("Expected to see %s 
as leader", leader));
               }
           });
   
           return value.get();
       }
   ```
   
   Prior to this patch, we were setting the `curClaimEpoch` volatile before 
scheduling the "bootstrapMetadata" event. This creates a race where the test 
code can see the controller as active before the bootstrap metadata has been 
enqueued. By moving the volatile write after the "bootstrapMetadata" event 
enqueueing, we can eliminate this race.
   
   In production, there is nothing stopping clients from making requests to a 
controller that is in the process of starting up. Once the socket server begins 
processing requests, we will begin seeing events on the QuorumController queue. 
The check for active controller is done when events are processed, so we could 
have events enqueued at any time. By prepending the "bootstrapMetadata" event, 
we can ensure any events that see the controller as active will come after the 
bootstrap metadata has been processed.
   
   If we handle a "registerBroker" event before "bootstrapMetadata" is 
enqueued, then we are sure to see `curClaimEpoch = -1`. If a "registerBroker" 
is waiting in the queue while we finish bootstrapping, then we are sure to 
enqueue "bootstrapMetadata" before it. This means "registerBroker" (and any 
other event) is guaranteed to see the controller as inactive _or_ that it is 
active and the bootstrap metadata has been written.
   
   
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to