sanpwc commented on code in PR #6210:
URL: https://github.com/apache/ignite-3/pull/6210#discussion_r2197330114


##########
modules/distribution-zones/src/main/java/org/apache/ignite/internal/distributionzones/rebalance/ZoneRebalanceRaftGroupEventsListener.java:
##########
@@ -188,11 +193,47 @@ public void onLeaderElected(long term) {
                 try {
                     rebalanceAttempts.set(0);
 
-                    byte[] pendingAssignmentsBytes = 
metaStorageMgr.getLocally(pendingPartAssignmentsQueueKey(zonePartitionId)).value();
+                    // TODO https://issues.apache.org/jira/browse/IGNITE-23633
+                    // First of all, it's required to reread pending 
assignments and recheck whether it's still needed to perform the
+                    // rebalance. Worth mentioning that one of legitimate 
cases of metaStorageMgr.get() timeout is MG unavailability
+                    // in that cases it's required to retry the request. 
However it's important to handle local node stopping intent,
+                    // meaning that busyLock should be handled properly with 
though of a throttle to provide an ability for node to stop.
+
+                    // It's required to read pending assignments from MS 
leader instead of local MS in order not to catch-up stale pending
+                    // ones:
+                    // Let's say that node A has processed configurations ะก1 
and C2 and therefore moved the raft to configuration C2 for
+                    // partition P1.
+                    // Node B was elected as partition P1 leader, however 
locally Node B is a bit outdated within MS timeline, thus it has
+                    // C1 as pending assignments. If it'll propose C1 as "new" 
configuration and then fall, raft will stuck on old
+                    // configuration and won't do further progress.
+                    byte[] pendingAssignmentsBytes = 
metaStorageMgr.get(pendingPartAssignmentsQueueKey(zonePartitionId)).get().value();
 
                     if (pendingAssignmentsBytes != null) {
                         Set<Assignment> pendingAssignments = 
AssignmentsQueue.fromBytes(pendingAssignmentsBytes).poll().nodes();
 
+                        // The race is possible between doStableSwitch 
processing on the node colocated with former leader
+                        // and onLeaderElected of the new leader:
+                        // 1. Node A that was colocated with former leader 
successfully moved raft from C0 to C1, however did not receive
+                        // onNewPeersConfigurationApplied yet and thus didn't 
do doStableSwitch.
+                        // 2. New node B occurred to be collocated with new 
leader and thus received onLeaderElected, checked pending
+                        // assignments in meta storage and retried 
changePeersAndLearnersAsync(C1).
+                        // 3. At the very same moment Node A performs 
doStableSwitch, meaning that it switches assignments pending from
+                        // C1 to C2.
+                        // 4.Node B gets meta storage notification about new 
pending C2 and sends changePeersAndLearnersAsync(C2)
+                        // changePeersAndLearnersAsync(C1) and 
changePeersAndLearnersAsync(C2) from Node B may reorder.
+                        // In order to eliminate this we may check raft 
configuration on leader election and if it matches the one in
+                        // current global pending assignments call 
doStableSwitch instead of changePeersAndLearners since raft already on
+                        // required configuration.
+                        if 
(PeersAndLearners.fromAssignments(pendingAssignments).equals(configuration)) {
+                            doStableKeySwitch(

Review Comment:
   > why do you use this method but not doStableKeySwitchWithExceptionHandling?
   
   There's no reason. Switched to `doStableKeySwitchWithExceptionHandling`.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: notifications-unsubscr...@ignite.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

Reply via email to