sanpwc commented on code in PR #6210:
URL: https://github.com/apache/ignite-3/pull/6210#discussion_r2197330114
##########
modules/distribution-zones/src/main/java/org/apache/ignite/internal/distributionzones/rebalance/ZoneRebalanceRaftGroupEventsListener.java:
##########
@@ -188,11 +193,47 @@ public void onLeaderElected(long term) {
         try {
             rebalanceAttempts.set(0);
-            byte[] pendingAssignmentsBytes = metaStorageMgr.getLocally(pendingPartAssignmentsQueueKey(zonePartitionId)).value();
+            // TODO https://issues.apache.org/jira/browse/IGNITE-23633
+            // First of all, it's required to reread pending assignments and recheck whether it's still needed to perform the
+            // rebalance. Worth mentioning that one of the legitimate cases of a metaStorageMgr.get() timeout is MG unavailability;
+            // in that case it's required to retry the request. However, it's important to handle the local node's stopping intent,
+            // meaning that busyLock should be handled properly, with a throttle, so that the node is still able to stop.
+
+            // It's required to read pending assignments from the MS leader instead of the local MS in order not to pick up stale
+            // pending ones:
+            // Let's say that node A has processed configurations C1 and C2 and therefore moved the raft group to configuration C2
+            // for partition P1.
+            // Node B was elected as partition P1 leader; however, node B is a bit outdated within the MS timeline, thus it has
+            // C1 as pending assignments. If it proposes C1 as the "new" configuration and then fails, raft will get stuck on the
+            // old configuration and won't make further progress.
+            byte[] pendingAssignmentsBytes = metaStorageMgr.get(pendingPartAssignmentsQueueKey(zonePartitionId)).get().value();
             if (pendingAssignmentsBytes != null) {
                 Set<Assignment> pendingAssignments = AssignmentsQueue.fromBytes(pendingAssignmentsBytes).poll().nodes();
+                // A race is possible between doStableSwitch processing on the node collocated with the former leader
+                // and onLeaderElected on the new leader:
+                // 1. Node A, collocated with the former leader, successfully moved raft from C0 to C1, but did not receive
+                //    onNewPeersConfigurationApplied yet and thus didn't do doStableSwitch.
+                // 2. Node B happened to be collocated with the new leader and thus received onLeaderElected, checked pending
+                //    assignments in the meta storage and retried changePeersAndLearnersAsync(C1).
+                // 3. At the very same moment node A performs doStableSwitch, switching pending assignments from C1 to C2.
+                // 4. Node B gets the meta storage notification about the new pending C2 and sends changePeersAndLearnersAsync(C2).
+                // changePeersAndLearnersAsync(C1) and changePeersAndLearnersAsync(C2) from node B may reorder.
+                // In order to eliminate this, we may check the raft configuration on leader election and, if it matches the one in
+                // the current global pending assignments, call doStableSwitch instead of changePeersAndLearners, since raft is
+                // already on the required configuration.
+                if (PeersAndLearners.fromAssignments(pendingAssignments).equals(configuration)) {
+                    doStableKeySwitch(

Review Comment:
   > why do you use this method but not doStableKeySwitchWithExceptionHandling?
   
   There's no reason. Switched to `doStableKeySwitchWithExceptionHandling`.
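For readers of this thread, here is a minimal sketch (not the actual PR code) of the `onLeaderElected` flow the new comments describe: read pending assignments from the MS leader and, when the raft group is already on that configuration, finish the pending-to-stable switch instead of re-proposing it. Only the identifiers visible in the hunk (`metaStorageMgr.get()`, `pendingPartAssignmentsQueueKey()`, `AssignmentsQueue.fromBytes()`, `PeersAndLearners.fromAssignments()`, `doStableKeySwitchWithExceptionHandling`, `changePeersAndLearnersAsync`) come from the PR; the argument lists, the way `configuration` is obtained, and the error handling are assumptions.

```java
// Sketch only: Ignite-internal imports (Assignment, AssignmentsQueue, PeersAndLearners, ...)
// are omitted, and the real method signatures are assumed; this mirrors the logic in the hunk above.
public void onLeaderElected(long term) {
    try {
        rebalanceAttempts.set(0);

        // Ask the meta storage leader (not the local replica via getLocally()), so a lagging
        // leader-collocated node cannot read stale pending assignments such as C1.
        byte[] pendingAssignmentsBytes =
                metaStorageMgr.get(pendingPartAssignmentsQueueKey(zonePartitionId)).get().value();

        if (pendingAssignmentsBytes == null) {
            return; // No rebalance in progress for this partition.
        }

        Set<Assignment> pendingAssignments = AssignmentsQueue.fromBytes(pendingAssignmentsBytes).poll().nodes();
        PeersAndLearners pendingConfiguration = PeersAndLearners.fromAssignments(pendingAssignments);

        // 'configuration' is the raft group configuration observed on this election; how it is
        // obtained is out of scope for this sketch.
        if (pendingConfiguration.equals(configuration)) {
            // Raft has already applied the pending configuration; re-sending
            // changePeersAndLearnersAsync(C1) could reorder with a later changePeersAndLearnersAsync(C2),
            // so complete the pending -> stable switch instead (argument list assumed).
            doStableKeySwitchWithExceptionHandling(pendingAssignments, term);
        } else {
            // Normal path: propose the pending configuration to the raft group (arguments assumed).
            changePeersAndLearnersAsync(pendingConfiguration, term);
        }
    } catch (InterruptedException | ExecutionException e) {
        // TODO IGNITE-23633: retry on MG unavailability while respecting busyLock / node stop intent.
    }
}
```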