[ https://issues.apache.org/jira/browse/IGNITE-24069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Denis Chudov updated IGNITE-24069:
----------------------------------
    Description:
*Motivation*

In Raft, a configuration switch requires joint consensus, in which the nodes from the old and new configurations are included with their corresponding roles. So, we cannot simply include a node as a follower in the new configuration if it was a learner in the previous one. The rule of joint consensus requires that this node first be removed as a learner and only after that included in the next configuration as a peer, so there will be two configuration switches. Downgrading should look the same.

The handlers of the pending and stable assignments’ switch should be aware of the changes when some node (let’s say, node A) is turned from a learner into a peer or, conversely, from a peer into a learner. There should be two consecutive configuration switches for either upgrade or downgrade: in the first one, node A is removed from its old role (as a learner, in the upgrade case); in the second one, it is added in its new role (as a peer).

The values stored in the meta storage under the pending assignments prefix "assignments.pending." should be turned into a queue of pending assignments. The queue is created for a replication group by the rebalance trigger, or during the switch of planned assignments to pending when it is detected that a direct transition from stable assignments to pending is not possible. Each element of the queue contains some intermediate state of the Raft configuration, and only the last element holds the target assignments.

It is important that the whole queue is logically a single rebalance, scheduled by a single trigger. It can be modified only in the process of rebalancing. The meaning of stable and planned assignments does not change, and the stable assignments’ switch happens only after the whole pending assignments queue has been processed. So, no replicas should be stopped until that moment (only Raft configurations may be changed), because replicas are stopped and storages are deleted only by the stable assignments’ change listener.

*Definition of done*

Pending assignments are turned into a queue without changing the logic. This is the prerequisite for further changes.

The pending assignments’ change handler should process the first element of the pending assignments queue (PAQ), performing changePeersAndLearnersAsync() with the assignments from it. Listeners of leader reelection and primary replica change should also be adjusted.

*Implementation notes*

There are 2 different pending assignments: for tables and for zones (until data colocation is implemented and the responsibility for partitions is fully transferred to zones): RebalanceUtil#PENDING_ASSIGNMENTS_PREFIX and ZoneRebalanceUtil#PENDING_ASSIGNMENTS_PREFIX. This ticket is about them both.

  was:
*Motivation*

In Raft, a configuration switch requires joint consensus, in which the nodes from the old and new configurations are included with their corresponding roles. So, we cannot simply include a node as a follower in the new configuration if it was a learner in the previous one. The rule of joint consensus requires that this node first be removed as a learner and only after that included in the next configuration as a peer, so there will be two configuration switches. Downgrading should look the same.

The handlers of the pending and stable assignments’ switch should be aware of the changes when some node (let’s say, node A) is turned from a learner into a peer or, conversely, from a peer into a learner. There should be two consecutive configuration switches for either upgrade or downgrade: in the first one, node A is removed from its old role (as a learner, in the upgrade case); in the second one, it is added in its new role (as a peer).

The values stored in the meta storage under the pending assignments prefix "assignments.pending." should be turned into a queue of pending assignments.
The queue is created for a replication group by the rebalance trigger, or during the switch of planned assignments to pending when it is detected that a direct transition from stable assignments to pending is not possible. Each element of the queue contains some intermediate state of the Raft configuration, and only the last element holds the target assignments.

It is important that the whole queue is logically a single rebalance, scheduled by a single trigger. It can be modified only in the process of rebalancing. The meaning of stable and planned assignments does not change, and the stable assignments’ switch happens only after the whole pending assignments queue has been processed. So, no replicas should be stopped until that moment (only Raft configurations may be changed), because replicas are stopped and storages are deleted only by the stable assignments’ change listener.

*Definition of done*

Pending assignments are turned into a queue without changing the logic. This is the prerequisite for further changes.

The pending assignments’ change handler should process the first element of the pending assignments queue (PAQ), performing changePeersAndLearnersAsync() with the assignments from it. Listeners of leader reelection and primary replica change should also be adjusted.


> Turn the pending assignments into a queue
> -----------------------------------------
>
>                 Key: IGNITE-24069
>                 URL: https://issues.apache.org/jira/browse/IGNITE-24069
>             Project: Ignite
>          Issue Type: Improvement
>            Reporter: Denis Chudov
>            Priority: Major
>              Labels: ignite-3
>


--
This message was sent by Atlassian Jira
(v8.20.10#820010)
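
For illustration only, the two-step learner/peer switch from the Motivation section could be sketched roughly as follows. This is a minimal sketch, not the Ignite implementation: the `Config` record, the `buildQueue` helper and the class name are hypothetical. It also assumes that the intermediate step is simply the stable configuration with the role-switching nodes removed, which the ticket does not specify.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashSet;
import java.util.Set;

class PendingAssignmentsQueueSketch {
    /** Hypothetical stand-in for a Raft configuration: voting peers plus learners. */
    record Config(Set<String> peers, Set<String> learners) {}

    /**
     * Builds the queue of pending assignments for one rebalance.
     * The last element is always the target; an intermediate element is
     * prepended only when some node would otherwise change its role directly.
     */
    static Deque<Config> buildQueue(Config stable, Config target) {
        // Nodes that would switch learner -> peer or peer -> learner in one step.
        Set<String> switching = new HashSet<>();
        for (String node : stable.learners()) {
            if (target.peers().contains(node)) {
                switching.add(node);
            }
        }
        for (String node : stable.peers()) {
            if (target.learners().contains(node)) {
                switching.add(node);
            }
        }

        Deque<Config> queue = new ArrayDeque<>();
        if (!switching.isEmpty()) {
            // Assumption: first drop the switching nodes from their old roles.
            Set<String> peers = new HashSet<>(stable.peers());
            Set<String> learners = new HashSet<>(stable.learners());
            peers.removeAll(switching);
            learners.removeAll(switching);
            queue.add(new Config(peers, learners));
        }
        // The target assignments are always the last element of the queue.
        queue.add(target);
        return queue;
    }

    public static void main(String[] args) {
        // Node A is a learner in the stable configuration and a peer in the target.
        Config stable = new Config(Set.of("B", "C"), Set.of("A"));
        Config target = new Config(Set.of("A", "B", "C"), Set.of());
        Deque<Config> queue = buildQueue(stable, target);
        System.out.println("steps: " + queue.size()); // prints "steps: 2"
    }
}
```

In this sketch a handler would take the head of the queue, apply it as one Raft configuration change, and move on to the next element only after that change completes. The stable assignments would be switched once the queue is empty.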