[ https://issues.apache.org/jira/browse/IGNITE-24069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Denis Chudov updated IGNITE-24069:
----------------------------------
    Description: 
*Motivation*

In Raft, a configuration switch requires joint consensus, in which the nodes 
from both the old and the new configuration take part with their corresponding 
roles. Therefore, a node that is a learner in the previous configuration cannot 
simply be included as a follower in the new one. The joint consensus rule 
requires that this node first be removed as a learner and only then be included 
into the next configuration as a peer, so there will be two configuration 
switches. A downgrade (from peer to learner) works the same way.

The handlers of the pending and stable assignments’ switches should be aware of 
the case when some node (say, node A) is turned from a learner into a peer or, 
conversely, from a peer into a learner. Either an upgrade or a downgrade 
requires two consecutive configuration switches: for an upgrade, node A is 
removed as a learner in the first one and added as a peer in the second one; a 
downgrade is the mirror image.
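A minimal Java sketch of this split follows. It assumes a simplified configuration made of two sets of node names. The `RaftConfig` record and the `RoleChangeSplitter` class are hypothetical and are not Ignite classes. Every node that changes role is first removed from its old role, and it reappears with its new role only in the target configuration. The sketch does not check that each intermediate configuration keeps enough peers for a quorum.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.Set;
import java.util.TreeSet;

/** Hypothetical helper: splits a rebalance into configuration switches without direct role changes. */
class RoleChangeSplitter {
    /** Returns the configurations to apply, in order, to move from {@code stable} to {@code target}. */
    static Deque<RaftConfig> split(RaftConfig stable, RaftConfig target) {
        // Upgrades: learners in the stable configuration that become peers.
        Set<String> roleChanging = new TreeSet<>(stable.learners());
        roleChanging.retainAll(target.peers());

        // Downgrades: peers in the stable configuration that become learners.
        Set<String> downgraded = new TreeSet<>(stable.peers());
        downgraded.retainAll(target.learners());
        roleChanging.addAll(downgraded);

        Deque<RaftConfig> queue = new ArrayDeque<>();
        if (!roleChanging.isEmpty()) {
            // First switch: drop the role-changing nodes from their old role.
            // (Not checked here: the remaining peers must still form a usable quorum.)
            Set<String> peers = new TreeSet<>(stable.peers());
            peers.removeAll(roleChanging);
            Set<String> learners = new TreeSet<>(stable.learners());
            learners.removeAll(roleChanging);
            queue.addLast(new RaftConfig(peers, learners));
        }
        // Last switch: the target, where those nodes come back in their new role.
        queue.addLast(target);
        return queue;
    }

    public static void main(String[] args) {
        // Node A is upgraded from learner to peer; node D is downgraded from peer to learner.
        RaftConfig stable = new RaftConfig(Set.of("B", "C", "D"), Set.of("A"));
        RaftConfig target = new RaftConfig(Set.of("A", "B", "C"), Set.of("D"));
        split(stable, target).forEach(System.out::println);
    }
}

/** Simplified Raft configuration: voting peers plus non-voting learners (hypothetical type). */
record RaftConfig(Set<String> peers, Set<String> learners) {}
```

For the sample in `main`, the queue has two elements. The first configuration has peers B and C and no learners. The second is the target, where A is a peer and D is a learner.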

The values stored in the meta storage under the pending assignments prefix 
"assignments.pending." should be turned into a pending assignments queue (PAQ). 
The queue is created for a replication group either by the rebalance trigger or 
during the switch from planned to pending assignments, when it is detected that 
a direct transition from the stable assignments to the pending ones is not 
possible. Each element of the queue contains an intermediate state of the Raft 
configuration, and only the last element is the target assignments.
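The queue can be modeled as in the sketch below. This is a hypothetical in-memory representation, and `PendingAssignmentsQueue` and `Assignments` are illustrative names. The sketch does not show how the queue would be serialized into the meta storage value under "assignments.pending.". A queue with a single element behaves like the former single pending value, which is why the queue can be introduced without changing the rebalance logic.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.List;
import java.util.NoSuchElementException;
import java.util.Set;

/** Hypothetical in-memory model of the PAQ of one replication group. */
class PendingAssignmentsQueue {
    // Head: the configuration to apply next. Tail: the target assignments of the rebalance.
    private final Deque<Assignments> steps;

    PendingAssignmentsQueue(List<Assignments> steps) {
        if (steps.isEmpty()) {
            throw new IllegalArgumentException("The queue must contain at least the target assignments");
        }
        this.steps = new ArrayDeque<>(steps);
    }

    /** A queue without intermediate steps, equivalent to the former single pending value. */
    static PendingAssignmentsQueue of(Assignments target) {
        return new PendingAssignmentsQueue(List.of(target));
    }

    /** The configuration that should be applied next, or null once the queue is exhausted. */
    Assignments current() {
        return steps.peekFirst();
    }

    /** The target assignments of the whole rebalance, or null once the queue is exhausted. */
    Assignments target() {
        return steps.peekLast();
    }

    /** Removes the applied head; returns false if no steps are left, i.e. the rebalance is done. */
    boolean advance() {
        if (steps.isEmpty()) {
            throw new NoSuchElementException("No rebalance in progress");
        }
        steps.pollFirst();
        return !steps.isEmpty();
    }

    int size() {
        return steps.size();
    }

    public static void main(String[] args) {
        Assignments intermediate = new Assignments(Set.of("B", "C"), Set.of());
        Assignments target = new Assignments(Set.of("A", "B", "C"), Set.of("D"));
        PendingAssignmentsQueue paq = new PendingAssignmentsQueue(List.of(intermediate, target));
        System.out.println("apply first: " + paq.current());
        System.out.println("more steps: " + paq.advance()); // more steps: true
        System.out.println("apply next: " + paq.current());
        System.out.println("more steps: " + paq.advance()); // more steps: false
    }
}

/** Simplified assignments: voting peers and non-voting learners, by node name (hypothetical). */
record Assignments(Set<String> peers, Set<String> learners) {}
```

In this sketch the queue is never empty while a rebalance is in progress. When `advance()` returns false, the last applied element was the target.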

It is important that the whole queue logically represents one rebalance, 
scheduled by a single trigger, and that it is modified only in the course of 
that rebalance. The meaning of stable and planned assignments does not change: 
the stable assignments’ switch happens only after the whole pending assignments 
queue has been processed. Therefore, no replicas should be stopped until that 
moment (only Raft configurations may change), because replicas are stopped and 
storages are deleted only by the stable assignments’ change listener.
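The sketch below illustrates this ordering constraint under simplifying assumptions. Assignments are reduced to sets of node names. The hypothetical `onConfigurationApplied()` callback is invoked once the configuration at the head of the queue has been applied. Replicas are stopped only when the last element has been applied and the stable assignments are switched.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

/** Hypothetical per-group state: stable assignments plus the pending assignments queue. */
class StableSwitchSketch {
    private Set<String> stable;
    private final Deque<Set<String>> pending;
    /** Nodes whose replicas were stopped; only the stable switch adds to this set. */
    private final Set<String> stoppedReplicas = new TreeSet<>();

    StableSwitchSketch(Set<String> stable, List<Set<String>> pendingQueue) {
        this.stable = stable;
        this.pending = new ArrayDeque<>(pendingQueue);
    }

    /** Hypothetical callback: the Raft configuration at the head of the queue has been applied. */
    void onConfigurationApplied() {
        Set<String> applied = pending.pollFirst();
        if (applied == null) {
            throw new IllegalStateException("No rebalance in progress");
        }
        if (!pending.isEmpty()) {
            // Intermediate configuration: only the Raft configuration changed, replicas keep running.
            return;
        }
        // The whole queue has been processed: switch the stable assignments...
        Set<String> removed = new TreeSet<>(stable);
        removed.removeAll(applied);
        stable = applied;
        // ...and only now stop replicas (and delete storages) on nodes that left the group.
        stoppedReplicas.addAll(removed);
    }

    Set<String> stable() {
        return stable;
    }

    Set<String> stoppedReplicas() {
        return stoppedReplicas;
    }

    boolean rebalanceInProgress() {
        return !pending.isEmpty();
    }

    public static void main(String[] args) {
        // A is a learner being upgraded to peer; D leaves the group.
        StableSwitchSketch group = new StableSwitchSketch(
                Set.of("A", "B", "C", "D"),
                List.of(Set.of("B", "C", "D"), Set.of("A", "B", "C")));
        group.onConfigurationApplied();
        System.out.println("after step 1, stopped: " + group.stoppedReplicas()); // after step 1, stopped: []
        group.onConfigurationApplied();
        System.out.println("after step 2, stopped: " + group.stoppedReplicas()); // after step 2, stopped: [D]
    }
}
```

In `main`, node A is absent from the intermediate Raft configuration, yet its replica is not stopped. Only node D, which is not part of the final assignments, is stopped, and only after the stable switch.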

*Definition of done*

Pending assignments are turned into a queue without any change in the rebalance 
logic. This is a prerequisite for further changes.

The pending assignments’ change handler should process the first element of the 
PAQ, calling changePeersAndLearnersAsync() with the assignments from it.
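The sketch below shows the handler's contract: on a pending assignments update it applies only the head of the queue. `RaftClient` is a hypothetical stand-in. Its method only mirrors the name changePeersAndLearnersAsync(), and its parameters are an assumption that does not reproduce the real Ignite signature. Removing the head once it has been applied is left to the other listeners.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import java.util.Set;
import java.util.concurrent.CompletableFuture;

/** Hypothetical pending assignments' change handler. */
class PendingChangeHandler {
    /** Hypothetical stand-in for the Raft client; not the real Ignite signature. */
    interface RaftClient {
        CompletableFuture<Void> changePeersAndLearnersAsync(Set<String> peers, Set<String> learners);
    }

    /** Simplified assignments (hypothetical type). */
    record Assignments(Set<String> peers, Set<String> learners) {}

    private final RaftClient raft;

    PendingChangeHandler(RaftClient raft) {
        this.raft = raft;
    }

    /** Reacts to a pending assignments update by applying only the head of the queue. */
    CompletableFuture<Void> onPendingAssignmentsChanged(Deque<Assignments> pendingQueue) {
        Assignments head = pendingQueue.peekFirst();
        if (head == null) {
            return CompletableFuture.completedFuture(null); // Nothing pending.
        }
        // The queue is only read here; later elements are applied after the head is removed.
        return raft.changePeersAndLearnersAsync(head.peers(), head.learners());
    }

    public static void main(String[] args) {
        List<Assignments> calls = new ArrayList<>();
        PendingChangeHandler handler = new PendingChangeHandler((peers, learners) -> {
            calls.add(new Assignments(peers, learners));
            return CompletableFuture.completedFuture(null);
        });
        Deque<Assignments> paq = new ArrayDeque<>(List.of(
                new Assignments(Set.of("B", "C", "D"), Set.of()),
                new Assignments(Set.of("A", "B", "C"), Set.of())));
        handler.onPendingAssignmentsChanged(paq).join();
        System.out.println("Raft calls: " + calls.size()); // Raft calls: 1
    }
}
```

Applying one element per update keeps each Raft configuration change independent. Every change is driven by the head of the queue that is currently stored.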

The listeners of leader re-election and primary replica change should also be 
adjusted.

*Implementation notes*

There are two different pending assignments prefixes, one for tables and one 
for zones (until data colocation is implemented and the responsibility for 
partitions is fully transferred to zones): RebalanceUtil#PENDING_ASSIGNMENTS_PREFIX 
and ZoneRebalanceUtil#PENDING_ASSIGNMENTS_PREFIX. This ticket covers both of them.



> Turn the pending assignments into a queue
> -----------------------------------------
>
>                 Key: IGNITE-24069
>                 URL: https://issues.apache.org/jira/browse/IGNITE-24069
>             Project: Ignite
>          Issue Type: Improvement
>            Reporter: Denis Chudov
>            Priority: Major
>              Labels: ignite-3
>



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
