[jira] [Updated] (IGNITE-19087) Cancel rebalance mechanism

Kirill Gusakov (Jira) Fri, 05 May 2023 06:53:45 -0700


     [ 
https://issues.apache.org/jira/browse/IGNITE-19087?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Kirill Gusakov updated IGNITE-19087:
------------------------------------
    Description: 
Sometimes we must cancel the ongoing rebalance:
 * We can receive an unrecoverable error from the replication group during the 
current rebalance
 * We can decide to cancel it manually

[!https://github.com/apache/ignite-3/raw/c276a334c4c3742520494bccdc957f4530c8ed7a/modules/distribution-zones/tech-notes/images/cancelRebalance.svg!|https://github.com/apache/ignite-3/blob/c276a334c4c3742520494bccdc957f4530c8ed7a/modules/distribution-zones/tech-notes/images/cancelRebalance.svg]

h3. 1. Put the cancel intent into the *.cancel key

To persist the cancel intent, we must save the (oldTopology, newTopology) pair 
of peer lists under the {{zoneId.assignment.cancel}} key. In addition, every 
invoke that updates the {{*.cancel}} key must carry the revision of the pending 
key whose rebalance is to be cancelled:
 {{    if(zoneId.assignment.pending.revision == inputRevision):
        zoneId.assignment.cancel = cancelValue
        return true
    else:
        return false}}
This check prevents a race between rebalance completion and persisting the 
cancel intent; without it, we could try to cancel the wrong rebalance process.
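
The revision check above can be sketched as a small in-memory model. This is only an illustration under stated assumptions: {{MetaStorage}}, {{put_cancel_intent}} and the {{zone1}} key prefix are hypothetical names, not the Ignite 3 API, and a real meta storage would run the comparison and the write as one atomic invoke.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Tuple


@dataclass
class MetaStorage:
    """Hypothetical toy store: one global revision counter, bumped on every put."""

    revision: int = 0
    # key -> (value, revision at which the key was last written)
    entries: Dict[str, Tuple[Any, int]] = field(default_factory=dict)

    def put(self, key: str, value: Any) -> int:
        self.revision += 1
        self.entries[key] = (value, self.revision)
        return self.revision

    def revision_of(self, key: str) -> int:
        # 0 means the key has never been written.
        return self.entries.get(key, (None, 0))[1]


def put_cancel_intent(
    storage: MetaStorage,
    zone_id: str,
    input_revision: int,
    old_topology: List[str],
    new_topology: List[str],
) -> bool:
    """Write the cancel intent only if the pending key is still at input_revision."""
    pending_key = f"{zone_id}.assignment.pending"
    cancel_key = f"{zone_id}.assignment.cancel"
    # Single-threaded sketch: in a real store this check-and-write must be atomic.
    if storage.revision_of(pending_key) == input_revision:
        storage.put(cancel_key, (old_topology, new_topology))
        return True
    return False


storage = MetaStorage()
rev = storage.put("zone1.assignment.pending", ["A", "B", "C"])

# The pending key is unchanged, so the cancel intent is accepted.
accepted = put_cancel_intent(storage, "zone1", rev, ["A", "B"], ["A", "B", "C"])

# A newer rebalance overwrites the pending key; the old revision is now stale.
storage.put("zone1.assignment.pending", ["A", "B", "D"])
stale = put_cancel_intent(storage, "zone1", rev, ["A", "B"], ["A", "B", "C"])
```

In this sketch the second call is rejected because the pending key has moved on to a newer rebalance, which is the "wrong rebalance" case the revision guard is meant to exclude.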
h3. 2. PrimaryReplica->ReplicationGroup cancel protocol

When the PrimaryReplica sends {{CancelRebalanceRequest(oldTopology, newTopology)}} 
to the ReplicationGroup, the following cases are possible:
 * The replication group has an ongoing rebalance oldTopology->newTopology. It 
must be cancelled, and the configuration state of the replication group must be 
cleaned up back to oldTopology.
 * The replication group has no ongoing rebalance and currentTopology==oldTopology. 
There is nothing to cancel, so it returns a success response.
 * The replication group has no ongoing rebalance and currentTopology==newTopology. 
The cancel request can't be executed, so it returns a response saying so. The 
recipient of this response (placement driver) must log this fact and run the 
same routine as for a usual rebalanceDone.
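
The three cases can be sketched as a single dispatch on the replication group's state. This is a minimal model, not the actual implementation: {{ReplicationGroup}}, {{CancelResult}} and {{handle_cancel}} are hypothetical names, topologies are modelled as frozen sets of peer names, and any state the description does not cover is rejected with an error.

```python
from dataclasses import dataclass
from enum import Enum
from typing import FrozenSet, Optional, Tuple

Topology = FrozenSet[str]


class CancelResult(Enum):
    CANCELLED = "cancelled"                  # case 1: ongoing rebalance was cancelled
    NOTHING_TO_CANCEL = "nothing_to_cancel"  # case 2: already at oldTopology
    ALREADY_REBALANCED = "already_done"      # case 3: rebalance finished, cannot cancel


@dataclass
class ReplicationGroup:
    current_topology: Topology
    # (oldTopology, newTopology) of the ongoing rebalance, or None if idle.
    ongoing: Optional[Tuple[Topology, Topology]] = None

    def handle_cancel(self, old: Topology, new: Topology) -> CancelResult:
        if self.ongoing == (old, new):
            # Case 1: stop the rebalance and clean the configuration back to old.
            self.ongoing = None
            self.current_topology = old
            return CancelResult.CANCELLED
        if self.ongoing is None and self.current_topology == old:
            # Case 2: nothing to cancel, report success.
            return CancelResult.NOTHING_TO_CANCEL
        if self.ongoing is None and self.current_topology == new:
            # Case 3: the caller should log this and run the usual rebalanceDone routine.
            return CancelResult.ALREADY_REBALANCED
        # Assumption: states outside the three described cases are treated as errors.
        raise ValueError("cancel request does not match the group state")


OLD: Topology = frozenset({"A", "B"})
NEW: Topology = frozenset({"A", "B", "C"})

rebalancing = ReplicationGroup(current_topology=OLD, ongoing=(OLD, NEW))
result_ongoing = rebalancing.handle_cancel(OLD, NEW)
result_idle_old = ReplicationGroup(current_topology=OLD).handle_cancel(OLD, NEW)
result_idle_new = ReplicationGroup(current_topology=NEW).handle_cancel(OLD, NEW)
```

Returning a distinct result for the third case lets the placement driver tell "cancelled" apart from "too late to cancel" and fall through to its normal rebalanceDone handling.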

  was:
Sometimes we must cancel the ongoing rebalance:
 * We can receive an unrecoverable error from replication group during the 
current rebalance
 * We can decide to cancel it manually

[!https://github.com/apache/ignite-3/raw/c276a334c4c3742520494bccdc957f4530c8ed7a/modules/distribution-zones/tech-notes/images/cancelRebalance.svg!|https://github.com/apache/ignite-3/blob/c276a334c4c3742520494bccdc957f4530c8ed7a/modules/distribution-zones/tech-notes/images/cancelRebalance.svg]


> Cancel rebalance mechanism
> --------------------------
>
>                 Key: IGNITE-19087
>                 URL: https://issues.apache.org/jira/browse/IGNITE-19087
>             Project: Ignite
>          Issue Type: Task
>            Reporter: Kirill Gusakov
>            Priority: Major
>              Labels: ignite-3



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
