[
https://issues.apache.org/jira/browse/IGNITE-26168?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Dmitry Pavlov updated IGNITE-26168:
-----------------------------------
Labels: ise (was: )
> Enhance partition loss detection between cluster restarts
> ----------------------------------------------------------
>
> Key: IGNITE-26168
> URL: https://issues.apache.org/jira/browse/IGNITE-26168
> Project: Ignite
> Issue Type: Task
> Reporter: Mikhail Petrov
> Assignee: Mikhail Petrov
> Priority: Major
> Labels: ise
> Fix For: 2.18
>
> Time Spent: 2h 40m
> Remaining Estimate: 0h
>
> The problem is based on a real-case scenario:
> 1. A cluster with PDS enabled is deactivated and stopped gracefully.
> 2. Some physical servers are replaced and their PDS is cleared during
> maintenance (this may also happen unintentionally or due to hardware
> issues).
> 3. The replaced servers host all primary and backup copies of some
> partitions (a cell). As a result, the data is lost.
> 4. Cluster is restarted.
> 5. Idle verify procedure completes successfully.
> 6. Cluster is activated successfully.
> As a result, Ignite continues to work after the restart, but some of the
> data has disappeared. Users see no warnings, and the data loss may only be
> discovered by accident some time later.
> The described situation can be safely avoided by replacing the nodes one by
> one and waiting for rebalancing to complete.
> But, as mentioned in clause 2, PDS data can be lost for different reasons.
> Currently, Ignite supports a mechanism for detecting lost partitions, which
> is designed to restrict cache operations when some cache partitions are lost
> (due to nodes leaving or failing). But its behaviour is not consistent across
> cluster restarts, activation and deactivation.
> Consider a cluster with PDS enabled. The following list shows the possible
> scenarios in which all owners of a partition (primary and backups) leave the
> cluster.
> 1. activation -> cell left -> lost parts
> 2. activation -> cell left -> cell joined -> lost parts
> 3. activation -> cell left -> deactivation -> cell joined -> activation ->
> ignored
> 4. activation -> cell left -> cell joined -> deactivation -> activation ->
> lost parts
> 5. activation -> cell left -> deactivation -> activation -> cell joined ->
> lost parts
> 6. deactivation -> cell left -> cell joined -> activation -> ignored
> 7. deactivation -> cell left -> activation -> cell joined -> lost parts
> cell - a node group that stores all primary and backup copies of a set of
> partitions. Can be configured via ClusterNodeAttributeColocatedBackupFilter.
> lost parts - Ignite detects the lost partitions. Cache operations are
> restricted according to the partition loss policy.
> ignored - no partition loss is detected. If cell nodes join the cluster
> with their PDS data cleared, Ignite does not detect the partition loss; it
> just recreates the missing partitions.
> deactivation - may also include a cluster stop after deactivation and a
> cluster start before activation.
> It is proposed to fix Ignite to detect lost partitions for clauses 3 and 6.
> Note that we are considering only the case when the cluster is stopped
> gracefully.
> The main idea:
> 1. During the PME caused by deactivation, aggregate on the coordinator the
> partition info and the list of lost partitions from all nodes.
> 2. Distribute the aggregated information using the PME Full Message and
> store it in each node's local metastorage.
> 3. During activation, use the stored info to detect lost partitions. If some
> partitions have zero update counters in the received single messages, but
> according to the saved partition info they were updated, mark them as lost.
> Partition Info includes a list of partition IDs that were not initialized
> (update counter == 0; this is crucial because currently Ignite can't
> distinguish between a partition that was never updated and one that was
> deleted between restarts) and a list of partition IDs that were marked as
> lost at the time of deactivation.
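
The detection rule in step 3 can be sketched in plain Java as follows. This is only an illustration of the idea under stated assumptions, not the actual Ignite implementation: the class and method names (LostPartitionSketch, SavedPartitionInfo, detectLost) are hypothetical, the counters reported in single messages are modelled as one map per node, and partitions already marked as lost at deactivation are assumed to stay lost after activation.

```java
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

class LostPartitionSketch {
    /** Hypothetical stand-in for the partition info saved to metastorage at deactivation. */
    static final class SavedPartitionInfo {
        final int partitions;                  // total number of partitions of the cache
        final Set<Integer> uninitialized;      // partitions whose update counter was 0 at deactivation
        final Set<Integer> lostAtDeactivation; // partitions already marked as lost at deactivation

        SavedPartitionInfo(int partitions, Set<Integer> uninitialized, Set<Integer> lostAtDeactivation) {
            this.partitions = partitions;
            this.uninitialized = uninitialized;
            this.lostAtDeactivation = lostAtDeactivation;
        }
    }

    /**
     * @param saved Info stored during deactivation.
     * @param countersPerNode Update counters reported by each node on activation
     *        (partition ID -> counter); a missing entry is treated as 0.
     * @return Partition IDs that should be marked as lost.
     */
    static Set<Integer> detectLost(SavedPartitionInfo saved, List<Map<Integer, Long>> countersPerNode) {
        // Assumption: partitions lost before deactivation remain lost.
        Set<Integer> lost = new TreeSet<>(saved.lostAtDeactivation);

        for (int p = 0; p < saved.partitions; p++) {
            // A zero counter is expected for partitions that were never updated.
            if (saved.uninitialized.contains(p))
                continue;

            long max = 0;

            for (Map<Integer, Long> node : countersPerNode)
                max = Math.max(max, node.getOrDefault(p, 0L));

            // The partition was updated before deactivation, but no node reports any updates now.
            if (max == 0)
                lost.add(p);
        }

        return lost;
    }

    public static void main(String[] args) {
        SavedPartitionInfo saved = new SavedPartitionInfo(4, Set.of(2), Set.of(3));

        List<Map<Integer, Long>> reported = List.of(
            Map.of(0, 5L, 1, 0L, 2, 0L), // node A
            Map.of(0, 5L, 1, 0L));       // node B

        System.out.println(detectLost(saved, reported)); // prints [1, 3]
    }
}
```

In this example partition 1 was updated before deactivation but comes back with a zero counter on every node, so it is reported as lost; partition 2 was never initialized, so its zero counter is not treated as a loss.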