Folks,

I've almost finished a patch bringing some improvements to the data loss
handling code, and I wish to discuss the proposed changes with the community
before submitting it.

*The issue*

During the grid's lifetime, it's possible to get into a situation where some
data nodes have failed or were mistakenly stopped. If the number of stopped
nodes exceeds a certain threshold, which depends on the configured backups
count, data loss will occur. For example, a grid with one backup (meaning at
least two copies of each data partition exist at the same time) can tolerate
the loss of only one node at a time. Generally, with the default affinity
function, data loss is guaranteed to occur if backups + 1 or more nodes have
failed simultaneously.

For in-memory caches, the data is lost forever. For persistent caches, the
data is not physically lost and becomes accessible again after the failed
nodes are returned to the topology.

Possible data loss should be taken into consideration while designing an
application.



*Consider an example: money is transferred from one deposit to another, and
all nodes holding the data for one of the deposits are gone. In such a case,
the transaction temporarily cannot be completed until the cluster recovers
from the data loss state. Ignoring this can cause data inconsistency.*
It is necessary to have an API telling us whether an operation is safe to
complete from the perspective of data loss.

Such an API has existed for some time [1] [2] [3]. In short, a grid can be
configured to switch caches to a partial availability mode if data loss is
detected.
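
To illustrate [1]-[3], here is a minimal sketch assuming a cache named
"deposits" and a plain try-with-resources node startup; the names and values
are illustrative only:

    import java.util.Collection;
    import java.util.Collections;

    import org.apache.ignite.Ignite;
    import org.apache.ignite.IgniteCache;
    import org.apache.ignite.Ignition;
    import org.apache.ignite.cache.PartitionLossPolicy;
    import org.apache.ignite.configuration.CacheConfiguration;

    public class LossPolicyExample {
        public static void main(String[] args) {
            try (Ignite ignite = Ignition.start()) {
                CacheConfiguration<Long, Double> ccfg = new CacheConfiguration<>("deposits");

                ccfg.setBackups(1);

                // Safe policy [1]: operations touching lost partitions fail
                // instead of silently working with an incomplete data set.
                ccfg.setPartitionLossPolicy(PartitionLossPolicy.READ_WRITE_SAFE);

                IgniteCache<Long, Double> cache = ignite.getOrCreateCache(ccfg);

                // After a data loss event the affected partitions are reported here [3].
                Collection<Integer> lost = cache.lostPartitions();

                // Once an owner for every lost partition is back in the topology,
                // full availability is restored explicitly [2].
                if (!lost.isEmpty())
                    ignite.resetLostPartitions(Collections.singleton("deposits"));
            }
        }
    }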

Let's give two definitions according to the Javadoc for
*PartitionLossPolicy*:

·   *Safe* (data loss handling) *policy* - cache operations are only
available for non-lost partitions (PartitionLossPolicy != IGNORE); see the
client-side sketch after these definitions.

·   *Unsafe policy* - cache operations are always possible
(PartitionLossPolicy = IGNORE). If the unsafe policy is configured, lost
partitions are automatically re-created on the remaining nodes if needed, or
immediately owned if the last supplier has left during rebalancing.
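
A small client-side sketch of the safe policy, assuming the same "deposits"
cache: before attempting an operation, a client can check whether a key maps
to a lost partition.

    import org.apache.ignite.Ignite;
    import org.apache.ignite.IgniteCache;

    public class LostKeyCheck {
        /** True if the key currently maps to a lost partition of the given cache. */
        static boolean isLost(Ignite ignite, IgniteCache<Long, Double> cache, Long key) {
            int part = ignite.affinity(cache.getName()).partition(key);

            return cache.lostPartitions().contains(part);
        }
    }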

*What needs to be fixed*

1. In the current implementation, the default loss policy is unsafe, even for
persistent caches. This can result in unintentional data loss and broken
business invariants.

2. Restarting nodes in a persistent grid with detected data loss automatically
resets the LOST state after the restart, even if the safe policy is
configured. This can result in data loss or partition desync if not all nodes
are returned to the topology, or if they are returned in the wrong order.


*An example: a grid has three nodes and one backup. The grid is under load.
First node2 leaves, and soon after node3 leaves. If node2 is returned to the
topology first, it will have stale data for some keys; the most recent data is
on node3, which is not in the topology yet. Because the lost state was reset,
all caches are fully available and will most probably become inconsistent,
even in safe mode.*
3. The configured loss policy doesn't provide the guarantees described in the
Javadoc for some cluster configurations [4]. In particular, the unsafe policy
(IGNORE) cannot be guaranteed if the baseline is fixed (not automatically
readjusted when a node leaves), because partitions are not automatically
reassigned on topology change and no nodes exist to fulfill a read/write
request. The same applies to READ_ONLY_ALL and READ_WRITE_ALL.

4. Calling resetLostPartitions doesn't guarantee full cache operation
availability if the topology doesn't have at least one owner for each lost
partition.

The ultimate goal of the patch is to fix the API inconsistencies and the most
crucial bugs related to data loss handling.

*The planned changes are:*

1. The safe policy is used by default, except for in-memory grids with
baseline auto-adjust [5] enabled with a zero timeout [6]; in the latter case,
the unsafe policy is used by default. This protects against unintentional data
loss.

2. The lost state is never reset when grid nodes are restarted (including a
full cluster restart). This makes real data loss impossible in persistent
grids, provided the recovery instructions are followed.

3. The lost state cannot be reset if the topology doesn't have at least one
owner for each lost partition. If nodes are physically dead, they should be
removed from the baseline before calling resetLostPartitions (see the sketch
after this list).

4. READ_WRITE_ALL and READ_ONLY_ALL are subject to deprecation because their
guarantees are impossible to fulfill on anything less than the full baseline.

5. Any operation that fails due to data loss has CacheInvalidStateException as
the root cause.
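
Below is a rough sketch of the recovery flow implied by items 3 and 5, under
the assumption that the failed owners are permanently dead; the helper method
names are mine, and the exception check goes by class name so it does not
depend on the package of CacheInvalidStateException.

    import java.util.Collections;

    import javax.cache.CacheException;

    import org.apache.ignite.Ignite;
    import org.apache.ignite.IgniteCache;

    public class DataLossRecovery {
        /** True if CacheInvalidStateException is somewhere in the cause chain (item 5). */
        static boolean failedDueToDataLoss(Throwable e) {
            for (Throwable t = e; t != null; t = t.getCause()) {
                if ("CacheInvalidStateException".equals(t.getClass().getSimpleName()))
                    return true;
            }

            return false;
        }

        static void putWithRecovery(Ignite ignite, IgniteCache<Long, Double> cache, Long key, Double val) {
            try {
                cache.put(key, val);
            }
            catch (CacheException e) {
                if (!failedDueToDataLoss(e))
                    throw e;

                // Item 3: if the failed owners are physically dead, shrink the
                // baseline to the surviving server nodes first...
                ignite.cluster().setBaselineTopology(ignite.cluster().forServers().nodes());

                // ...and only then reset the lost state; the reset is rejected while
                // some lost partition still has no owner in the topology.
                ignite.resetLostPartitions(Collections.singleton(cache.getName()));
            }
        }
    }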

In addition to the code fixes, I plan to write a tutorial for safe data loss
recovery in persistent mode in the Ignite wiki.

Any comments on the proposed changes are welcome.

[1] org.apache.ignite.configuration.CacheConfiguration#setPartitionLossPolicy(PartitionLossPolicy partLossPlc)
[2] org.apache.ignite.Ignite#resetLostPartitions(caches)
[3] org.apache.ignite.IgniteCache#lostPartitions
[4] https://issues.apache.org/jira/browse/IGNITE-10041
[5] org.apache.ignite.IgniteCluster#baselineAutoAdjustEnabled(boolean)
[6] org.apache.ignite.IgniteCluster#baselineAutoAdjustTimeout(long)
