Alexei,

1, 2, 4, 5 - look good to me, no objections here.

>> 3. Lost state is impossible to reset if a topology doesn't have at least
>> one owner for each lost partition.

Do you mean that, according to your example, where
>> a node2 has left, soon a node3 has left. If the node2 is returned to
>> the topology first, it would have stale data for some keys.
we have to have node2 in the cluster to be able to reset the "lost" state for node2's data?

>> at least one owner for each lost partition.
What is the reason to require owners for all lost partitions when we want to
reset only some of them (the available ones)?
Will it be possible to perform operations on non-lost partitions when the
cluster has at least one lost partition?

On Wed, May 6, 2020 at 11:45 AM Alexei Scherbakov <
alexey.scherbak...@gmail.com> wrote:

> Folks,
>
> I've almost finished a patch bringing some improvements to the data loss
> handling code, and I wish to discuss proposed changes with the community
> before submitting.
>
> *The issue*
>
> During the grid's lifetime, it's possible to get into a situation when some
> data nodes have failed or were mistakenly stopped. If the number of stopped
> nodes exceeds a certain threshold depending on the configured backups count,
> data loss will occur. For example, a grid having one backup (meaning at least
> two copies of each data partition exist at the same time) can tolerate only
> one node loss at a time. Generally, data loss is guaranteed to occur if
> backups + 1 or more nodes have failed simultaneously when using the default
> affinity function.
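>
> (A quick sketch to illustrate the threshold; the cache name "accounts" and
> the value types are arbitrary and not part of the patch.)
>
> import org.apache.ignite.cache.CacheMode;
> import org.apache.ignite.configuration.CacheConfiguration;
>
> CacheConfiguration<Integer, String> ccfg = new CacheConfiguration<>("accounts");
> ccfg.setCacheMode(CacheMode.PARTITIONED);
> // backups = 1 keeps two copies of each partition (primary + one backup),
> // so losing backups + 1 = 2 nodes at once can drop both copies of a partition.
> ccfg.setBackups(1);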
>
> For in-memory caches, data is lost forever. For persistent caches, data is
> not physically lost and becomes accessible again after the failed nodes are
> returned to the topology.
>
> Possible data loss should be taken into consideration while designing an
> application.
>
>
>
> *Consider an example: money is transferred from one deposit to another, and
> all nodes holding data for one of the deposits are gone. In such a case, the
> transaction temporarily cannot be completed until the cluster is recovered
> from the data loss state. Ignoring this can cause data inconsistency.*
> It is necessary to have an API telling us if an operation is safe to
> complete from the perspective of data loss.
>
> Such an API has existed for some time [1] [2] [3]. In short, a grid can be
> configured to switch caches to the partial availability mode if data loss
> is detected.
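>
> A minimal configuration sketch of [1] (the cache name "deposits" is
> arbitrary; READ_WRITE_SAFE is one of the existing safe policies):
>
> import org.apache.ignite.cache.PartitionLossPolicy;
> import org.apache.ignite.configuration.CacheConfiguration;
>
> CacheConfiguration<Long, Long> ccfg = new CacheConfiguration<>("deposits");
> ccfg.setBackups(1);
> // Safe policy: operations touching lost partitions fail instead of silently
> // proceeding with incomplete data.
> ccfg.setPartitionLossPolicy(PartitionLossPolicy.READ_WRITE_SAFE);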
>
> Let's give two definitions according to the Javadoc for
> *PartitionLossPolicy*:
>
> ·   *Safe* (data loss handling) *policy* - cache operations are only
> available for non-lost partitions (PartitionLossPolicy != IGNORE).
>
> ·   *Unsafe policy* - cache operations are always possible
> (PartitionLossPolicy = IGNORE). If the unsafe policy is configured, lost
> partitions are automatically re-created on the remaining nodes if needed, or
> immediately owned if the last supplier has left during rebalancing.
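>
> For illustration, a hedged sketch of how a caller can observe the safe-policy
> behavior via [3] (it assumes "ignite" is a started Ignite instance and the
> cache is the "deposits" cache from the sketch above; the cause-chain check
> reflects the behavior that planned change 5 below makes explicit):
>
> import javax.cache.CacheException;
> import org.apache.ignite.IgniteCache;
>
> IgniteCache<Long, Long> cache = ignite.cache("deposits");
>
> // Partitions currently marked as LOST for this cache.
> System.out.println("Lost partitions: " + cache.lostPartitions());
>
> try {
>     cache.put(1L, 100L);
> }
> catch (CacheException e) {
>     // With a safe policy, an operation touching a lost partition fails;
>     // the root cause is expected to be CacheInvalidStateException.
>     for (Throwable t = e; t != null; t = t.getCause())
>         if ("CacheInvalidStateException".equals(t.getClass().getSimpleName()))
>             System.err.println("Hit a lost partition: " + t.getMessage());
> }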
>
> *What needs to be fixed*
>
> 1. In the current implementation, the default loss policy is unsafe, even
> for persistent caches. This can result in unintentional data loss and broken
> business invariants.
>
> 2. Node restarts in a persistent grid with detected data loss will cause the
> LOST state to be automatically reset after the restart, even if the safe
> policy is configured. This can result in data loss or partition desync if
> not all nodes are returned to the topology, or if they are returned in the
> wrong order.
>
>
> *An example: a grid has three nodes, one backup. The grid is under load.
> First, a node2 has left, soon a node3 has left. If the node2 is returned to
> the topology first, it would have stale data for some keys. The most recent
> data is on node3, which is not in the topology yet. Because the lost state
> was reset, all caches are fully available and most probably will become
> inconsistent even in safe mode.*
> 3. The configured loss policy doesn't provide the guarantees described in
> the Javadoc, depending on the cluster configuration [4]. In particular, the
> unsafe policy (IGNORE) cannot be guaranteed if the baseline is fixed (not
> automatically readjusted when a node leaves), because partitions do not
> automatically get reassigned on topology change and there may be no nodes
> left to fulfill a read/write request. The same applies to READ_ONLY_ALL and
> READ_WRITE_ALL.
>
> 4. Calling resetLostPartitions doesn't guarantee full cache operation
> availability if the topology doesn't have at least one owner for each lost
> partition.
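>
> A sketch of the intended usage of [2] and [3] once all former owners of the
> lost partitions are back in the topology (assumes "ignite" is a started
> Ignite instance; the cache names are arbitrary):
>
> import java.util.Arrays;
>
> // Every lost partition has at least one owner again, so the reset is allowed.
> ignite.resetLostPartitions(Arrays.asList("deposits", "accounts"));
>
> // After the reset, full availability can be verified per cache:
> if (ignite.cache("deposits").lostPartitions().isEmpty())
>     System.out.println("All partitions of 'deposits' are available again.");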
>
> The ultimate goal of the patch is to fix API inconsistencies and fix the
> most crucial bugs related to data loss handling.
>
> *The planned changes are:*
>
> 1. The safe policy is used by default, except for in-memory grids with
> baseline auto-adjust enabled [5] and a zero timeout [6]. In the latter case,
> the unsafe policy is used by default. This protects against unintentional
> data loss.
>
> 2. The lost state is never reset when grid nodes are restarted (even after a
> full restart). This makes real data loss impossible in persistent grids if
> the recovery instructions are followed.
>
> 3. The lost state cannot be reset if the topology doesn't have at least one
> owner for each lost partition. If nodes are physically dead, they should be
> removed from the baseline before calling resetLostPartitions (see the sketch
> after this list).
>
> 4. READ_WRITE_ALL and READ_ONLY_ALL are subject to deprecation because their
> guarantees are impossible to fulfill when the full baseline is not present.
>
> 5. Any operation that failed due to data loss contains
> CacheInvalidStateException as the root cause.
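>
> A sketch of the recovery sequence implied by change 3 when some nodes are
> physically dead (assumes "ignite" is a started Ignite instance and the
> "deposits" cache from the earlier sketch; whether a node is really gone for
> good is, of course, an operator's decision):
>
> import java.util.Collection;
> import java.util.Collections;
> import org.apache.ignite.cluster.ClusterNode;
>
> // 1. Shrink the baseline to the server nodes that are actually alive,
> //    dropping the permanently dead ones.
> Collection<ClusterNode> alive = ignite.cluster().forServers().nodes();
> ignite.cluster().setBaselineTopology(alive);
>
> // 2. Only then reset the lost state; every lost partition now has at least
> //    one owner among the remaining baseline nodes.
> ignite.resetLostPartitions(Collections.singleton("deposits"));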
>
> In addition to code fixes, I plan to write a tutorial for safe data loss
> recovery in the persistent mode in the Ignite wiki.
>
> Any comments for the proposed changes are welcome.
>
> [1]
>
> org.apache.ignite.configuration.CacheConfiguration#setPartitionLossPolicy(PartitionLossPolicy
> partLossPlc)
> [2] org.apache.ignite.Ignite#resetLostPartitions(caches)
> [3] org.apache.ignite.IgniteCache#lostPartitions
> [4]  https://issues.apache.org/jira/browse/IGNITE-10041
> [5] org.apache.ignite.IgniteCluster#baselineAutoAdjustEnabled(boolean)
> [6] org.apache.ignite.IgniteCluster#baselineAutoAdjustTimeout(long)
>
