Wed, 6 May 2020 at 12:54, Anton Vinogradov <a...@apache.org>:
> Alexei,
>
> 1, 2, 4, 5 - look good to me, no objections here.
>
> >> 3. Lost state is impossible to reset if a topology doesn't have at least
> >> one owner for each lost partition.
>
> Do you mean that, according to your example, where
> >> a node2 has left, soon a node3 has left. If the node2 is returned to
> >> the topology first, it would have stale data for some keys.
> we have to have node2 in the cluster to be able to reset "lost" to node2's
> data?

Not sure if I understand the question, but I'll try to answer using an example. Assume 3 nodes n1, n2, n3, 1 backup, persistence enabled, and partition p owned by n2 and n3.

1. The topology is activated.
2. cache.put(p, 0) // n2 and n3 have p->0, updateCounter=1
3. n2 fails.
4. cache.put(p, 1) // n3 has p->1, updateCounter=2
5. n3 fails; partition loss happens.
6. n2 joins the topology; it has stale data (p->0).

We actually have 2 issues:

7. cache.put(p, 2) will succeed: n2 has p->2, n3 has p->0. The data has diverged and will not be reconciled by counter-based rebalancing if n3 later joins the topology.

or

8. n3 joins the topology; it has the actual data (p->1), but rebalancing will not work because the joining node has the highest counter (it can only be a demander in this scenario).

In both cases rebalancing by counters will not work, causing data divergence between copies.

> >> at least one owner for each lost partition.
> What is the reason to have owners for all lost partitions when we want to
> reset only some (available)?

It was never possible to reset only a subset of lost partitions. The reason is to keep the guarantee of the resetLostPartitions method: all cache operations are resumed, and the data is correct.

> Will it be possible to perform operations on non-lost partitions when the
> cluster has at least one lost partition?

Yes, it will be.
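For illustration, the divergence in step 8 above can be modeled with a toy sketch (plain Java, no Ignite dependency; the Node class and field names are illustrative, not Ignite internals):

```java
/**
 * Toy model of per-partition update counters, showing why counter-based
 * rebalancing cannot reconcile the n2/n3 scenario described above.
 */
public class CounterDivergence {

    static class Node {
        final String name;
        Integer value;   // value of partition p on this node
        long counter;    // partition update counter
        Node(String name) { this.name = name; }
    }

    public static void main(String[] args) {
        Node n2 = new Node("n2"), n3 = new Node("n3");

        // Step 2: cache.put(p, 0) - both owners apply the update.
        n2.value = 0; n2.counter = 1;
        n3.value = 0; n3.counter = 1;

        // Steps 3-4: n2 fails, cache.put(p, 1) - only n3 applies it.
        n3.value = 1; n3.counter = 2;

        // Steps 5-6: n3 fails (partition loss), n2 returns with stale data.
        // Step 8: n3 rejoins. Counter-based rebalancing would pick the node
        // with the highest counter as the supplier - but that is the joining
        // node n3, which can only act as a demander, so nothing is transferred.
        Node highest = n3.counter > n2.counter ? n3 : n2;

        System.out.println("highest counter on: " + highest.name); // n3
        System.out.println("n2=" + n2.value + ", n3=" + n3.value); // n2=0, n3=1
        // Copies stay diverged: n2 holds the stale 0, n3 holds the actual 1.
    }
}
```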
>
> On Wed, May 6, 2020 at 11:45 AM Alexei Scherbakov <
> alexey.scherbak...@gmail.com> wrote:
>
> > Folks,
> >
> > I've almost finished a patch bringing some improvements to the data loss
> > handling code, and I wish to discuss the proposed changes with the
> > community before submitting.
> >
> > *The issue*
> >
> > During the grid's lifetime, it is possible to get into a situation where
> > some data nodes have failed or were mistakenly stopped. If the number of
> > stopped nodes exceeds a certain threshold depending on the configured
> > backup count, data loss will occur. For example, a grid having one
> > backup (meaning at least two copies of each data partition exist at the
> > same time) can tolerate the loss of only one node at a time. Generally,
> > data loss is guaranteed to occur if backups + 1 or more nodes have
> > failed simultaneously with the default affinity function.
> >
> > For in-memory caches, the data is lost forever. For persistent caches,
> > the data is not physically lost and becomes accessible again after the
> > failed nodes are returned to the topology.
> >
> > Possible data loss should be taken into consideration while designing
> > an application.
> >
> > *Consider an example: money is transferred from one deposit to another,
> > and all nodes holding data for one of the deposits are gone. In such a
> > case, the transaction temporarily cannot be completed until the cluster
> > is recovered from the data loss state. Ignoring this can cause data
> > inconsistency.*
> >
> > It is necessary to have an API telling us if an operation is safe to
> > complete from the perspective of data loss.
> >
> > Such an API has existed for some time [1] [2] [3]. In short, a grid can
> > be configured to switch caches to a partial availability mode if data
> > loss is detected.
> > Let's give two definitions according to the Javadoc for
> > *PartitionLossPolicy*:
> >
> > · *Safe* (data loss handling) *policy* - cache operations are only
> > available for non-lost partitions (PartitionLossPolicy != IGNORE).
> >
> > · *Unsafe policy* - cache operations are always possible
> > (PartitionLossPolicy = IGNORE). If the unsafe policy is configured,
> > lost partitions are automatically re-created on the remaining nodes if
> > needed, or immediately owned if the last supplier has left during
> > rebalancing.
> >
> > *What needs to be fixed*
> >
> > 1. The default loss policy is unsafe in the current implementation,
> > even for persistent caches. It can result in unintentional data loss
> > and the violation of business invariants.
> >
> > 2. Node restarts in a persistent grid with detected data loss will
> > cause an automatic reset of the LOST state after the restart, even if
> > the safe policy is configured. It can result in data loss or partition
> > desync if not all nodes are returned to the topology, or if they are
> > returned in the wrong order.
> >
> > *An example: a grid has three nodes, one backup. The grid is under
> > load. First, node2 leaves, and soon node3 leaves. If node2 is returned
> > to the topology first, it will have stale data for some keys. The most
> > recent data is on node3, which is not in the topology yet. Because the
> > lost state was reset, all caches are fully available and will most
> > probably become inconsistent even in safe mode.*
> >
> > 3. The configured loss policy doesn't provide the guarantees described
> > in the Javadoc, depending on the cluster configuration [4]. In
> > particular, the unsafe policy (IGNORE) cannot be guaranteed if the
> > baseline is fixed (not automatically readjusted when a node leaves),
> > because partitions do not automatically get reassigned on topology
> > change, and no nodes exist to fulfill a read/write request. The same
> > holds for READ_ONLY_ALL and READ_WRITE_ALL.
> >
> > 4.
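The safe/unsafe distinction above maps directly to the cache configuration API referenced in [1]. A minimal sketch of opting into a safe policy, assuming a hypothetical cache named "deposits" (the cache name and generic types are illustrative):

```java
import org.apache.ignite.cache.PartitionLossPolicy;
import org.apache.ignite.configuration.CacheConfiguration;

CacheConfiguration<Integer, String> ccfg = new CacheConfiguration<>("deposits");

// Safe policy: reads and writes against lost partitions fail until
// Ignite#resetLostPartitions(...) is called; non-lost partitions stay usable.
ccfg.setPartitionLossPolicy(PartitionLossPolicy.READ_WRITE_SAFE);

// The unsafe behavior described above corresponds to PartitionLossPolicy.IGNORE.
```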
> > Calling resetLostPartitions doesn't guarantee full cache operation
> > availability if the topology doesn't have at least one owner for each
> > lost partition.
> >
> > The ultimate goal of the patch is to fix API inconsistencies and the
> > most crucial bugs related to data loss handling.
> >
> > *The planned changes are:*
> >
> > 1. The safe policy is used by default, except for in-memory grids with
> > baseline auto-adjust enabled [5] with zero timeout [6]. In the latter
> > case, the unsafe policy is used by default. This protects from
> > unintentional data loss.
> >
> > 2. The lost state is never reset when grid nodes restart (even after a
> > full restart). This makes real data loss impossible in persistent grids
> > if the recovery instruction is followed.
> >
> > 3. The lost state is impossible to reset if the topology doesn't have
> > at least one owner for each lost partition. If nodes are physically
> > dead, they should first be removed from the baseline before calling
> > resetLostPartitions.
> >
> > 4. READ_WRITE_ALL and READ_ONLY_ALL are subject to deprecation because
> > their guarantees are impossible to fulfill when not on the full
> > baseline.
> >
> > 5. Any operation failed due to data loss contains
> > CacheInvalidStateException as a root cause.
> >
> > In addition to the code fixes, I plan to write a tutorial for safe data
> > loss recovery in persistent mode in the Ignite wiki.
> >
> > Any comments on the proposed changes are welcome.
> >
> > [1]
> > org.apache.ignite.configuration.CacheConfiguration#setPartitionLossPolicy(PartitionLossPolicy
> > partLossPlc)
> > [2] org.apache.ignite.Ignite#resetLostPartitions(caches)
> > [3] org.apache.ignite.IgniteCache#lostPartitions
> > [4] https://issues.apache.org/jira/browse/IGNITE-10041
> > [5] org.apache.ignite.IgniteCluster#baselineAutoAdjustEnabled(boolean)
> > [6] org.apache.ignite.IgniteCluster#baselineAutoAdjustTimeout(long)
>
> --
> Best regards,
> Alexei Scherbakov
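Putting points 2-3 and the APIs in [2], [3] together, the recovery flow implied above can be sketched as follows. This is a hedged sketch, not the patch itself: the cache name "deposits" is an assumption, the baseline-adjustment line is shown commented out because excluding dead nodes is a destructive, deployment-specific step, and it requires a running cluster to execute.

```java
import java.util.Collections;
import org.apache.ignite.Ignite;

public class LossRecoverySketch {
    /** Sketch of recovering from partition loss under the safe policy. */
    static void recover(Ignite ignite) {
        // Inspect which partitions were declared lost [3].
        System.out.println("lost: " + ignite.cache("deposits").lostPartitions());

        // Per planned change 3: either return the failed nodes to the
        // topology, or, if they are physically dead, remove them from the
        // baseline first, e.g.:
        // ignite.cluster().setBaselineTopology(ignite.cluster().forServers().nodes());

        // Only then reset the lost state to resume cache operations [2].
        ignite.resetLostPartitions(Collections.singleton("deposits"));
    }
}
```

Note the ordering: calling resetLostPartitions while some lost partition still has no owner in the topology is exactly what planned change 3 forbids.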