I have not had a chance to read this thread carefully, but the following discussion sounds very similar and might be useful [1].
[1] http://apache-ignite-developers.2346864.n4.nabble.com/Partition-Loss-Policies-issues-td37304.html

Best regards,
Ivan Pavlukhin

Thu, 7 May 2020 at 11:36, Alexei Scherbakov <alexey.scherbak...@gmail.com>:
>
> Yes, it will work this way.
>
> Thu, 7 May 2020 at 10:43, Anton Vinogradov <a...@apache.org>:
>
> > Seems I've got the idea, thanks.
> > There should be only two ways to reset a lost partition: either gain an owner from the resurrected nodes first, or remove the former owner from the baseline (the partition will then be reassigned).
> > And we should make a decision for every lost partition before calling the reset.
> >
> > On Wed, May 6, 2020 at 8:02 PM Alexei Scherbakov <alexey.scherbak...@gmail.com> wrote:
> >
> > > Wed, 6 May 2020 at 12:54, Anton Vinogradov <a...@apache.org>:
> > >
> > > > Alexei,
> > > >
> > > > 1, 2, 4, 5 - look good to me, no objections here.
> > > >
> > > > >> 3. Lost state is impossible to reset if a topology doesn't have at least one owner for each lost partition.
> > > >
> > > > Do you mean that, according to your example, where
> > > > >> a node2 has left, soon a node3 has left. If the node2 is returned to the topology first, it would have stale data for some keys.
> > > > we have to have node2 in the cluster to be able to reset "lost" to node2's data?
> > > >
> > >
> > > Not sure if I understand the question, but I'll try to answer using an example.
> > > Assume 3 nodes n1, n2, n3, 1 backup, persistence enabled; partition p is owned by n2 and n3.
> > > 1. Topology is activated.
> > > 2. cache.put(p, 0) // n2 and n3 have p->0, updateCounter=1
> > > 3. n2 has failed.
> > > 4. cache.put(p, 1) // n3 has p->1, updateCounter=2
> > > 5. n3 has failed, partition loss has happened.
> > > 6. n2 joins the topology; it has stale data (p->0).
> > >
> > > We actually have 2 issues:
> > > 7. cache.put(p, 2) will succeed; n2 has p->2, n3 has p->0, the data has diverged and will not be adjusted by counter-based rebalancing if n3 later joins the topology.
> > > or
> > > 8. n3 joins the topology; it has the most recent data (p->1), but rebalancing will not work because the joining node has the highest counter (it can only be a demander in this scenario).
> > >
> > > In both cases rebalancing by counters will not work, causing data divergence between the copies.
> > >
> > > > >> at least one owner for each lost partition.
> > > > What is the reason to have owners for all lost partitions when we want to reset only some (the available ones)?
> > > >
> > >
> > > It was never possible to reset only a subset of lost partitions. The reason is to keep the guarantee of the resetLostPartitions method: all cache operations are resumed and the data is correct.
> > >
> > > > Will it be possible to perform operations on non-lost partitions when the cluster has at least one lost partition?
> > >
> > > Yes, it will be.
> > >
> > > > On Wed, May 6, 2020 at 11:45 AM Alexei Scherbakov <alexey.scherbak...@gmail.com> wrote:
> > > >
> > > > > Folks,
> > > > >
> > > > > I've almost finished a patch bringing some improvements to the data loss handling code, and I wish to discuss the proposed changes with the community before submitting.
> > > > >
> > > > > *The issue*
> > > > >
> > > > > During the grid's lifetime, it's possible to get into a situation where some data nodes have failed or were mistakenly stopped.
> > > > > If the number of stopped nodes exceeds a certain threshold (which depends on the configured backup count), data loss will occur. For example, a grid having one backup (meaning at least two copies of each data partition exist at the same time) can tolerate only one node loss at a time. Generally, data loss is guaranteed to occur if backups + 1 or more nodes have failed simultaneously when using the default affinity function.
> > > > >
> > > > > For in-memory caches, the data is lost forever. For persistent caches, the data is not physically lost and becomes accessible again after the failed nodes are returned to the topology.
> > > > >
> > > > > Possible data loss should be taken into consideration while designing an application.
> > > > >
> > > > > *Consider an example: money is transferred from one deposit to another, and all nodes holding data for one of the deposits are gone. In such a case, the transaction temporarily cannot be completed until the cluster has recovered from the data loss state. Ignoring this can cause data inconsistency.*
> > > > >
> > > > > It is necessary to have an API telling us whether an operation is safe to complete from the perspective of data loss.
> > > > >
> > > > > Such an API has existed for some time [1] [2] [3]. In short, a grid can be configured to switch caches to a partial availability mode if data loss is detected.
> > > > >
> > > > > Let's give two definitions according to the Javadoc for *PartitionLossPolicy*:
> > > > >
> > > > > · *Safe* (data loss handling) *policy* - cache operations are only available for non-lost partitions (PartitionLossPolicy != IGNORE).
> > > > >
> > > > > · *Unsafe policy* - cache operations are always possible (PartitionLossPolicy = IGNORE). If the unsafe policy is configured, lost partitions are automatically re-created on the remaining nodes if needed, or immediately owned if the last supplier has left during rebalancing.
> > > > >
> > > > > *What needs to be fixed*
> > > > >
> > > > > 1. The default loss policy is unsafe in the current implementation, even for persistent caches. It can result in unintentional data loss and in the violation of business invariants.
> > > > >
> > > > > 2. Node restarts in a persistent grid with detected data loss cause the LOST state to be automatically reset after the restart, even if the safe policy is configured. It can result in data loss or partition desync if not all nodes are returned to the topology, or if they are returned in the wrong order.
> > > > >
> > > > > *An example: a grid has three nodes and one backup. The grid is under load. First, node2 leaves; soon after, node3 leaves. If node2 is returned to the topology first, it has stale data for some keys. The most recent data is on node3, which is not in the topology yet. Because the lost state was reset, all caches are fully available and will most probably become inconsistent even in the safe mode.*
> > > > >
> > > > > 3. The configured loss policy doesn't provide the guarantees described in the Javadoc, depending on the cluster configuration [4].
> > > > > In particular, the unsafe policy (IGNORE) cannot be guaranteed if the baseline is fixed (not automatically readjusted when a node leaves), because partitions are not automatically reassigned on topology change and no nodes exist to fulfill a read/write request. The same applies to READ_ONLY_ALL and READ_WRITE_ALL.
> > > > >
> > > > > 4. Calling resetLostPartitions doesn't guarantee full cache operation availability if the topology doesn't have at least one owner for each lost partition.
> > > > >
> > > > > The ultimate goal of the patch is to fix API inconsistencies and the most crucial bugs related to data loss handling.
> > > > >
> > > > > *The planned changes are:*
> > > > >
> > > > > 1. The safe policy is used by default, except for in-memory grids with baseline auto-adjust enabled [5] and a zero timeout [6]; in the latter case, the unsafe policy is used by default. This protects against unintentional data loss.
> > > > >
> > > > > 2. The lost state is never reset when grid nodes restart (even on a full restart). This makes real data loss impossible in persistent grids if the recovery instruction is followed.
> > > > >
> > > > > 3. The lost state cannot be reset if the topology doesn't have at least one owner for each lost partition. If the nodes are physically dead, they should be removed from the baseline before calling resetLostPartitions.
> > > > >
> > > > > 4. READ_WRITE_ALL and READ_ONLY_ALL are subject to deprecation because their guarantees are impossible to fulfill when the cluster is not on the full baseline.
> > > > >
> > > > > 5. Any operation that fails due to data loss contains CacheInvalidStateException as the root cause.
> > > > >
> > > > > In addition to the code fixes, I plan to write a tutorial on safe data loss recovery in persistent mode for the Ignite wiki.
> > > > >
> > > > > Any comments on the proposed changes are welcome.
> > > > >
> > > > > [1] org.apache.ignite.configuration.CacheConfiguration#setPartitionLossPolicy(PartitionLossPolicy partLossPlc)
> > > > > [2] org.apache.ignite.Ignite#resetLostPartitions(caches)
> > > > > [3] org.apache.ignite.IgniteCache#lostPartitions
> > > > > [4] https://issues.apache.org/jira/browse/IGNITE-10041
> > > > > [5] org.apache.ignite.IgniteCluster#baselineAutoAdjustEnabled(boolean)
> > > > > [6] org.apache.ignite.IgniteCluster#baselineAutoAdjustTimeout(long)
> > >
> > > --
> > >
> > > Best regards,
> > > Alexei Scherbakov
>
> --
>
> Best regards,
> Alexei Scherbakov
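
For readers skimming the archive, here is a minimal sketch of how the existing public API [1] [3] is used to opt into the safe policy discussed in the quoted proposal. It is only an illustration, not part of the patch; the cache name "deposits", the single started node, and READ_WRITE_SAFE as the chosen safe policy are assumptions made for brevity.

    import org.apache.ignite.Ignite;
    import org.apache.ignite.IgniteCache;
    import org.apache.ignite.Ignition;
    import org.apache.ignite.cache.PartitionLossPolicy;
    import org.apache.ignite.configuration.CacheConfiguration;
    import org.apache.ignite.configuration.IgniteConfiguration;

    public class SafeLossPolicyExample {
        public static void main(String[] args) {
            // Cache with one backup and a safe loss policy: reads and writes to
            // lost partitions fail instead of silently touching stale or absent data.
            CacheConfiguration<Integer, Long> depositsCfg = new CacheConfiguration<>("deposits");
            depositsCfg.setBackups(1);
            depositsCfg.setPartitionLossPolicy(PartitionLossPolicy.READ_WRITE_SAFE);

            IgniteConfiguration cfg = new IgniteConfiguration();
            cfg.setCacheConfiguration(depositsCfg);

            try (Ignite ignite = Ignition.start(cfg)) {
                IgniteCache<Integer, Long> deposits = ignite.cache("deposits");

                deposits.put(1, 100L);

                // Non-empty only after partition loss has been detected on this topology.
                System.out.println("Lost partitions: " + deposits.lostPartitions());
            }
        }
    }

(The zero-timeout baseline auto-adjust case mentioned in planned change 1 corresponds to calling ignite.cluster().baselineAutoAdjustEnabled(true) and ignite.cluster().baselineAutoAdjustTimeout(0) [5] [6].)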
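A similar sketch of the recovery flow implied by planned changes 2 and 3 and by the discussion at the top of the thread: either return the failed owners to the topology, or remove them from the baseline and then reset. Again, this is only an illustration against the current public API [2] [3]; the client node, the cache name "deposits", and the decision that the failed owners are physically dead (so the baseline is shrunk to the currently alive server nodes) are assumptions.

    import java.util.Collection;
    import java.util.Collections;

    import org.apache.ignite.Ignite;
    import org.apache.ignite.IgniteCache;
    import org.apache.ignite.Ignition;
    import org.apache.ignite.cluster.ClusterNode;
    import org.apache.ignite.configuration.IgniteConfiguration;

    public class ResetLostPartitionsSketch {
        public static void main(String[] args) {
            // Join the running cluster as a client node (assumes default discovery).
            Ignite ignite = Ignition.start(new IgniteConfiguration().setClientMode(true));

            IgniteCache<Integer, Long> deposits = ignite.cache("deposits");

            // Partitions currently marked LOST for this cache.
            Collection<Integer> lost = deposits.lostPartitions();

            if (!lost.isEmpty()) {
                // Assumption: the failed owners are physically dead, so they are removed
                // from the baseline by keeping only the server nodes that are alive now.
                // (The alternative is to return the owners to the topology first.)
                Collection<ClusterNode> alive = ignite.cluster().forServers().nodes();
                ignite.cluster().setBaselineTopology(alive);

                // With the proposed changes, the reset is rejected unless every lost
                // partition has at least one owner in the resulting topology.
                ignite.resetLostPartitions(Collections.singleton("deposits"));
            }
        }
    }

If the failed owners can be brought back instead, simply returning them to the topology and then calling resetLostPartitions is the safer option, as discussed above.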