I have not had a chance to read this thread carefully, but the following discussion sounds very similar and might be useful [1].
[1] http://apache-ignite-developers.2346864.n4.nabble.com/Partition-Loss-Policies-issues-td37304.html

Best regards,
Ivan Pavlukhin

Thu, 7 May 2020 at 11:36, Alexei Scherbakov <alexey.scherbak...@gmail.com>:
>
> Yes, it will work this way.
>
> Thu, 7 May 2020 at 10:43, Anton Vinogradov <a...@apache.org>:
>
> > Seems I've got the idea, thanks.
> > There should be only two ways to reset a lost partition: either gain an owner from the resurrected nodes first, or remove the former owner from the baseline (the partition will then be reassigned).
> > And we should make a decision for every lost partition before calling the reset.
> >
> > On Wed, May 6, 2020 at 8:02 PM Alexei Scherbakov <alexey.scherbak...@gmail.com> wrote:
> >
> > > Wed, 6 May 2020 at 12:54, Anton Vinogradov <a...@apache.org>:
> > >
> > > > Alexei,
> > > >
> > > > 1, 2, 4, 5 - look good to me, no objections here.
> > > >
> > > > >> 3. Lost state is impossible to reset if a topology doesn't have at least one owner for each lost partition.
> > > >
> > > > Do you mean that, according to your example, where
> > > > >> a node2 has left, soon a node3 has left. If the node2 is returned to the topology first, it would have stale data for some keys.
> > > > we have to have node2 in the cluster to be able to reset "lost" to node2's data?
> > > >
> > >
> > > Not sure if I understand the question, but I'll try to answer using an example.
> > > Assume 3 nodes n1, n2, n3, 1 backup, persistence enabled; partition p is owned by n2 and n3.
> > > 1. Topology is activated.
> > > 2. cache.put(p, 0) // n2 and n3 have p->0, updateCounter=1
> > > 3. n2 has failed.
> > > 4. cache.put(p, 1) // n3 has p->1, updateCounter=2
> > > 5. n3 has failed, partition loss has happened.
> > > 6. n2 joins the topology; it has stale data (p->0).
> > >
> > > We actually have 2 issues:
> > > 7. cache.put(p, 2) will succeed; n2 has p->2, n3 has p->0, the data has diverged and will not be adjusted by counter-based rebalancing if n3 later joins the topology.
> > > or
> > > 8. n3 joins the topology; it has the most recent data (p->1), but rebalancing will not work because the joining node has the highest counter (it can only be a demander in this scenario).
> > >
> > > In both cases rebalancing by counters will not work, causing data divergence between the copies.
> > >
> > > > >> at least one owner for each lost partition.
> > > > What is the reason to have owners for all lost partitions when we want to reset only some (the available ones)?
> > > >
> > >
> > > It was never possible to reset only a subset of lost partitions. The reason is to keep the guarantee of the resetLostPartitions method: all cache operations are resumed and the data is correct.
> > >
> > > > Will it be possible to perform operations on non-lost partitions when the cluster has at least one lost partition?
> > >
> > > Yes, it will be.
> > >
> > > > On Wed, May 6, 2020 at 11:45 AM Alexei Scherbakov <alexey.scherbak...@gmail.com> wrote:
> > > >
> > > > > Folks,
> > > > >
> > > > > I've almost finished a patch bringing some improvements to the data loss handling code, and I wish to discuss the proposed changes with the community before submitting.
> > > > >
> > > > > *The issue*
> > > > >
> > > > > During the grid's lifetime, it's possible to get into a situation where some data nodes have failed or were mistakenly stopped.
> > > > > If the number of stopped nodes exceeds a certain threshold (which depends on the configured backup count), data loss will occur. For example, a grid having one backup (meaning at least two copies of each data partition exist at the same time) can tolerate only one node loss at a time. Generally, data loss is guaranteed to occur if backups + 1 or more nodes have failed simultaneously when using the default affinity function.
> > > > >
> > > > > For in-memory caches, the data is lost forever. For persistent caches, the data is not physically lost and becomes accessible again after the failed nodes are returned to the topology.
> > > > >
> > > > > Possible data loss should be taken into consideration while designing an application.
> > > > >
> > > > > *Consider an example: money is transferred from one deposit to another, and all nodes holding data for one of the deposits are gone. In such a case, the transaction temporarily cannot be completed until the cluster has recovered from the data loss state. Ignoring this can cause data inconsistency.*
> > > > >
> > > > > It is necessary to have an API telling us whether an operation is safe to complete from the perspective of data loss.
> > > > >
> > > > > Such an API has existed for some time [1] [2] [3]. In short, a grid can be configured to switch caches to a partial availability mode if data loss is detected.
> > > > >
> > > > > Let's give two definitions according to the Javadoc for *PartitionLossPolicy*:
> > > > >
> > > > > · *Safe* (data loss handling) *policy* - cache operations are only available for non-lost partitions (PartitionLossPolicy != IGNORE).
> > > > >
> > > > > · *Unsafe policy* - cache operations are always possible (PartitionLossPolicy = IGNORE). If the unsafe policy is configured, lost partitions are automatically re-created on the remaining nodes if needed, or immediately owned if the last supplier has left during rebalancing.
> > > > >
> > > > > *What needs to be fixed*
> > > > >
> > > > > 1. The default loss policy is unsafe in the current implementation, even for persistent caches. It can result in unintentional data loss and in the violation of business invariants.
> > > > >
> > > > > 2. Node restarts in a persistent grid with detected data loss cause the LOST state to be automatically reset after the restart, even if the safe policy is configured. It can result in data loss or partition desync if not all nodes are returned to the topology, or if they are returned in the wrong order.
> > > > >
> > > > > *An example: a grid has three nodes and one backup. The grid is under load. First, node2 leaves; soon after, node3 leaves. If node2 is returned to the topology first, it has stale data for some keys. The most recent data is on node3, which is not in the topology yet. Because the lost state was reset, all caches are fully available and will most probably become inconsistent even in the safe mode.*
> > > > >
> > > > > 3. The configured loss policy doesn't provide the guarantees described in the Javadoc, depending on the cluster configuration [4].
> > > > > In particular, the unsafe policy (IGNORE) cannot be guaranteed if the baseline is fixed (not automatically readjusted when a node leaves), because partitions are not automatically reassigned on topology change and no nodes exist to fulfill a read/write request. The same applies to READ_ONLY_ALL and READ_WRITE_ALL.
> > > > >
> > > > > 4. Calling resetLostPartitions doesn't guarantee full cache operation availability if the topology doesn't have at least one owner for each lost partition.
> > > > >
> > > > > The ultimate goal of the patch is to fix API inconsistencies and the most crucial bugs related to data loss handling.
> > > > >
> > > > > *The planned changes are:*
> > > > >
> > > > > 1. The safe policy is used by default, except for in-memory grids with baseline auto-adjust enabled [5] and a zero timeout [6]; in the latter case, the unsafe policy is used by default. This protects against unintentional data loss.
> > > > >
> > > > > 2. The lost state is never reset when grid nodes restart (even on a full restart). This makes real data loss impossible in persistent grids if the recovery instruction is followed.
> > > > >
> > > > > 3. The lost state cannot be reset if the topology doesn't have at least one owner for each lost partition. If the nodes are physically dead, they should be removed from the baseline before calling resetLostPartitions.
> > > > >
> > > > > 4. READ_WRITE_ALL and READ_ONLY_ALL are subject to deprecation because their guarantees are impossible to fulfill when the cluster is not on the full baseline.
> > > > >
> > > > > 5. Any operation that fails due to data loss contains CacheInvalidStateException as the root cause.
> > > > >
> > > > > In addition to the code fixes, I plan to write a tutorial on safe data loss recovery in persistent mode for the Ignite wiki.
> > > > >
> > > > > Any comments on the proposed changes are welcome.
> > > > >
> > > > > [1] org.apache.ignite.configuration.CacheConfiguration#setPartitionLossPolicy(PartitionLossPolicy partLossPlc)
> > > > > [2] org.apache.ignite.Ignite#resetLostPartitions(caches)
> > > > > [3] org.apache.ignite.IgniteCache#lostPartitions
> > > > > [4] https://issues.apache.org/jira/browse/IGNITE-10041
> > > > > [5] org.apache.ignite.IgniteCluster#baselineAutoAdjustEnabled(boolean)
> > > > > [6] org.apache.ignite.IgniteCluster#baselineAutoAdjustTimeout(long)
> > >
> > > --
> > >
> > > Best regards,
> > > Alexei Scherbakov
>
> --
>
> Best regards,
> Alexei Scherbakov
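
For readers skimming the archive, here is a minimal sketch of how the existing public API [1] [3] is used to opt into the safe policy discussed in the quoted proposal. It is only an illustration, not part of the patch; the cache name "deposits", the single started node, and READ_WRITE_SAFE as the chosen safe policy are assumptions made for brevity.

    import org.apache.ignite.Ignite;
    import org.apache.ignite.IgniteCache;
    import org.apache.ignite.Ignition;
    import org.apache.ignite.cache.PartitionLossPolicy;
    import org.apache.ignite.configuration.CacheConfiguration;
    import org.apache.ignite.configuration.IgniteConfiguration;

    public class SafeLossPolicyExample {
        public static void main(String[] args) {
            // Cache with one backup and a safe loss policy: reads and writes to
            // lost partitions fail instead of silently touching stale or absent data.
            CacheConfiguration<Integer, Long> depositsCfg = new CacheConfiguration<>("deposits");
            depositsCfg.setBackups(1);
            depositsCfg.setPartitionLossPolicy(PartitionLossPolicy.READ_WRITE_SAFE);

            IgniteConfiguration cfg = new IgniteConfiguration();
            cfg.setCacheConfiguration(depositsCfg);

            try (Ignite ignite = Ignition.start(cfg)) {
                IgniteCache<Integer, Long> deposits = ignite.cache("deposits");

                deposits.put(1, 100L);

                // Non-empty only after partition loss has been detected on this topology.
                System.out.println("Lost partitions: " + deposits.lostPartitions());
            }
        }
    }

(The zero-timeout baseline auto-adjust case mentioned in planned change 1 corresponds to calling ignite.cluster().baselineAutoAdjustEnabled(true) and ignite.cluster().baselineAutoAdjustTimeout(0) [5] [6].)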
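A similar sketch of the recovery flow implied by planned changes 2 and 3 and by the discussion at the top of the thread: either return the failed owners to the topology, or remove them from the baseline and then reset. Again, this is only an illustration against the current public API [2] [3]; the client node, the cache name "deposits", and the decision that the failed owners are physically dead (so the baseline is shrunk to the currently alive server nodes) are assumptions.

    import java.util.Collection;
    import java.util.Collections;

    import org.apache.ignite.Ignite;
    import org.apache.ignite.IgniteCache;
    import org.apache.ignite.Ignition;
    import org.apache.ignite.cluster.ClusterNode;
    import org.apache.ignite.configuration.IgniteConfiguration;

    public class ResetLostPartitionsSketch {
        public static void main(String[] args) {
            // Join the running cluster as a client node (assumes default discovery).
            Ignite ignite = Ignition.start(new IgniteConfiguration().setClientMode(true));

            IgniteCache<Integer, Long> deposits = ignite.cache("deposits");

            // Partitions currently marked LOST for this cache.
            Collection<Integer> lost = deposits.lostPartitions();

            if (!lost.isEmpty()) {
                // Assumption: the failed owners are physically dead, so they are removed
                // from the baseline by keeping only the server nodes that are alive now.
                // (The alternative is to return the owners to the topology first.)
                Collection<ClusterNode> alive = ignite.cluster().forServers().nodes();
                ignite.cluster().setBaselineTopology(alive);

                // With the proposed changes, the reset is rejected unless every lost
                // partition has at least one owner in the resulting topology.
                ignite.resetLostPartitions(Collections.singleton("deposits"));
            }
        }
    }

If the failed owners can be brought back instead, simply returning them to the topology and then calling resetLostPartitions is the safer option, as discussed above.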