Wed, 6 May 2020 at 12:54, Anton Vinogradov <a...@apache.org>:
> Alexei,
>
> 1, 2, 4, 5 - look good to me, no objections here.
>
> >> 3. Lost state is impossible to reset if a topology doesn't have at least
> >> one owner for each lost partition.
>
> Do you mean that, according to your example, where
> >> a node2 has left, soon a node3 has left. If the node2 is returned to
> >> the topology first, it would have stale data for some keys.
> we have to have node2 in the cluster to be able to reset "lost" to node2's
> data?

Not sure if I understand the question, but I'll try to answer using an example. Assume 3 nodes n1, n2, n3, 1 backup, persistence enabled, and partition p owned by n2 and n3.

1. The topology is activated.
2. cache.put(p, 0) // n2 and n3 have p->0, updateCounter=1
3. n2 fails.
4. cache.put(p, 1) // n3 has p->1, updateCounter=2
5. n3 fails; partition loss happens.
6. n2 joins the topology; it has stale data (p->0).

We actually have 2 issues:

7. cache.put(p, 2) will succeed: n2 has p->2, n3 has p->0. The data has diverged and will not be reconciled by counter-based rebalancing if n3 later joins the topology.

or

8. n3 joins the topology; it has the actual data (p->1), but rebalancing will not work because the joining node has the highest counter (it can only be a demander in this scenario).

In both cases rebalancing by counters will not work, causing data divergence between copies.

> >> at least one owner for each lost partition.
> What is the reason to have owners for all lost partitions when we want to
> reset only some (available)?

It was never possible to reset only a subset of lost partitions. The reason is to keep the guarantee of the resetLostPartitions method: all cache operations are resumed, and the data is correct.

> Will it be possible to perform operations on non-lost partitions when the
> cluster has at least one lost partition?

Yes, it will be.
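For illustration, the divergence in step 8 above can be modeled with a toy sketch (plain Java, no Ignite dependency; the Node class and field names are illustrative, not Ignite internals):

```java
/**
 * Toy model of per-partition update counters, showing why counter-based
 * rebalancing cannot reconcile the n2/n3 scenario described above.
 */
public class CounterDivergence {

    static class Node {
        final String name;
        Integer value;   // value of partition p on this node
        long counter;    // partition update counter
        Node(String name) { this.name = name; }
    }

    public static void main(String[] args) {
        Node n2 = new Node("n2"), n3 = new Node("n3");

        // Step 2: cache.put(p, 0) - both owners apply the update.
        n2.value = 0; n2.counter = 1;
        n3.value = 0; n3.counter = 1;

        // Steps 3-4: n2 fails, cache.put(p, 1) - only n3 applies it.
        n3.value = 1; n3.counter = 2;

        // Steps 5-6: n3 fails (partition loss), n2 returns with stale data.
        // Step 8: n3 rejoins. Counter-based rebalancing would pick the node
        // with the highest counter as the supplier - but that is the joining
        // node n3, which can only act as a demander, so nothing is transferred.
        Node highest = n3.counter > n2.counter ? n3 : n2;

        System.out.println("highest counter on: " + highest.name); // n3
        System.out.println("n2=" + n2.value + ", n3=" + n3.value); // n2=0, n3=1
        // Copies stay diverged: n2 holds the stale 0, n3 holds the actual 1.
    }
}
```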
>
> On Wed, May 6, 2020 at 11:45 AM Alexei Scherbakov <
> alexey.scherbak...@gmail.com> wrote:
>
> > Folks,
> >
> > I've almost finished a patch bringing some improvements to the data loss
> > handling code, and I wish to discuss the proposed changes with the
> > community before submitting.
> >
> > *The issue*
> >
> > During the grid's lifetime, it is possible to get into a situation where
> > some data nodes have failed or were mistakenly stopped. If the number of
> > stopped nodes exceeds a certain threshold depending on the configured
> > backup count, data loss will occur. For example, a grid having one
> > backup (meaning at least two copies of each data partition exist at the
> > same time) can tolerate the loss of only one node at a time. Generally,
> > data loss is guaranteed to occur if backups + 1 or more nodes have
> > failed simultaneously with the default affinity function.
> >
> > For in-memory caches, the data is lost forever. For persistent caches,
> > the data is not physically lost and becomes accessible again after the
> > failed nodes are returned to the topology.
> >
> > Possible data loss should be taken into consideration while designing
> > an application.
> >
> > *Consider an example: money is transferred from one deposit to another,
> > and all nodes holding data for one of the deposits are gone. In such a
> > case, the transaction temporarily cannot be completed until the cluster
> > is recovered from the data loss state. Ignoring this can cause data
> > inconsistency.*
> >
> > It is necessary to have an API telling us if an operation is safe to
> > complete from the perspective of data loss.
> >
> > Such an API has existed for some time [1] [2] [3]. In short, a grid can
> > be configured to switch caches to a partial availability mode if data
> > loss is detected.
> > Let's give two definitions according to the Javadoc for
> > *PartitionLossPolicy*:
> >
> > · *Safe* (data loss handling) *policy* - cache operations are only
> > available for non-lost partitions (PartitionLossPolicy != IGNORE).
> >
> > · *Unsafe policy* - cache operations are always possible
> > (PartitionLossPolicy = IGNORE). If the unsafe policy is configured,
> > lost partitions are automatically re-created on the remaining nodes if
> > needed, or immediately owned if the last supplier has left during
> > rebalancing.
> >
> > *What needs to be fixed*
> >
> > 1. The default loss policy is unsafe in the current implementation,
> > even for persistent caches. It can result in unintentional data loss
> > and the violation of business invariants.
> >
> > 2. Node restarts in a persistent grid with detected data loss will
> > cause an automatic reset of the LOST state after the restart, even if
> > the safe policy is configured. It can result in data loss or partition
> > desync if not all nodes are returned to the topology, or if they are
> > returned in the wrong order.
> >
> > *An example: a grid has three nodes, one backup. The grid is under
> > load. First, node2 leaves, and soon node3 leaves. If node2 is returned
> > to the topology first, it will have stale data for some keys. The most
> > recent data is on node3, which is not in the topology yet. Because the
> > lost state was reset, all caches are fully available and will most
> > probably become inconsistent even in safe mode.*
> >
> > 3. The configured loss policy doesn't provide the guarantees described
> > in the Javadoc, depending on the cluster configuration [4]. In
> > particular, the unsafe policy (IGNORE) cannot be guaranteed if the
> > baseline is fixed (not automatically readjusted when a node leaves),
> > because partitions do not automatically get reassigned on topology
> > change, and no nodes exist to fulfill a read/write request. The same
> > holds for READ_ONLY_ALL and READ_WRITE_ALL.
> >
> > 4.
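The safe/unsafe distinction above maps directly to the cache configuration API referenced in [1]. A minimal sketch of opting into a safe policy, assuming a hypothetical cache named "deposits" (the cache name and generic types are illustrative):

```java
import org.apache.ignite.cache.PartitionLossPolicy;
import org.apache.ignite.configuration.CacheConfiguration;

CacheConfiguration<Integer, String> ccfg = new CacheConfiguration<>("deposits");

// Safe policy: reads and writes against lost partitions fail until
// Ignite#resetLostPartitions(...) is called; non-lost partitions stay usable.
ccfg.setPartitionLossPolicy(PartitionLossPolicy.READ_WRITE_SAFE);

// The unsafe behavior described above corresponds to PartitionLossPolicy.IGNORE.
```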
> > Calling resetLostPartitions doesn't guarantee full cache operation
> > availability if the topology doesn't have at least one owner for each
> > lost partition.
> >
> > The ultimate goal of the patch is to fix API inconsistencies and the
> > most crucial bugs related to data loss handling.
> >
> > *The planned changes are:*
> >
> > 1. The safe policy is used by default, except for in-memory grids with
> > baseline auto-adjust enabled [5] with zero timeout [6]. In the latter
> > case, the unsafe policy is used by default. This protects from
> > unintentional data loss.
> >
> > 2. The lost state is never reset when grid nodes restart (even after a
> > full restart). This makes real data loss impossible in persistent grids
> > if the recovery instruction is followed.
> >
> > 3. The lost state is impossible to reset if the topology doesn't have
> > at least one owner for each lost partition. If nodes are physically
> > dead, they should first be removed from the baseline before calling
> > resetLostPartitions.
> >
> > 4. READ_WRITE_ALL and READ_ONLY_ALL are subject to deprecation because
> > their guarantees are impossible to fulfill when not on the full
> > baseline.
> >
> > 5. Any operation failed due to data loss contains
> > CacheInvalidStateException as a root cause.
> >
> > In addition to the code fixes, I plan to write a tutorial for safe data
> > loss recovery in persistent mode in the Ignite wiki.
> >
> > Any comments on the proposed changes are welcome.
> >
> > [1]
> > org.apache.ignite.configuration.CacheConfiguration#setPartitionLossPolicy(PartitionLossPolicy
> > partLossPlc)
> > [2] org.apache.ignite.Ignite#resetLostPartitions(caches)
> > [3] org.apache.ignite.IgniteCache#lostPartitions
> > [4] https://issues.apache.org/jira/browse/IGNITE-10041
> > [5] org.apache.ignite.IgniteCluster#baselineAutoAdjustEnabled(boolean)
> > [6] org.apache.ignite.IgniteCluster#baselineAutoAdjustTimeout(long)
>
> --
> Best regards,
> Alexei Scherbakov
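Putting points 2-3 and the APIs in [2], [3] together, the recovery flow implied above can be sketched as follows. This is a hedged sketch, not the patch itself: the cache name "deposits" is an assumption, the baseline-adjustment line is shown commented out because excluding dead nodes is a destructive, deployment-specific step, and it requires a running cluster to execute.

```java
import java.util.Collections;
import org.apache.ignite.Ignite;

public class LossRecoverySketch {
    /** Sketch of recovering from partition loss under the safe policy. */
    static void recover(Ignite ignite) {
        // Inspect which partitions were declared lost [3].
        System.out.println("lost: " + ignite.cache("deposits").lostPartitions());

        // Per planned change 3: either return the failed nodes to the
        // topology, or, if they are physically dead, remove them from the
        // baseline first, e.g.:
        // ignite.cluster().setBaselineTopology(ignite.cluster().forServers().nodes());

        // Only then reset the lost state to resume cache operations [2].
        ignite.resetLostPartitions(Collections.singleton("deposits"));
    }
}
```

Note the ordering: calling resetLostPartitions while some lost partition still has no owner in the topology is exactly what planned change 3 forbids.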