Right, as far as I understand, we are not arguing about whether BLT is needed. The main questions are how to properly deliver this feature to users and how to deal with co-location issues between persistent and non-persistent caches. It looks like baseline change policies are the way to go for the first question.
As for co-location, it is important to note that a different affinity distribution for in-memory and persistent caches automatically means that we lose SQL joins and predictable behavior of any affinity-based operations. It means that if we calculated the same affinity for persistent and in-memory caches at some point, we cannot re-distribute in-memory caches differently when some nodes go down without breaking co-located computations, am I right?

On Tue, Apr 24, 2018 at 10:19 PM, Alexey Goncharuk <alexey.goncha...@gmail.com> wrote:

> Well, this means that the concept of baseline is still needed because we
> must not reassign partitions immediately (note that this is not identical
> to rebalance delay!). The approach you describe is identical to baseline
> change policies and I have nothing against this; their implementation was
> planned for phase II of baseline changes.
>
> 2018-04-24 21:31 GMT+03:00 Vladimir Ozerov <voze...@gridgain.com>:
>
> > Alex,
> >
> > CockroachDB is based on RAFT and is able to repair itself automatically
> > [1] [2]. Their approach looks reasonable to me and is pretty much similar
> > to MongoDB and Cassandra. In short, you distinguish between short-term
> > and long-term failures.
> > 1) First, you wait for a small time window in the hope that it was a
> > network glitch or restart. Even if this was a segmentation, with a true
> > consensus algorithm this is not an issue - your partitions or the whole
> > cluster are unavailable during this window.
> > 2) Then, if the majority is still there and the cluster is operational,
> > you trigger automatic rebalance.
> > 3) Last, if you need fine-grained control you can tune or disable
> > auto-rebalance and do some manual magic.
> >
> > This is a very nice approach: it is simple for simple use cases and
> > complex for complex use cases. Ideally, this is how Ignite should work.
> > Want to play and write a hello-world app? Just learn what a cache is.
> > Started developing a moderately complex application?
> > Learn about affinity, cache modes, etc. Going to enterprise scale?
> > Learn about BLAT, activation, etc.
> >
> > It seems that the old behavior without BLAT and even without manual
> > activation would be enough for the majority of our users. At the very
> > least it is enough for Cassandra and MongoDB, which are an order of
> > magnitude more popular.
> >
> > [1]
> > https://www.cockroachlabs.com/docs/stable/frequently-asked-questions.html#how-does-cockroachdb-survive-failures
> > [2]
> > https://www.cockroachlabs.com/docs/stable/training/fault-tolerance-and-automated-repair.html
> >
> > On Tue, Apr 24, 2018 at 7:55 PM, Alexey Goncharuk <alexey.goncha...@gmail.com> wrote:
> >
> > > Vladimir,
> > >
> > > Automatic cluster membership changes may be implemented to grow the
> > > topology, but auto-shrinking the topology is usually not possible
> > > because a process cannot distinguish between a node shutdown and
> > > network partitioning. If we want to deal with split-brain scenarios
> > > as a grown-up system, we should change the replication strategy
> > > within partitions to a consensus algorithm (I really hope we will).
> > > None of the consensus algorithms (at least those known to me - Paxos,
> > > Raft, ZAB) do automatic cluster adjustments based on an
> > > internally-detected process failure. I consider baseline topology a
> > > step towards this model.
> > >
> > > Addressing your second concern: if a node was down for a short period
> > > of time, we should (and we do) rebalance only deltas, which is faster
> > > than erasing the whole node and moving all data from scratch.
> > >
> > > 2018-04-24 19:42 GMT+03:00 Vladimir Ozerov <voze...@gridgain.com>:
> > >
> > > > Ivan,
> > > >
> > > > This reasoning sounds questionable to me. First, separate logic for
> > > > in-memory and persistent regions means that we lose co-location
> > > > between persistent and non-persistent caches.
> > > > Second, the “data is still on disk” assumption might not be valid
> > > > if the node has left due to a disk crash, or when data is updated
> > > > on the remaining nodes.
> > > >
> > > > Tue, Apr 24, 2018 at 19:21, Ivan Rakov <ivan.glu...@gmail.com>:
> > > >
> > > > > Stan,
> > > > >
> > > > > I believe it was discussed in the design proposal thread:
> > > > >
> > > > > http://apache-ignite-developers.2346864.n4.nabble.com/Cluster-auto-activation-design-proposal-td20295.html
> > > > >
> > > > > The short answer: the backup factor decreases if a node leaves.
> > > > > In non-persistent mode we have to rebalance data ASAP - otherwise
> > > > > the last node that owns a partition may fail and the data will be
> > > > > lost forever.
> > > > > This is not necessary if data is persisted to disk storage; that
> > > > > is the reason for the Baseline Topology concept.
> > > > >
> > > > > Best Regards,
> > > > > Ivan Rakov
> > > > >
> > > > > On 24.04.2018 18:48, Stanislav Lukyanov wrote:
> > > > > > + for Vladimir's point - adding more complexity may (and likely
> > > > > > will) be even more misleading.
> > > > > >
> > > > > > Can we take a step back and discuss why we need to have
> > > > > > different behavior for persistent and in-memory caches? Can we
> > > > > > make in-memory caches honor baseline instead of special-casing
> > > > > > them?
> > > > > >
> > > > > > Thanks,
> > > > > > Stan
> > > > > >
> > > > > >
> > > > > > Tue, Apr 24, 2018, 18:28 Vladimir Ozerov <voze...@gridgain.com>:
> > > > > >
> > > > > >> Guys,
> > > > > >>
> > > > > >> As a user I definitely do not want to think about BLATs, SATs,
> > > > > >> DATs, whatsoever. I want to query data, iterate over data, send
> > > > > >> compute tasks to data. If a certain node is outside of BLAT and
> > > > > >> does not have data, then it is not an affinity node. Can we just
> > > > > >> fix the affinity logic to take BLAT into account appropriately?
> > > > > >>
> > > > > >> On Tue, Apr 24, 2018 at 6:12 PM, Ivan Rakov <ivan.glu...@gmail.com> wrote:
> > > > > >>
> > > > > >>> Eduard,
> > > > > >>>
> > > > > >>> Can you please summarize the code changes that you are proposing?
> > > > > >>> I agree that BLT is a bit of a misleading term and DAT/SAT make
> > > > > >>> more sense. However, establishing a consensus on the v2.4
> > > > > >>> Baseline Topology terminology took a long time, and it seems
> > > > > >>> like you are going to cause a bit more perturbation.
> > > > > >>> I still don't understand what should be changed and how. Please
> > > > > >>> provide a summary of the upcoming class renamings and changes to
> > > > > >>> existing system parts.
> > > > > >>>
> > > > > >>> Best Regards,
> > > > > >>> Ivan Rakov
> > > > > >>>
> > > > > >>>
> > > > > >>> On 24.04.2018 17:46, Eduard Shangareev wrote:
> > > > > >>>
> > > > > >>>> Hi, Igniters,
> > > > > >>>>
> > > > > >>>> I want to raise a topic about our affinity node definition.
> > > > > >>>>
> > > > > >>>> After adding baseline (affinity) topology (BL(A)T), things have
> > > > > >>>> started to get complicated.
> > > > > >>>>
> > > > > >>>> Plenty of bugs have appeared:
> > > > > >>>>
> > > > > >>>> IGNITE-8173
> > > > > >>>> ignite.getOrCreateCache(cacheConfig).iterator() method works
> > > > > >>>> incorrect for replicated cache in case if some data node isn't
> > > > > >>>> in baseline
> > > > > >>>>
> > > > > >>>> IGNITE-7628
> > > > > >>>> SqlQuery hangs indefinitely with additional not registered in
> > > > > >>>> baseline node.
> > > > > >>>>
> > > > > >>>> It's because everything relies on the concept of an "affinity
> > > > > >>>> node". Until now it was as simple as a server node which passes
> > > > > >>>> the node filter. In other words, any server node which is not
> > > > > >>>> filtered out by the node filter.
> > > > > >>>>
> > > > > >>>> But a node which is not in BL(A)T and which passes the node
> > > > > >>>> filter would be treated as an affinity node. And that is
> > > > > >>>> definitely wrong. At least, it is a source of many bugs (I
> > > > > >>>> believe there are many more than the 2 which I have already
> > > > > >>>> mentioned).
> > > > > >>>>
> > > > > >>>> It's clear that this definition should be changed.
> > > > > >>>> Let's start with a new definition of "Affinity topology".
> > > > > >>>> Affinity topology is a set of nodes which potentially could
> > > > > >>>> keep data.
> > > > > >>>>
> > > > > >>>> If we use knowledge about the current implementation, we can
> > > > > >>>> say that
> > > > > >>>> 1. for in-memory cache groups it would be all server nodes;
> > > > > >>>> 2. for persistent cache groups it would be BL(A)T.
> > > > > >>>>
> > > > > >>>> I will further use Dynamic Affinity Topology, or DAT, for
> > > > > >>>> point 1 (in-memory cache groups) and Static Affinity Topology,
> > > > > >>>> or SAT, instead of BL(A)T, for point 2.
> > > > > >>>>
> > > > > >>>> Denote the node filter as f(X), where X is an affinity topology.
> > > > > >>>>
> > > > > >>>> Then we can say that node A is an affinity node if
> > > > > >>>> A ∈ AT', where AT' = f(AT), where AT is DAT or SAT.
> > > > > >>>>
> > > > > >>>> It is worth mentioning that AT' is what should be passed to the
> > > > > >>>> affinity function of cache groups.
> > > > > >>>> Also, AT and AT' could change over time (BL(A)T changes or
> > > > > >>>> node joins/disconnections).
> > > > > >>>>
> > > > > >>>> And I don't like the fact that the usage of DAT or SAT relies
> > > > > >>>> on persistence settings (should we make it configurable per
> > > > > >>>> cache group?).
> > > > > >>>>
> > > > > >>>> Ok, I have created a ticket to implement these changes and will
> > > > > >>>> start working on it.
> > > > > >>>> https://issues.apache.org/jira/browse/IGNITE-8380 (Affinity
> > > > > >>>> node calculation doesn't take into account BLT).
> > > > > >>>>
> > > > > >>>> Also, I want to use these definitions (Affinity Topology,
> > > > > >>>> Affinity Node, DAT, SAT) in the documentation and Javadocs.
> > > > > >>>>
> > > > > >>>> Maybe we should also consider replacing BL(A)T with SAT.
> > > > > >>>>
> > > > > >>>> Thank you for your attention.
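
P.S. To make the affinity node definition quoted in this thread (A ∈ AT', where AT' = f(AT) and AT is DAT or SAT) a bit more concrete, here is a minimal sketch in plain Java. It is not Ignite code: the class name, the method names and the use of strings as node IDs are all hypothetical and exist only for illustration, and it assumes that the choice between DAT and SAT is driven purely by the persistence flag, as described in the quoted proposal.

```java
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.Set;
import java.util.function.Predicate;

// Hypothetical sketch: none of these names are real Ignite APIs.
final class AffinityTopologySketch {
    // Picks AT: the baseline (SAT) for persistent cache groups,
    // all server nodes (DAT) for in-memory cache groups.
    static Set<String> affinityTopology(Set<String> serverNodes, Set<String> baseline, boolean persistent) {
        return persistent ? baseline : serverNodes;
    }

    // Computes AT' = f(AT): the nodes of AT that pass the node filter.
    static Set<String> affinityNodes(Set<String> at, Predicate<String> nodeFilter) {
        Set<String> filtered = new LinkedHashSet<>();
        for (String node : at) {
            if (nodeFilter.test(node)) {
                filtered.add(node);
            }
        }
        return filtered;
    }

    // Node A is an affinity node if A is a member of AT'.
    static boolean isAffinityNode(String node, Set<String> serverNodes, Set<String> baseline,
                                  boolean persistent, Predicate<String> nodeFilter) {
        Set<String> at = affinityTopology(serverNodes, baseline, persistent);
        return affinityNodes(at, nodeFilter).contains(node);
    }

    public static void main(String[] args) {
        // Three server nodes are alive, but only two of them are in the baseline.
        Set<String> servers = new LinkedHashSet<>(Arrays.asList("n1", "n2", "n3"));
        Set<String> baseline = new LinkedHashSet<>(Arrays.asList("n1", "n2"));
        Predicate<String> acceptAll = node -> true;

        System.out.println("n3 is affinity node (in-memory group): "
            + isAffinityNode("n3", servers, baseline, false, acceptAll));
        System.out.println("n3 is affinity node (persistent group): "
            + isAffinityNode("n3", servers, baseline, true, acceptAll));
    }
}
```

With the same node filter, node n3 ends up as an affinity node for the in-memory group but not for the persistent one. The two groups are therefore handed different node sets for their affinity functions, which is exactly the situation where co-location between persistent and in-memory caches can no longer be relied upon.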