Alex,

CockroachDB is based on Raft and is able to repair itself automatically [1]
[2]. Their approach looks reasonable to me and is quite similar to
MongoDB's and Cassandra's. In short, you distinguish between short-term
and long-term failures:
1) First, you wait for a small time window in the hope that it was a
network glitch or a restart. Even if it was a segmentation, with a true
consensus algorithm this is not an issue - some partitions or the whole
cluster will simply be unavailable during this window.
2) Then, if the majority is still there and the cluster is operational,
you trigger an automatic rebalance.
3) Last, if you need fine-grained control, you can tune or disable
auto-rebalance and do some manual magic.
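
To illustrate, here is a minimal sketch of how such a policy could be
exposed in Ignite. setFailureDetectionTimeout() is the existing API; the
auto-rebalance setters are hypothetical names I made up for steps 2 and 3:

    IgniteConfiguration cfg = new IgniteConfiguration();

    // Step 1: short grace window for network glitches and restarts (existing API).
    cfg.setFailureDetectionTimeout(10_000);

    // Steps 2 and 3: hypothetical knobs, these do NOT exist in Ignite today.
    // cfg.setAutoRebalanceEnabled(true);     // rebalance once the long-term window expires
    // cfg.setAutoRebalanceDelay(5 * 60_000); // wait before declaring a long-term failure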

This is a very nice approach: it is simple for simple use cases and
complex for complex use cases. Ideally, this is how Ignite should work.
Want to play and write a hello-world app? Just learn what a cache is.
Started developing a moderately complex application? Learn about affinity,
cache modes, etc. Going to enterprise scale? Learn about BLAT, activation,
etc.
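
To show how low the entry barrier already is at the hello-world level
(standard public API, minimal sketch):

    import org.apache.ignite.Ignite;
    import org.apache.ignite.IgniteCache;
    import org.apache.ignite.Ignition;

    public class HelloIgnite {
        public static void main(String[] args) {
            // Start a node with the default configuration, then put and read a value.
            try (Ignite ignite = Ignition.start()) {
                IgniteCache<Integer, String> cache = ignite.getOrCreateCache("hello");
                cache.put(1, "world");
                System.out.println("1 -> " + cache.get(1));
            }
        }
    }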

It seems that the old behavior, without BLAT and even without manual
activation, would be enough for the majority of our users. At the very
least, it is enough for Cassandra and MongoDB, which are an order of
magnitude more popular.
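
For reference, these are the extra steps a user of a persistent cluster has
to learn today (a sketch against the current 2.4 API; "config.xml" is just
a placeholder path):

    // With persistence enabled, the cluster starts inactive and must be activated manually.
    Ignite ignite = Ignition.start("config.xml");
    ignite.cluster().active(true); // also establishes the initial BLAT

    // After server nodes are added or removed, BLAT must be reset explicitly:
    ignite.cluster().setBaselineTopology(ignite.cluster().topologyVersion());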

[1]
https://www.cockroachlabs.com/docs/stable/frequently-asked-questions.html#how-does-cockroachdb-survive-failures
[2]
https://www.cockroachlabs.com/docs/stable/training/fault-tolerance-and-automated-repair.html

On Tue, Apr 24, 2018 at 7:55 PM, Alexey Goncharuk <
alexey.goncha...@gmail.com> wrote:

> Vladimir,
>
> Automatic cluster membership changes may be implemented to grow the
> topology, but auto-shrinking the topology is usually not possible because a
> process cannot distinguish between a node shutdown and network
> partitioning. If we want to deal with split-brain scenarios as a grown-up
> system, we should change the replication strategy within partitions to a
> consensus algorithm (I really hope we will). None of the consensus
> algorithms (at least those known to me - Paxos, Raft, ZAB) performs automatic
> cluster adjustments based on an internally-detected process failure. I consider
> baseline topology as a step towards this model.
>
> Addressing your second concern: if a node was down for a short period of
> time, we should (and we do) rebalance only deltas, which is faster than
> erasing the whole node and moving all data from scratch.
>
> 2018-04-24 19:42 GMT+03:00 Vladimir Ozerov <voze...@gridgain.com>:
>
> > Ivan,
> >
> > This reasoning sounds questionable to me. First, separate logic for
> > in-memory and persistent regions means that we lose collocation between
> > persistent and non-persistent caches. Second, the “data is still on disk”
> > assumption might not be valid if a node has left due to a disk crash, or
> > when data is updated on the remaining nodes.
> >
> > On Tue, Apr 24, 2018 at 19:21, Ivan Rakov <ivan.glu...@gmail.com>:
> >
> > > Stan,
> > >
> > > I believe it was discussed at the design proposal thread:
> > >
> > > http://apache-ignite-developers.2346864.n4.nabble.com/Cluster-auto-activation-design-proposal-td20295.html
> > >
> > > The short answer: the backup factor decreases if a node leaves. In
> > > non-persistent mode we have to rebalance data ASAP - otherwise the last
> > > node that owns a partition may fail and the data will be lost forever.
> > > This is not necessary if data is persisted to disk storage; that's the
> > > reason for the Baseline Topology concept.
> > >
> > > Best Regards,
> > > Ivan Rakov
> > >
> > > On 24.04.2018 18:48, Stanislav Lukyanov wrote:
> > > > +1 for Vladimir's point - adding more complexity may (and likely
> > > > will) be even more misleading.
> > > >
> > > > Can we take a step back and discuss why we need to have different
> > > > behavior for persistent and in-memory caches? Can we make in-memory
> > > > caches honor the baseline instead of special-casing them?
> > > >
> > > > Thanks,
> > > > Stan
> > > >
> > > >
> > > On Tue, Apr 24, 2018 at 18:28, Vladimir Ozerov <voze...@gridgain.com>:
> > > >
> > > >> Guys,
> > > >>
> > > >> As a user I definitely do not want to think about BLATs, SATs, DATs,
> > > >> whatsoever. I want to query data, iterate over data, and send compute
> > > >> tasks to data. If a certain node is outside of BLAT and does not have
> > > >> data, then it is not an affinity node. Can we just fix the affinity
> > > >> logic to take BLAT into account appropriately?
> > > >>
> > > >> On Tue, Apr 24, 2018 at 6:12 PM, Ivan Rakov <ivan.glu...@gmail.com>
> > > wrote:
> > > >>
> > > >>> Eduard,
> > > >>>
> > > >>> Can you please summarize the code changes that you are proposing?
> > > >>> I agree that BLT is a somewhat misleading term and that DAT/SAT make
> > > >>> more sense. However, establishing a consensus on the v2.4 Baseline
> > > >>> Topology terminology took a long time, and it seems you are going to
> > > >>> cause a few more perturbations.
> > > >>> I still don't understand what should be changed and how. Please
> > > >>> provide a summary of the upcoming class renamings and changes to
> > > >>> existing system parts.
> > > >>>
> > > >>> Best Regards,
> > > >>> Ivan Rakov
> > > >>>
> > > >>>
> > > >>> On 24.04.2018 17:46, Eduard Shangareev wrote:
> > > >>>
> > > >>>> Hi, Igniters,
> > > >>>>
> > > >>>> I want to raise a topic about our affinity node definition.
> > > >>>>
> > > >>>> After adding baseline (affinity) topology (BL(A)T), things have
> > > >>>> started to get complicated.
> > > >>>>
> > > >>>> Plenty of bugs have appeared:
> > > >>>>
> > > >>>> IGNITE-8173
> > > >>>> The ignite.getOrCreateCache(cacheConfig).iterator() method works
> > > >>>> incorrectly for a replicated cache if some data node isn't in the
> > > >>>> baseline.
> > > >>>>
> > > >>>> IGNITE-7628
> > > >>>> SqlQuery hangs indefinitely when an additional node that is not
> > > >>>> registered in the baseline is present.
> > > >>>>
> > > >>>> It's because everything relies on the concept of an "affinity node".
> > > >>>> Until now it was as simple as a server node which passes the node
> > > >>>> filter - in other words, any server node which is not filtered out
> > > >>>> by the node filter.
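> > > >>>>
> > > >>>> For clarity, a node filter is just a predicate over cluster nodes,
> > > >>>> e.g. (standard API; the "role" attribute is only an illustration):
> > > >>>>
> > > >>>>     CacheConfiguration<Integer, String> ccfg =
> > > >>>>         new CacheConfiguration<>("myCache");
> > > >>>>     ccfg.setNodeFilter(n -> "data".equals(n.attribute("role")));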
> > > >>>>
> > > >>>> But a node which is not in BL(A)T and which passes the node filter
> > > >>>> would be treated as an affinity node, and that's definitely wrong.
> > > >>>> At the very least, it is a source of many bugs (I believe there are
> > > >>>> many more than the two I have already mentioned).
> > > >>>>
> > > >>>> It's clear that this definition should be changed.
> > > >>>> Let's start with a new definition of "affinity topology": the
> > > >>>> affinity topology is the set of nodes which could potentially keep
> > > >>>> data.
> > > >>>>
> > > >>>> Using knowledge of the current implementation, we can say that:
> > > >>>> 1. for in-memory cache groups it would be all server nodes;
> > > >>>> 2. for persistent cache groups it would be BL(A)T.
> > > >>>>
> > > >>>> I will further use Dynamic Affinity Topology (DAT) for case 1
> > > >>>> (in-memory cache groups) and Static Affinity Topology (SAT), instead
> > > >>>> of BL(A)T, for case 2.
> > > >>>> Denote the node filter as f(X), where X is an affinity topology.
> > > >>>>
> > > >>>> Then we can say that node A is an affinity node if
> > > >>>> A ∈ AT', where AT' = f(AT) and AT is the DAT or the SAT.
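> > > >>>>
> > > >>>> In code terms, a minimal sketch of the proposed check (all names are
> > > >>>> hypothetical, for illustration only):
> > > >>>>
> > > >>>>     boolean isAffinityNode(ClusterNode a, boolean persistent) {
> > > >>>>         // AT is the DAT (all server nodes) or the SAT (i.e. BL(A)T).
> > > >>>>         Collection<ClusterNode> at = persistent ? sat() : dat();
> > > >>>>         // A ∈ AT' = f(AT): node is in AT and passes the node filter f.
> > > >>>>         return at.contains(a) && nodeFilter.apply(a);
> > > >>>>     }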
> > > >>>>
> > > >>>> It is worth mentioning that AT' is what should be passed to the
> > > >>>> affinity function of cache groups.
> > > >>>> Also, AT and AT' can change over time (BL(A)T changes, or nodes
> > > >>>> join/disconnect).
> > > >>>>
> > > >>>> And I don't like the fact that the choice between DAT and SAT relies
> > > >>>> on persistence settings (should we make it configurable per cache
> > > >>>> group?).
> > > >>>>
> > > >>>> OK, I have created a ticket to implement these changes and will
> > > >>>> start working on it:
> > > >>>> https://issues.apache.org/jira/browse/IGNITE-8380 (Affinity node
> > > >>>> calculation doesn't take BLT into account).
> > > >>>>
> > > >>>> Also, I want to use these definitions (Affinity Topology, Affinity
> > > >>>> Node, DAT, SAT) in the documentation and Javadocs.
> > > >>>>
> > > >>>> Maybe we should also consider replacing BL(A)T with SAT.
> > > >>>>
> > > >>>> Thank you for your attention.
> > > >>>>
> > > >>>>
> > >
> > >
> >
>
