> Out of curiosity, what was the reason for the 'join' command behaviour to change?
1. Existing bugs/limitations. For example, joining two entire clusters together was not an entirely safe operation. In some cases, the newly formed cluster would not correctly converge, leaving the ring/cluster in flux. Likewise, we realized that many users were joining two clusters together by accident and would prefer additional safety. In particular, joining two clusters with overlapping data but no common vector clock relationship could result in data loss as unintended siblings were reconciled.

2. It was a necessary consequence of how the new cluster code works. In the new code, the cluster state / ring is only ever mutated by a single node at a time. This is done by having a cluster-wide claimant, as mentioned in my original email. Given the claimant approach, all cluster state / ring changes are totally ordered. When a new node joins an existing cluster, it throws away its existing ring and replaces it with a copy of the ring from the target cluster, thus joining into the same cluster history. If you were to join two clusters together, we would need to deterministically merge two independent cluster histories and elect a single new claimant for the new cluster. This is easy when there are no node failures or net-splits during joining, but less trivial when there are errors. The entire new cluster code was heavily modeled before implementation, and in that modeling work several corner cases related to failures were found that were hard to address in a cluster/cluster join but easy to fix in a node/cluster join. Thus, I went with the simple and correct approach.
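To make that concrete, here is roughly what a node/cluster join looks like from the operator's side in the 1.0 prereleases (node names below are just placeholders):

    # Each joining node must be a singleton (1-node) cluster. It discards
    # its own ring and adopts a copy of the target cluster's ring:
    nodeB$ riak-admin join nodeA
    nodeC$ riak-admin join nodeA

    # There is no need to wait on 'riak-admin ringready' between joins;
    # membership and ownership transfers can be checked at any time:
    nodeA$ riak-admin member_status
    nodeA$ riak-admin ring_status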
-Joe

--
Joseph Blomstedt <j...@basho.com>
Software Engineer
Basho Technologies, Inc.
http://www.basho.com/

On Thu, Sep 8, 2011 at 5:19 AM, Jens Rantil <jens.ran...@telavox.se> wrote:
> Out of curiosity, what was the reason for the 'join' command behaviour to change?
>
> Regards,
>
> Jens
>
> -----------------------------------------------------------
>
> > Date: Wed, 7 Sep 2011 18:12:40 -0600
> > From: Joseph Blomstedt <j...@basho.com>
> > To: riak-users Users <riak-users@lists.basho.com>
> > Subject: Riak Clustering Changes in 1.0
> > Message-ID: <CANvk2KRPpath-ZJaDuhZo0b+pEByn0MhxyJ-9LEB+b_12d=v...@mail.gmail.com>
> > Content-Type: text/plain; charset=ISO-8859-1
> >
> > Given that 1.0 prerelease packages are now available, I wanted to mention some changes to Riak's clustering capabilities in 1.0. In particular, there are some subtle semantic differences in the riak-admin commands. More complete docs will be updated in the near future, but I hope a quick email suffices for now.
> >
> > [nodeB/riak-admin join nodeA] is now strictly one-way. It joins nodeB to the cluster that nodeA is a member of. This is semantically different from pre-1.0 Riak, in which join essentially joined clusters together rather than joining a node to a cluster. As part of this change, the joining node (nodeB in this case) must be a singleton (1-node) cluster.
> >
> > In pre-1.0, leave and remove were essentially the same operation, with leave just being an alias for 'remove this-node'. This has changed. Leave and remove are now very different operations.
> >
> > [nodeB/riak-admin leave] is the only safe way to have a node leave the cluster, and it must be executed by the node that you want to remove. In this case, nodeB will start leaving the cluster, and will not leave the cluster until after it has handed off all its data. Even if nodeB is restarted (crashed/shutdown/whatever), it will remain in the leave state and continue handing off partitions until done. After handoff, it will leave the cluster and eventually shut down.
> >
> > [nodeA/riak-admin remove nodeB] immediately removes nodeB from the cluster, without handing off its data. All replicas held by nodeB are therefore lost and will need to be re-generated through read-repair. Use this command carefully. It's intended for nodes that are permanently unrecoverable and for which handoff therefore doesn't make sense. By the final 1.0 release, this command may be renamed "force-remove" just to make the distinction clear.
> >
> > There are now two new commands that provide additional insight into the cluster: [riak-admin member_status] and [riak-admin ring_status].
> >
> > Underneath, the clustering protocol has been mostly re-written. The new approach has the following advantages:
> >
> > 1. It is no longer necessary to wait on [riak-admin ringready] in between adding/removing nodes from the cluster, and adding/removing is also much more sound/graceful. Starting up 16 nodes and issuing [nodeX: riak-admin join node1] for X=1:16 should just work.
> >
> > 2. Data is first transferred to new partition owners before handing over partition ownership. This change fixes numerous bugs, such as 404s/not_founds during ownership changes. The Ring/Pending columns in [riak-admin member_status] visualize this at a high level, and the full transfer status in [riak-admin ring_status] provides additional insight.
> >
> > 3. All partition ownership decisions are now made by a single node in the cluster (the claimant). Any node can be the claimant, and the duty is automatically taken over if the previous claimant is removed from the cluster. [riak-admin member_status] will list the current claimant.
> >
> > 4. Handoff related to ownership changes can now occur under load; hinted handoff still only occurs when a vnode is inactive. This change allows a cluster to scale up/down under load, although this needs to be further benchmarked and tuned before 1.0.
> >
> > To support all of the above, a new limitation has been introduced. Cluster changes (member addition/removal, ring rebalance, etc.) can only occur when all nodes are up and reachable. [riak-admin ring_status] will complain when this is not the case. If a node is down, you must issue [riak-admin down <node>] to mark the node as down, and the remaining nodes will then proceed to converge as usual. Once the down node comes back online, it will automatically re-integrate into the cluster. However, there is nothing preventing client requests from being served by a down node before it re-integrates. Before issuing [down <node>], make sure to update your load balancers / connection pools to not include this node. Future releases of Riak may make offlining a node an automatic operation, but it's a user-initiated action in 1.0.
> >
> > -Joe
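The commands described above combine into a typical node-replacement sequence. A rough sketch follows (node names are placeholders again, and keep in mind that 'remove' may become 'force-remove' by the final 1.0 release):

    # nodeC has died and cannot be recovered. Take it out of your load
    # balancers / connection pools first, then mark it down so the
    # remaining nodes can converge without it:
    nodeA$ riak-admin down nodeC

    # Remove it permanently. Its replicas are NOT handed off and will be
    # re-generated through read-repair, so only do this for unrecoverable nodes:
    nodeA$ riak-admin remove nodeC

    # Bring in a fresh replacement node and watch the ownership transfers:
    nodeD$ riak-admin join nodeA
    nodeA$ riak-admin ring_status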