> Out of curiosity, what was the reason for the 'join' command behaviour to change?
1. Existing bugs/limitations. For example, joining two entire clusters together was not an entirely safe operation. In some cases, the newly formed cluster would not correctly converge, leaving the ring/cluster in flux. Likewise, we realized that many users were joining two clusters together by accident and would prefer additional safety. In particular, joining two clusters with overlapping data but no common vector clock relationship could result in data loss as unintended siblings were reconciled.

2. It was a necessary consequence of how the new cluster code works. In the new code, the cluster state / ring is only ever mutated by a single node at a time. This is done by having a cluster-wide claimant, as mentioned in my original email. Given the claimant approach, all cluster state / ring changes are totally ordered. When a new node joins an existing cluster, it throws away its existing ring and replaces it with a copy of the ring from the target cluster, thus joining into the same cluster history. If you were to join two clusters together, we would need to deterministically merge two independent cluster histories and elect a single new claimant for the new cluster. This is easy when there are no node failures or net-splits during joining, but less trivial when there are errors. The entire new cluster code was heavily modeled before implementation, and in that modeling work several corner cases related to failures were found that were hard to address in a cluster/cluster join but easy to fix in a node/cluster join. Thus, I went with the simple and correct approach.
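To make that concrete, here is roughly what a node/cluster join looks like from the operator's side in the 1.0 prereleases (node names below are just placeholders):

    # Each joining node must be a singleton (1-node) cluster. It discards
    # its own ring and adopts a copy of the target cluster's ring:
    nodeB$ riak-admin join nodeA
    nodeC$ riak-admin join nodeA

    # There is no need to wait on 'riak-admin ringready' between joins;
    # membership and ownership transfers can be checked at any time:
    nodeA$ riak-admin member_status
    nodeA$ riak-admin ring_status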
-Joe

--
Joseph Blomstedt <j...@basho.com>
Software Engineer
Basho Technologies, Inc.
http://www.basho.com/

On Thu, Sep 8, 2011 at 5:19 AM, Jens Rantil <jens.ran...@telavox.se> wrote:
> Out of curiosity, what was the reason for the 'join' command behaviour to change?
>
> Regards,
>
> Jens
>
> -----------------------------------------------------------
>
> > Date: Wed, 7 Sep 2011 18:12:40 -0600
> > From: Joseph Blomstedt <j...@basho.com>
> > To: riak-users Users <riak-users@lists.basho.com>
> > Subject: Riak Clustering Changes in 1.0
> > Message-ID: <CANvk2KRPpath-ZJaDuhZo0b+pEByn0MhxyJ-9LEB+b_12d=v...@mail.gmail.com>
> > Content-Type: text/plain; charset=ISO-8859-1
> >
> > Given that 1.0 prerelease packages are now available, I wanted to mention some changes to Riak's clustering capabilities in 1.0. In particular, there are some subtle semantic differences in the riak-admin commands. More complete docs will be updated in the near future, but I hope a quick email suffices for now.
> >
> > [nodeB/riak-admin join nodeA] is now strictly one-way. It joins nodeB to the cluster that nodeA is a member of. This is semantically different from pre-1.0 Riak, in which join essentially joined clusters together rather than joining a node to a cluster. As part of this change, the joining node (nodeB in this case) must be a singleton (1-node) cluster.
> >
> > In pre-1.0, leave and remove were essentially the same operation, with leave just being an alias for 'remove this-node'. This has changed. Leave and remove are now very different operations.
> >
> > [nodeB/riak-admin leave] is the only safe way to have a node leave the cluster, and it must be executed by the node that you want to remove. In this case, nodeB will start leaving the cluster, and will not leave the cluster until after it has handed off all its data. Even if nodeB is restarted (crashed/shutdown/whatever), it will remain in the leave state and continue handing off partitions until done. After handoff, it will leave the cluster and eventually shut down.
> >
> > [nodeA/riak-admin remove nodeB] immediately removes nodeB from the cluster, without handing off its data. All replicas held by nodeB are therefore lost and will need to be re-generated through read-repair. Use this command carefully. It's intended for nodes that are permanently unrecoverable and for which handoff therefore doesn't make sense. By the final 1.0 release, this command may be renamed "force-remove" just to make the distinction clear.
> >
> > There are now two new commands that provide additional insight into the cluster: [riak-admin member_status] and [riak-admin ring_status].
> >
> > Underneath, the clustering protocol has been mostly re-written. The new approach has the following advantages:
> >
> > 1. It is no longer necessary to wait on [riak-admin ringready] in between adding/removing nodes from the cluster, and adding/removing is also much more sound/graceful. Starting up 16 nodes and issuing [nodeX: riak-admin join node1] for X=1:16 should just work.
> >
> > 2. Data is first transferred to new partition owners before handing over partition ownership. This change fixes numerous bugs, such as 404s/not_founds during ownership changes. The Ring/Pending columns in [riak-admin member_status] visualize this at a high level, and the full transfer status in [riak-admin ring_status] provides additional insight.
> >
> > 3. All partition ownership decisions are now made by a single node in the cluster (the claimant). Any node can be the claimant, and the duty is automatically taken over if the previous claimant is removed from the cluster. [riak-admin member_status] will list the current claimant.
> >
> > 4. Handoff related to ownership changes can now occur under load; hinted handoff still only occurs when a vnode is inactive. This change allows a cluster to scale up/down under load, although this needs to be further benchmarked and tuned before 1.0.
> >
> > To support all of the above, a new limitation has been introduced. Cluster changes (member addition/removal, ring rebalance, etc.) can only occur when all nodes are up and reachable. [riak-admin ring_status] will complain when this is not the case. If a node is down, you must issue [riak-admin down <node>] to mark the node as down, and the remaining nodes will then proceed to converge as usual. Once the down node comes back online, it will automatically re-integrate into the cluster. However, there is nothing preventing client requests from being served by a down node before it re-integrates. Before issuing [down <node>], make sure to update your load balancers / connection pools to not include this node. Future releases of Riak may make offlining a node an automatic operation, but it's a user-initiated action in 1.0.
> >
> > -Joe
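The commands described above combine into a typical node-replacement sequence. A rough sketch follows (node names are placeholders again, and keep in mind that 'remove' may become 'force-remove' by the final 1.0 release):

    # nodeC has died and cannot be recovered. Take it out of your load
    # balancers / connection pools first, then mark it down so the
    # remaining nodes can converge without it:
    nodeA$ riak-admin down nodeC

    # Remove it permanently. Its replicas are NOT handed off and will be
    # re-generated through read-repair, so only do this for unrecoverable nodes:
    nodeA$ riak-admin remove nodeC

    # Bring in a fresh replacement node and watch the ownership transfers:
    nodeD$ riak-admin join nodeA
    nodeA$ riak-admin ring_status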