> *The summary is*: we believe virtual nodes are the way forward. We would
> like to add virtual nodes to Cassandra and we are asking for comments,
> criticism and collaboration!

I am very happy to see some momentum on this, and I would like to go
even further than what you propose. The main reasons why I do not
think simply adding vnodes and making random assignments is the best
end goal are:

(1) The probability of data loss increases linearly with cluster size.
(2) It does not take network topology into account.

What follows is mostly a half-finished long text that I have been
sitting on for a few months but not finished/posted. Unfortunately,
due to time constraints, I have not been able to go through everything
in detail, update it with current information, and specifically
address what you already said, so there will be overlap
with your original post. However given your E-Mail and the momentum in
this thread, I really wanted to post something rather than not. It
would be awesome if interested parties had a chance to read the
referenced CRUSH paper, and the deltas proposed below.

The goals are to address everything you already wanted in your post,
while also addressing:

(1) Probability of data loss
(2) Network topology awareness

The following text will first try to paint a picture of the goals that
I have in mind, and then go on to the actual proposed solution. The
proposal itself is still very short and lacking in detail, and there is
plenty of discussion and detail to fill in. I apologize, but again, I really
want to post something now that this is being brought up.

BEGIN un-polished text ("we" = "I"):

= CRUSHing Cassandra

Author: Peter Schuller <peter.schul...@infidyne.com>

This is a proposal for a significant re-design of some fundamentals of
Cassandra, aimed at addressing a number of current issues as well as
anticipating future issues. It is particularly aimed at large
clusters, but as a side-effect should improve the small cluster
experience as well.

== New terminology: Distribution factor

A Cassandra cluster is today said to have `N` nodes, and data is
replicated at a particular replication factor (`RF`). The placement of
replicas is such that all rows that have a certain node `N1` as their
primary replica are located on a specific set of `RF-1` other
nodes. In addition, `N1` holds secondary replicas of data for `RF-1`
other nodes. In total, it shares data with `2RF - 2` other nodes.

The number of nodes with whom a node shares data is the distribution
factor. In the case of Cassandra, `DF = 2RF - 2`.
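
To illustrate the arithmetic, here is a minimal sketch (in Python, not
Cassandra's actual code) of which nodes a given node shares data with
under ring-order placement; the node numbering and helper name are
purely illustrative:

    # Simplified sketch: with ring-order placement, a node shares data
    # with its RF-1 successors (which hold secondary replicas of its
    # primary data) and its RF-1 predecessors (whose primary data it
    # holds as secondary replicas), i.e. DF = 2*RF - 2.
    def sharing_set(node, n_nodes, rf):
        successors = {(node + i) % n_nodes for i in range(1, rf)}
        predecessors = {(node - i) % n_nodes for i in range(1, rf)}
        return successors | predecessors

    # Example: 10 nodes, RF = 3 -> node 0 shares data with {1, 2, 8, 9}.
    assert sharing_set(0, 10, 3) == {1, 2, 8, 9}
    assert len(sharing_set(0, 10, 3)) == 2 * 3 - 2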

== Goals

The goals this suggestion attempts to address include the following.

=== Goal: DF should not be tied to RF, nor N

`DF` is important for these reasons:

* The `DF` determines how many nodes are involved in re-constructing a
  lost node after a failure; the higher the `DF`, the smaller the
  performance impact a reconstruction has on each remaining node.
* The `DF` determines how significantly a node being down affects the
  read/write load on the other nodes.
* The `DF` affects the probability of multiple failures causing data
  loss, since one loses data if any `RF` nodes within a group of `DF`
  nodes all go down.

Having `DF` tied to `RF` like Cassandra does now has its problems. A
single node failure has a significant effect on the performance
characteristics of neighboring nodes (relative to the neighbors'
normal level of load).

On large data sets, a failed node needing reconstruction is a
significant event, as it

* Increases the load on neighbors just from going down.
* Further increases the load on neighbors as they have to stream data
  (adding I/O and cache thrashing).

This typically leads to the desire to throttle / rate limit
reconstruction, which lengthens the reconstruction window further, on
top of the fact that reconstruction was already naturally bottlenecked
on the neighbors.

The other extreme is to tie `DF` to `N`, such that the data contained
on one node has its secondary replicas spread out over the entire
ring. This is an unacceptable choice because the probability of data
loss from multiple failures increases linearly with the cluster size.

In other words, we want `DF` to be tied to neither `RF` nor
`N`. Rather, `DF` should be chosen as a trade-off between the effects
of `DF`:

* The higher the `DF`, the higher the probability of data loss in case
  of multiple failures.
* The higher the `DF`, the faster to reconstruct/replace a lost node.
* The higher the `DF`, the smaller the impact of a node failure on the
  performance demands placed on other nodes.

In making this determination, one must take into account that if a
larger `DF` makes reconstruction/replacement significantly faster,
that also decreases the time window in which multiple failures can
occur. Increasing `DF` thus does not *necessarily* increase the total
probability of data loss (for small values of `DF`).
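
To make the data loss trade-off concrete, here is a rough Monte Carlo
sketch in Python under deliberately simplistic assumptions (uniformly
random, simultaneous failures; a ring-style layout with `DF = 2RF - 2`
versus fully random per-bucket placement where `DF` approaches `N`).
The helper names and parameters are illustrative, and the numbers only
show the direction of the effect, not a rigorous model:

    import random
    from itertools import combinations

    def ring_groups(n, rf):
        # Replica groups under ring-order placement: each node plus its
        # RF-1 successors.
        return [frozenset((i + j) % n for j in range(rf)) for i in range(n)]

    def random_groups(n, rf, buckets_per_node=64):
        # Replica groups under fully random assignment of many small buckets.
        return [frozenset(random.sample(range(n), rf))
                for _ in range(n * buckets_per_node)]

    def p_data_loss(groups, n, rf, failures, trials=20000):
        group_set = set(groups)
        hits = 0
        for _ in range(trials):
            down = random.sample(range(n), failures)
            # Data is lost if all RF replicas of some group are down.
            if any(frozenset(c) in group_set for c in combinations(down, rf)):
                hits += 1
        return hits / trials

    n, rf, failures = 100, 3, 3
    print("ring   (DF = 2RF-2):", p_data_loss(ring_groups(n, rf), n, rf, failures))
    print("random (DF ~ N):    ", p_data_loss(random_groups(n, rf), n, rf, failures))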

=== Goal: Topologically aware redundancy

We maintain the goal of being topology aware for the purpose of
ensuring that we place replicas on "independent" nodes (for a
definition of "independent" that is up to the operator). This is
currently implemented in a limited fashion by the DC and rack
awareness offered by the `NetworkTopologyStrategy`.

We want to go further than that, and allow close to arbitrary
topologies to be modeled. For example, we may require, in a single
data center, that a piece of data is replicated to 3 distinct racks,
at least two of which must be within different cages in the data
center.
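
As a hypothetical sketch of how such a rule might be expressed and
checked (the Node fields and the rule encoding are illustrative, not
an existing Cassandra API):

    from dataclasses import dataclass

    @dataclass(frozen=True)
    class Node:
        name: str
        dc: str
        cage: str
        rack: str

    def satisfies_rule(replicas):
        # Require 3 distinct racks, spanning at least 2 distinct cages.
        racks = {(n.cage, n.rack) for n in replicas}
        cages = {n.cage for n in replicas}
        return len(racks) >= 3 and len(cages) >= 2

    replicas = [Node("a", "dc1", "cage1", "r1"),
                Node("b", "dc1", "cage1", "r2"),
                Node("c", "dc1", "cage2", "r7")]
    assert satisfies_rule(replicas)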

=== Goal: Topologically aware locality

Somewhat in conflict with the previous goal, there is also a desire to
keep replicas "close". In a sufficiently large cluster, it becomes
cost-prohibitive to pipe all inter-replica traffic through a fast
backbone network.

Ideally, one wants to balance the two. For example, one could arrange
for replicas to be distributed across different racks in a cage in a
data center, limiting cross-replica network bandwidth requirements to
the local switch in said cage, while maintaining independent power
supply/networking to racks for redundancy.

With gigabit+ speeds from a node, it is very easy to saturate
networking infrastructure once you start having hundreds of nodes or
more.

=== Goal: Topologically aware routing

Once a cluster grows sufficiently large, even with topologically aware
locality, you eventually want to avoid the
everybody-talks-to-everybody situation of current Cassandra, for
network efficiency reasons. This includes the clients.

Topologically aware routing would enable smarter selection of a
co-ordinator node (consider Cassandra-aware "edge" router nodes to
which application clients connect - e.g. one per data center, or per
part of a DC) and, given sufficient site-specific integration, would
allow traffic to that co-ordinator to be routed efficiently at the
networking level.

A request could go all the way from a client, to a
co-ordinator/router, to responsible replicas, and back - never
crossing a backbone switch/router.

=== Goal: Agility

We want adding, removing or re-balancing nodes to be a trivial
operation from the operator's perspective, and to have a minimal impact
on the performance of the cluster. We want to enable adding individual
hosts, groups of hosts, or entire racks of hosts easily and without
significant risk of mistakes. There should be no need to carefully
plan and co-ordinate upgrades due to performance reasons.

The operator should have the flexibility of quickly adding capacity as
demands change (which they will, even with careful planning to
minimize it). There must not be artificial limitations like requiring
a doubling of the cluster in order to avoid very painful and
performance degrading operations. The operator should never be in a
position of having to weigh the trade-off between adding 30% capacity
now at great cost in operator effort and risk, vs. waiting another two
weeks for that promised delivery of servers, for example.

Consider a large deployment with multiple clusters. It is desirable to
do capacity projections for each cluster individually, and to have a
constantly re-populated pool of servers used for expansion and/or
replacement of nodes. Adding capacity to existing clusters could then
be done even on a daily basis, in an effort to keep average
utilization as smooth as possible. A varying average utilization is
one more factor that contributes to the need to over-provision: in a
1000 node cluster expected to double in size over a year, for example,
it is efficient to add capacity incrementally rather than, say, add
500 nodes on day one and 500 at the half-way point; if nothing else,
the latter implies an average over-provisioning of 250 nodes during
the year *just* due to lack of agility in expansion, on top of other
factors.

=== Goal: Heterogeneous nodes

We want it to be realistic and minimally impactful to have nodes of
different performance characteristics in the cluster.

=== Goal: Operator simplicity

The current situation is unacceptably complex. Maintaining a multi-DC
cluster with rack awareness requires significant planning, care and
often automation on the part of the operator to maintain an
appropriately laid out ring without causing rack hotspots or other
issues. Auditing correct ring layout is essentially impossible for a
human with the out-of-the-box tools, and currently requires
site-specific scripting (e.g., `nodetool ring` is close to useless
without significant mental or automated post-processing).

We want to significantly increase operator simplicity, minimizing
both the risk of operator error and the effort required of the
operator.

== Proposed solution

The proposal draws inspiration from the CRUSH paper:

   http://www.ssrc.ucsc.edu/Papers/weil-sc06.pdf

Currently, because this is an unfinished document, that paper is
pre-requisite reading.

Here are the deltas proposed:

=== Buckets and pre-calculation

The key on which we would apply the crush algorithm would not be the
Cassandra row key. Instead the Cassandra keyspace is partitioned (by
hashing) into a large, fixed number of buckets. The bucket count must
be large enough to accommodate future cluster expansions, and to
accommodate the probabilistic effects of applying a CRUSH-like
algorithm, so that we achieve good balancing (see the sketch after the
list of reasons below).

The main reasons to use buckets are:

* It allows pre-calculation of the results of the CRUSH-like algorithm, which
  removes the CPU/balancing trade-offs from the equation that the CRUSH paper
  focuses on. We can go with the optimally balanced solution.
* A "bucket" is tantamount to a ring segment. Similarly to how
Cassandra already does it,
  we want to efficiently be able to transfer a single bucket without
having to traverse
  the entire data set or a node-global index of all data. Thus,
buckets directly relate
  to structuring data on disk in a way which is non-orthogonal to the
unit of data being
  moved around on topology changes.
* It helps the balancing/accurace trade-off, see below.
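
Here is a minimal sketch of the bucket idea; the names (`bucket_for`,
`crush_like_placement`) and the rendezvous-hashing stand-in are
assumptions made only to keep the sketch self-contained, not a real
API:

    import hashlib

    NUM_BUCKETS = 2 ** 16    # must leave room for future cluster expansion
    NODES = list(range(12))  # illustrative node ids
    RF = 3

    def crush_like_placement(bucket, nodes=NODES, rf=RF):
        # Stand-in for the real CRUSH-like selection: rendezvous hashing
        # on (bucket, node), purely for illustration.
        scored = sorted(nodes,
                        key=lambda n: hashlib.md5(f"{bucket}:{n}".encode()).digest())
        return scored[:rf]

    def bucket_for(row_key: bytes) -> int:
        digest = hashlib.md5(row_key).digest()
        return int.from_bytes(digest[:8], "big") % NUM_BUCKETS

    # Pre-computed once (and on topology changes), not per request; the
    # CPU cost of the placement algorithm therefore no longer matters at
    # read/write time.
    placement = {b: crush_like_placement(b) for b in range(NUM_BUCKETS)}

    def replicas_for(row_key: bytes):
        return placement[bucket_for(row_key)]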

=== Replica selection

In order to cater to the Distribution Factor (`DF`), we need to use
the primary replica, as opposed to the bucket id, as the key when
applying the CRUSH-like algorithm to find secondary replicas.
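
A sketch of that idea, reusing the same kind of rendezvous-hashing
stand-in (`crush_select` and the key formats are illustrative
assumptions): secondary replicas are chosen from a key derived from
the primary, so every bucket whose primary is node P draws its
secondaries from the same candidates, which is what keeps `DF` bounded
instead of growing towards `N`:

    import hashlib

    def crush_select(key, nodes, count):
        # Stand-in for a CRUSH-like pseudo-random selection.
        scored = sorted(nodes,
                        key=lambda n: hashlib.md5(f"{key}:{n}".encode()).digest())
        return scored[:count]

    def replicas_for_bucket(bucket, nodes, rf):
        # Primary comes from the bucket id; secondaries come from a key
        # derived from the primary, not from the bucket id.
        primary = crush_select(str(bucket), nodes, 1)[0]
        others = [n for n in nodes if n != primary]
        return [primary] + crush_select(f"primary:{primary}", others, rf - 1)

    print(replicas_for_bucket(42, list(range(12)), 3))

Exactly how the key is formed to hit a particular target `DF` is left
open here; the point is only that keying secondary selection on the
primary decouples `DF` from the number of buckets.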

=== Balancing accuracy vs. DF vs. unnecessary movement trade-off

When limiting replicas to be spread among only `DF` other nodes, you
introduce an artificial constraint on replica placement that causes
greater imbalance, due to the low number of nodes you are distributing
over (each node holds replicas for data whose primary is on `DF-1`
other nodes, and assuming `DF` is not huge, `DF-1` is a smallish
number, so the scope for balancing is limited).

An algorithm to get around this is to calculate the placement in a
deterministic fashion (such as in order of bucket id). If, when
assigning a secondary replica, the proposed node is already above its
quota, you reject the selection and keep going. This achieves balance,
but at the cost of additional data movement, because every time you
perform a rejection, you are causing the placement of a virtual bucket
to be affected by something *other* than the nodes on which it is
placed.
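
Here is an illustrative sketch of that rejection scheme (the helper
names, candidate ordering and quota value are assumptions, not a
finished algorithm):

    import hashlib

    def candidate_order(bucket, nodes):
        # Deterministic pseudo-random candidate order per bucket
        # (stand-in for the CRUSH-like selection).
        return sorted(nodes,
                      key=lambda n: hashlib.md5(f"{bucket}:{n}".encode()).digest())

    def place_with_quota(buckets, nodes, rf, quota):
        load = {n: 0 for n in nodes}
        placement = {}
        for bucket in sorted(buckets):               # deterministic order
            chosen = []
            for node in candidate_order(bucket, nodes):
                if load[node] >= quota:
                    continue                         # above quota: reject, keep going
                chosen.append(node)
                load[node] += 1
                if len(chosen) == rf - 1:            # secondary replicas only
                    break
            placement[bucket] = chosen
        return placement

    # 1024 buckets * 2 secondaries over 10 nodes: a quota of 210 keeps
    # every node within a few percent of the ~205 average.
    placement = place_with_quota(range(1024), list(range(10)), rf=3, quota=210)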

-- 
/ Peter Schuller (@scode, http://worldmodscode.wordpress.com)
