> *The summary is*: we believe virtual nodes are the way forward. We would
> like to add virtual nodes to Cassandra and we are asking for comments,
> criticism and collaboration!
I am very happy to see some momentum on this, and I would like to go even further than what you propose. The main reasons why I do not think simply adding vnodes and making random assignments is the best end goal are:

(1) The probability of data loss increases linearly with cluster size.
(2) It does not take network topology into account.

What follows is mostly a half-finished long text that I have been sitting on for a few months but never finished/posted. Unfortunately I do not have the possibility (due to time constraints) to go through everything in detail, update it with current information and specifically address what you already said, so there will be overlap with your original post. However, given your e-mail and the momentum in this thread, I really wanted to post something rather than nothing.

It would be awesome if interested parties had a chance to read the referenced CRUSH paper and the deltas proposed below. The goals are to address everything you already wanted in your post, while also addressing:

(1) Probability of data loss
(2) Network topology awareness

The following text first tries to paint a picture of the goals that I have in mind, and then goes on to the actual proposed solution. The proposition is very short and undetailed at this point and there is plenty of discussion and detail to fill in. I apologize, but again, I really want to post something now that this is being brought up.

BEGIN un-polished text ("we" = "I"):

= CRUSHing Cassandra

Author: Peter Schuller <peter.schul...@infidyne.com>

This is a proposal for a significant re-design of some fundamentals of Cassandra, aimed at addressing a number of current issues as well as anticipating future issues. It is particularly aimed at large clusters, but as a side-effect should improve the small cluster experience as well.

== New terminology: Distribution factor

A Cassandra cluster is today said to have `N` nodes, and data is replicated at a particular replication factor (`RF`). The placement of replicas is such that all rows that have a certain node `N1` as their primary replica are located on a specific set of `RF-1` other nodes. In addition, `N1` holds secondary replicas of data for `RF-1` other nodes. In total, it shares data with `2RF - 2` other nodes.

The number of nodes with which a node shares data is the distribution factor. In the case of Cassandra, `DF = 2RF - 2` (a small sketch of this placement is included below).

== Goals

The goals this suggestion attempts to help solve include the following.

=== Goal: DF should not be tied to RF, nor N

`DF` is important for these reasons:

* The `DF` determines how many nodes are involved in re-constructing a lost node after failure; the higher the `DF`, the less of a performance impact a reconstruction has on the remaining nodes.
* The `DF` determines how significant the effect of a node being down is on the read/write load of other nodes.
* The `DF` affects the probability of multiple failures causing data loss, since one loses data if any `RF` nodes within a group of `DF` nodes all go down.

Having `DF` tied to `RF` the way Cassandra does now has its problems. A single node failure has a significant effect on the performance characteristics of neighboring nodes (relative to the normal level of load on those neighbors). On large data sets, a failed node needing reconstruction is a significant event, as it:

* Increases the load on neighbors just from the node going down.
* Further increases the load on neighbors as they have to stream data (adding I/O and cache thrashing).
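To make the `DF = 2RF - 2` relationship and the data-loss exposure concrete, here is a minimal back-of-the-envelope sketch. This is purely illustrative Python, not Cassandra code; the node counts and `RF` values are made up, and it assumes today's SimpleStrategy-style placement of replicas on consecutive ring neighbors.

    from math import comb

    def replica_groups(n_nodes, rf):
        """SimpleStrategy-style placement: each node's range is replicated to
        itself and the next rf - 1 nodes on the ring."""
        return [set((i + j) % n_nodes for j in range(rf)) for i in range(n_nodes)]

    def distribution_factor(n_nodes, rf, node=0):
        """Number of *other* nodes that share at least one replica group with `node`."""
        sharing = set()
        for group in replica_groups(n_nodes, rf):
            if node in group:
                sharing |= group
        return len(sharing - {node})

    print(distribution_factor(n_nodes=12, rf=3))  # -> 4, i.e. 2*RF - 2

    # Data is lost if all RF nodes within some group of DF nodes go down.
    # As DF grows towards N (e.g. fully random per-vnode placement), the number
    # of RF-sized failure combinations a node's data is exposed to grows rapidly:
    rf = 3
    for df in (2 * rf - 2, 11, 99):   # DF today; DF ~ N - 1 for N = 12 and N = 100
        print(df, comb(df, rf))       # -> (4, 4), (11, 165), (99, 156849)

The point is only the shape of the numbers: with `DF` fixed at `2RF - 2` the exposure stays constant as the cluster grows, whereas with `DF` tied to `N` it grows with cluster size.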
This increased load typically leads to the desire to throttle / rate limit reconstruction, which widens the reconstruction window on top of the fact that reconstruction was already naturally bottlenecked on the neighbors.

The other extreme is to tie `DF` to `N`, such that the data contained on one node has its secondary replicas spread out over the entire ring. This is an unacceptable choice because the probability of multiple failures, and thus of data loss, increases linearly with the cluster size.

In other words, we want `DF` to be tied to neither `RF` nor `N`. Rather, `DF` should be chosen as a trade-off between its effects:

* The higher the `DF`, the higher the probability of data loss in case of multiple failures.
* The higher the `DF`, the faster a lost node can be reconstructed/replaced.
* The higher the `DF`, the less impact a node failure has on the performance requirements of other nodes.

In making this determination, one must take into account that if a larger `DF` makes reconstruction/replacement significantly faster, that also decreases the time window in which multiple failures can occur. Increasing `DF` does thus not *necessarily* increase the total probability of data loss (for small values of `DF`).

=== Goal: Topologically aware redundancy

We maintain the goal of being topology aware for the purpose of ensuring that we place replicas on "independent" nodes (for a definition of "independent" that is up to the operator). This is currently implemented in a limited fashion by the DC and rack awareness offered by the `NetworkTopologyStrategy`.

We want to go further than that, and allow close to arbitrary topologies to be modeled. For example, we may require, in a single data center, that a piece of data is replicated to 3 distinct racks, at least two of which must be within different cages in the data center.

=== Goal: Topologically aware locality

Somewhat in conflict with the previous goal, there is also a desire to keep replicas "close". In a sufficiently large cluster, it becomes cost-prohibitive to pipe all inter-replica traffic through a fast backbone network. Ideally, one wants to balance the two.

For example, one could arrange for replicas to be distributed across different racks in a cage in a data center, limiting cross-replica network bandwidth requirements to the local switch in said cage, while maintaining independent power supply/networking to racks for redundancy. With gigabit+ speeds from a node, it is very easy to saturate networking infrastructure once you start having hundreds of nodes or more.

=== Goal: Topologically aware routing

Once a cluster grows sufficiently large, even with topologically aware locality, you eventually want to avoid the everybody-talks-to-everybody situation of current Cassandra, for network efficiency reasons. This includes the clients.

Topologically aware routing would enable selecting a co-ordinator node more intelligently (consider "edge" router nodes that are Cassandra aware, to which application clients connect - e.g. you might have one such per data center, or per part of a DC), and, provided sufficient site-specific integration, allow traffic to that co-ordinator to be routed efficiently at the networking level. A request could go all the way from a client, to a co-ordinator/router, to the responsible replicas, and back - never crossing a backbone switch/router.
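As a purely hypothetical illustration of what "more intelligent" co-ordinator selection could look like, here is a small sketch. The path representation and function names are invented for this example; nothing here is an existing Cassandra API.

    # Pick a co-ordinator "close" to both the client and the replicas, where
    # closeness is the length of the shared topology prefix, e.g.
    # ("dc1", "cage2", "rack5"). Purely illustrative.

    def shared_prefix_len(a, b):
        n = 0
        for x, y in zip(a, b):
            if x != y:
                break
            n += 1
        return n

    def pick_coordinator(client_path, candidate_paths, replica_paths):
        """Prefer the candidate whose *worst* request leg (client->coordinator or
        coordinator->replica) stays lowest in the topology."""
        def worst_leg(candidate):
            legs = [client_path] + list(replica_paths)
            return min(shared_prefix_len(candidate, leg) for leg in legs)
        return max(candidate_paths, key=worst_leg)

    client = ("dc1", "cage2", "rack5")
    replicas = [("dc1", "cage2", "rack1"), ("dc1", "cage2", "rack3"), ("dc1", "cage2", "rack7")]
    coordinators = [("dc1", "cage1", "rack2"), ("dc1", "cage2", "rack3"), ("dc2", "cage1", "rack1")]
    print(pick_coordinator(client, coordinators, replicas))  # -> ('dc1', 'cage2', 'rack3')

The point is that, given a modeled topology, the co-ordinator can be chosen so that no leg of the request has to cross higher network levels than necessary.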
=== Goal: Agility

We want adding, removing or re-balancing nodes to be a trivial operation from the operator's perspective, and to have a minimal impact on the performance of the cluster.

We want to enable adding individual hosts, groups of hosts, or entire racks of hosts easily and without significant risk of mistakes. There should be no need to carefully plan and co-ordinate upgrades for performance reasons. The operator should have the flexibility to quickly add capacity as demands change (which they will, even with careful planning to minimize it).

There must not be artificial limitations like requiring a doubling of the cluster in order to avoid very painful and performance-degrading operations. The operator should never be in a position of having to weigh the trade-off between adding 30% capacity now at great cost in operator effort and risk, vs. waiting another two weeks for that promised delivery of servers, for example.

Consider a large deployment with multiple clusters. It is desirable to capacity-project each cluster individually, and have a constantly re-populated pool of servers that are used for expansion and/or replacement of nodes. Adding capacity to existing clusters could then be done even on a daily basis, in an effort to keep average utilization as smooth as possible. A varying average utilization is one more factor that contributes to the need to over-provision: in a 1000 node cluster expected to double in size in a year, for example, it is efficient to incrementally add capacity rather than, say, add 500 nodes on day one and 500 at the half-way point; if nothing else, doing the latter would imply a 250 node average over-provisioning during the year *just* due to lack of agility in expansion, on top of other factors.

=== Goal: Heterogeneous nodes

We want it to be realistic and minimally impactful to have nodes of different performance characteristics in the cluster.

=== Goal: Operator simplicity

The current situation is unacceptably complex. Maintaining a multi-DC cluster with rack awareness requires significant planning, care and often automation on the part of the operator to maintain an appropriately laid out ring without causing rack hotspots or other issues. Auditing correct ring layout is essentially impossible for a human with the out-of-the-box tools, and currently requires site-specific scripting (e.g., `nodetool ring` is close to useless without significant mental or automated post-processing).

We want to significantly increase operator simplicity, minimizing both the risk of operator error and the effort required by an operator.

== Proposed solution

The proposal draws inspiration from the CRUSH paper:

http://www.ssrc.ucsc.edu/Papers/weil-sc06.pdf

Currently, because this is an unfinished document, that paper is pre-requisite reading. Here are the deltas proposed:

=== Buckets and pre-calculation

The key on which we would apply the CRUSH algorithm would not be the Cassandra row key. Instead, the Cassandra keyspace is partitioned (by hashing) into N buckets. N must be large enough to accommodate future cluster expansions, and to accommodate probabilistic effects when applying a CRUSH-like algorithm so that we achieve balancing.

The main reasons to use buckets are:

* It allows pre-calculation of the results of the CRUSH-like algorithm, which removes the CPU/balancing trade-off that the CRUSH paper focuses on from the equation. We can go with the optimally balanced solution.
* A "bucket" is tantamount to a ring segment. Similarly to how Cassandra already does it, we want to be able to efficiently transfer a single bucket without having to traverse the entire data set or a node-global index of all data. Thus, buckets directly relate to structuring data on disk in a way which is non-orthogonal to the unit of data being moved around on topology changes.
* It helps the balancing/accuracy trade-off, see below.
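As a minimal sketch of the bucket mapping (the hash function, bucket count and table layout here are assumptions made for the example, not part of the proposal):

    import hashlib

    NUM_BUCKETS = 2 ** 16  # illustrative; must be sized for future growth

    def bucket_for_key(row_key: bytes) -> int:
        """Hash the row key into a fixed bucket id (the unit of placement)."""
        digest = hashlib.md5(row_key).digest()
        return int.from_bytes(digest[:8], "big") % NUM_BUCKETS

    # bucket_to_replicas would be the pre-calculated output of the CRUSH-like
    # placement run over all NUM_BUCKETS buckets, e.g. a list mapping
    # bucket id -> ordered list of replica endpoints. A lookup at request time
    # then reduces to a hash plus a table lookup:
    def replicas_for_key(row_key: bytes, bucket_to_replicas):
        return bucket_to_replicas[bucket_for_key(row_key)]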
=== Replica selection

In order to cater to the distribution factor, we need to use the primary replica, as opposed to the bucket id, as the key when applying the CRUSH-like algorithm to find secondary replicas.

=== Balancing accuracy vs. DF vs. unnecessary movement trade-off

When limiting replicas to be among `DF` other nodes, you introduce an artificial constraint on replica placement that causes greater imbalance, due to the low number of nodes you are distributing over (each node has replicas for data otherwise primary on `DF-1` nodes, and assuming `DF` is not huge, `DF-1` is a smallish number, so balancing will be limited).

An algorithm to get around this is to calculate the placement in a deterministic fashion (such as in order) with respect to bucket id. If, when assigning a secondary replica, the proposed replica is already above quota, you reject the selection and keep going. This achieves balance, but at the cost of additional data movement, because every time you perform a rejection you are causing the placement of a virtual bucket to be affected by something *other* than the nodes on which it is placed.
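A rough sketch of this rejection scheme, under the assumption that `candidates_for` stands in for whatever deterministic, CRUSH-like candidate ordering is used within a primary's DF group, and that the quota is simplified to a fixed per-node bucket count:

    def assign_secondaries(buckets, rf, quota, candidates_for):
        """Greedy assignment of secondary replicas with quota-based rejection.

        buckets: iterable of (bucket_id, primary_node), processed in a
            deterministic (e.g. bucket id) order.
        candidates_for(bucket_id, primary_node): yields candidate nodes in a
            deterministic, CRUSH-like order (the primary's DF group).
        """
        load = {}       # node -> number of secondary buckets assigned so far
        placement = {}  # bucket_id -> chosen secondary nodes
        for bucket_id, primary in buckets:
            chosen = []
            for node in candidates_for(bucket_id, primary):
                if node == primary or node in chosen:
                    continue
                if load.get(node, 0) >= quota:
                    # Over quota: reject this candidate and keep going. This is
                    # what introduces the extra data movement described above,
                    # since the outcome now depends on other buckets' placements.
                    continue
                chosen.append(node)
                load[node] = load.get(node, 0) + 1
                if len(chosen) == rf - 1:
                    break
            placement[bucket_id] = chosen
        return placement

--
/ Peter Schuller (@scode, http://worldmodscode.wordpress.com)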