I like this idea. It feels like a good 80/20 solution -- 80% of the benefits for 20% of the effort. More like 5% of the effort, really. I can't even enumerate all the places full vnode support would touch, but an "active token range" concept would be relatively limited in scope.
Full vnodes feels a lot more like the counters quagmire, where Digg/Twitter worked on it for... 8? months, and then DataStax worked on it for about 6 months post-commit, and we're still finding the occasional bug there that dates back to 0.7. With the benefit of hindsight, as bad as maintaining that patchset out of tree was, committing it as early as we did was a mistake. We won't do that again. (On the bright side, git makes maintaining such a patchset easier now.)

On Mon, Mar 19, 2012 at 5:16 PM, Rick Branson <rbran...@datastax.com> wrote:

> I think if we could go back and rebuild Cassandra from scratch, vnodes
> would likely be implemented from the beginning. However, I'm concerned
> that implementing them now could be a big distraction from more
> productive uses of all of our time, and could introduce major stability
> issues into what is becoming a business-critical piece of infrastructure
> for many people. Instead of just complaints and pedantry, though, I'd
> like to offer a feasible alternative:
>
> Has there been consideration given to the idea of supporting a single
> token range per node?
>
> While not theoretically as capable as vnodes, this seems to me more
> practical: it would have a significantly lower impact on the codebase
> and provides a much clearer migration path. It also seems to solve a
> majority of the complaints regarding operational issues with Cassandra
> clusters.
>
> Each node would have a lower and an upper token, which would form a
> range that would be actively distributed via gossip. Read and
> replication requests would only be routed to a replica when the key of
> the operation matched the replica's token range in the gossip tables.
> Each node would locally store its own current active token range as
> well as a target token range it's "moving" towards.
>
> As a new node undergoes bootstrap, the bounds would be gradually
> expanded to allow it to handle requests for a wider range of the
> keyspace as it moves towards its target token range. This idea boils
> down to a move from hard cutovers to smoother operations, by gradually
> adjusting active token ranges over a period of time. It would apply to
> token change operations (nodetool 'move' and 'removetoken') as well.
>
> Failure during streaming could be recovered from at the bounds instead
> of restarting the whole process, as the active bounds would effectively
> track the progress of bootstrap & target token changes. Implicitly,
> these operations would be throttled to some degree. Node repair (AES)
> could also be modified along the same lines to provide a more gradual
> impact on the cluster, similar to the ideas given in CASSANDRA-3721.
>
> While this doesn't spread the load of these operations over the cluster
> as evenly as vnodes do, that could likely be worked around by performing
> concurrent (throttled) bootstrap & node repair (AES) operations. It does
> allow some kind of "active" load balancing, though clearly not as
> flexible or as useful as vnodes -- but you should be using
> RandomPartitioner or sort-of-randomized keys with OPP anyway, right? ;)
>
> As a side note: vnodes fail to address node-based limitations that seem
> to me to cause a substantial portion of operational issues, such as the
> impact of node restarts / upgrades, and GC- and compaction-induced
> latency.
> I think some progress could be made here by allowing a "pack" of
> independent Cassandra nodes to be run on a single host, somewhat (but
> nowhere near entirely) similar to the pre-fork model used by some
> UNIX-based servers.
>
> Input?
>
> --
> Rick Branson
> DataStax

--
Jonathan Ellis
Project Chair, Apache Cassandra
co-founder of DataStax, the source for professional Cassandra support
http://www.datastax.com
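To make the mechanics of the quoted proposal concrete, here is a minimal sketch in Java of the per-node state Rick describes: a current active range, announced via gossip, that is gradually widened toward a target range as streaming progresses. All names here (ActiveTokenRange, expandToward, the step parameter) are hypothetical illustrations of the email's description, not code from Cassandra or any actual patch, and ring wraparound during expansion is glossed over.

    import java.math.BigInteger;

    // Hypothetical sketch of the "active token range" idea; not real
    // Cassandra code.
    public final class ActiveTokenRange {
        private BigInteger lower;   // active lower bound (exclusive), gossiped
        private BigInteger upper;   // active upper bound (inclusive), gossiped
        private final BigInteger targetLower;  // range we're "moving" towards
        private final BigInteger targetUpper;

        public ActiveTokenRange(BigInteger lower, BigInteger upper,
                                BigInteger targetLower, BigInteger targetUpper) {
            this.lower = lower;
            this.upper = upper;
            this.targetLower = targetLower;
            this.targetUpper = targetUpper;
        }

        // Should this replica serve reads/replication for the given token?
        // Handles the case where the range wraps around the ring (lower >= upper).
        public boolean contains(BigInteger token) {
            if (lower.compareTo(upper) < 0)
                return token.compareTo(lower) > 0 && token.compareTo(upper) <= 0;
            return token.compareTo(lower) > 0 || token.compareTo(upper) <= 0;
        }

        // Widen the active bounds one step toward the target as streaming
        // makes progress, after which the node would re-announce them via
        // gossip. Ring wraparound during expansion is ignored for brevity.
        public void expandToward(BigInteger step) {
            if (lower.compareTo(targetLower) > 0)
                lower = lower.subtract(step).max(targetLower);
            if (upper.compareTo(targetUpper) < 0)
                upper = upper.add(step).min(targetUpper);
        }
    }

Under this sketch, a coordinator would consult the gossiped ranges rather than single tokens when computing replicas, routing a request to a node only when contains() is true for the key's token; a bootstrapping node would call expandToward() after each successfully streamed slice until its active range matches its target, which is also what gives the recover-at-the-bounds behavior described above.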