On Tue, Mar 20, 2012 at 8:39 AM, Jonathan Ellis <jbel...@gmail.com> wrote:
> I like this idea. It feels like a good 80/20 solution -- 80% of the
> benefits, 20% of the effort. More like 5% of the effort. I can't even
> enumerate all the places full vnode support would change, but an
> "active token range" concept would be relatively limited in scope.
It only addresses 1 of Sam's original 5 points, so I wouldn't call it
an "80% solution".

> Full vnodes feels a lot more like the counters quagmire, where
> Digg/Twitter worked on it for... 8? months, and then DataStax worked
> on it for about 6 months post-commit, and we're still finding the
> occasional bug-since-0.7 there. With the benefit of hindsight, as bad
> as maintaining that patchset was out of tree, committing it as early
> as we did was a mistake. We won't do that again. (On the bright side,
> git makes maintaining such a patchset easier now.)

And yet counters have become a very important feature for Cassandra;
we're better off with them than without.

I think there were a number of problems with how counters went down
that could be avoided here. For one, we can take a phased, incremental
approach, rather than waiting 8 months to drop a large patchset.

> On Mon, Mar 19, 2012 at 5:16 PM, Rick Branson <rbran...@datastax.com> wrote:
>> I think if we could go back and rebuild Cassandra from scratch, vnodes
>> would likely be implemented from the beginning. However, I'm concerned
>> that implementing them now could be a big distraction from more
>> productive uses of all of our time and introduce major potential
>> stability issues into what is becoming a business-critical piece of
>> infrastructure for many people. However, instead of just complaining
>> and pedantry, I'd like to offer a feasible alternative:
>>
>> Has there been consideration given to the idea of supporting a single
>> token range for a node?
>>
>> While not theoretically as capable as vnodes, it seems to me to be
>> more practical, as it would have a significantly lower impact on the
>> codebase and provides a much clearer migration path. It also seems to
>> solve a majority of complaints regarding operational issues with
>> Cassandra clusters.
>>
>> Each node would have a lower and an upper token, which would form a
>> range that would be actively distributed via gossip. Read and
>> replication requests would only be routed to a replica when the key of
>> these operations matched the replica's token range in the gossip
>> tables. Each node would locally store its own current active token
>> range as well as a target token range it's "moving" towards.
>>
>> As a new node undergoes bootstrap, the bounds would be gradually
>> expanded to allow it to handle requests for a wider range of the
>> keyspace as it moves towards its target token range. This idea boils
>> down to a move from hard cutovers to smoother operations by gradually
>> adjusting active token ranges over a period of time. It would apply to
>> token change operations (nodetool 'move' and 'removetoken') as well.
>>
>> Failure during streaming could be recovered at the bounds instead of
>> restarting the whole process, as the active bounds would effectively
>> track the progress of bootstrap & target token changes. Implicitly,
>> these operations would be throttled to some degree. Node repair (AES)
>> could also be modified using the same overall ideas to provide a more
>> gradual impact on the cluster overall, similar to the ideas given in
>> CASSANDRA-3721.
>>
>> While this doesn't spread the load over the cluster for these
>> operations as evenly as vnodes would, this is likely an issue that
>> could be worked around by performing concurrent (throttled) bootstrap
>> & node repair (AES) operations.
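(Breaking into the quote here just to make the routing mechanism
concrete. Below is a very rough sketch of the sort of check a
coordinator might make against a replica's gossiped active range, and
of how bootstrap could nudge that range toward its target. All the
names are made up for illustration -- none of this is from an actual
patch -- and wrap-around in the expansion step is glossed over.)

    import java.math.BigInteger;

    // Hypothetical sketch of a node's "active" token range as it might be
    // carried in gossip.  Tokens are BigIntegers, as with RandomPartitioner.
    final class ActiveTokenRange
    {
        private final BigInteger lower; // exclusive, as with ring ranges
        private final BigInteger upper; // inclusive

        ActiveTokenRange(BigInteger lower, BigInteger upper)
        {
            this.lower = lower;
            this.upper = upper;
        }

        // True if the token falls inside the currently active range; a
        // coordinator would skip replicas whose range does not (yet)
        // contain the key's token.
        boolean contains(BigInteger token)
        {
            if (lower.compareTo(upper) < 0) // ordinary, non-wrapping range
                return token.compareTo(lower) > 0 && token.compareTo(upper) <= 0;
            // range wraps around the end of the ring
            return token.compareTo(lower) > 0 || token.compareTo(upper) <= 0;
        }

        // During bootstrap/move, widen the active range one step toward the
        // target instead of cutting over all at once (non-wrapping case only).
        ActiveTokenRange expandToward(ActiveTokenRange target, BigInteger step)
        {
            return new ActiveTokenRange(lower.subtract(step).max(target.lower),
                                        upper.add(step).min(target.upper));
        }
    }

The bootstrapping node would periodically re-gossip the result of
something like expandToward() until its active range matched its target
range, and a failed stream would only need to restart from the current
bounds rather than from scratch.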
>> It does allow some kind of "active" load balancing, but clearly this
>> is not as flexible or as useful as vnodes -- but you should be using
>> RandomPartitioner or sort-of-randomized keys with OPP, right? ;)
>>
>> As a side note: vnodes fail to provide solutions to node-based
>> limitations that seem to me to cause a substantial portion of
>> operational issues, such as the impact of node restarts / upgrades and
>> GC- and compaction-induced latency. I think some progress could be
>> made here by allowing a "pack" of independent Cassandra nodes to be
>> run on a single host; somewhat (but nowhere near entirely) similar to
>> a pre-fork model used by some UNIX-based servers.
>>
>> Input?
>>
>> --
>> Rick Branson
>> DataStax
>
>
> --
> Jonathan Ellis
> Project Chair, Apache Cassandra
> co-founder of DataStax, the source for professional Cassandra support
> http://www.datastax.com

--
Eric Evans
Acunu | http://www.acunu.com | @acunu