Hi Scott,

First, let me say that this email was amazing - I'm always appreciative of the time that anyone puts into mailing list replies, especially ones as thorough, well-thought-out, and well-articulated as this one. I'm a firm believer that these types of replies reflect a strong and durable open-source community. You, sir, are a bad ass :) Thanks so much!
As for the '5 9s' comment, I apologize for even writing that - it threw everyone off. It was a shorthand way of saying "this data store is so critical to the product that if it ever goes down entirely (as it did for one user with 4 nodes, all at the same time), then we're screwed." I was hoping to trigger the "hrm - what have we done ourselves to reach that level of availability that wasn't easily captured in the documentation?" train of thought. It proved to be a red herring, however, so I apologize for even bringing it up.

Thanks *very* much for the reply. I'll be sure to follow up with the list as I come across any particular issues, and I'll also report my own findings in the interest of (hopefully) being beneficial to anyone in the future.

Cheers,
Les

On Wed, Jun 22, 2011 at 4:58 PM, C. Scott Andreas <csco...@urbanairship.com> wrote:

> Hi Les,
>
> I wanted to offer a couple of thoughts on where to start and strategies for approaching development and deployment with reliability in mind.
>
> One way that we've found to think more productively about the reliability of our data tier is to shift our focus away from a concept of "uptime" or "*x* nines" toward one of "error rates." Ryan mentioned that "it depends," and while brief, this is actually a very correct comment. Perhaps I can help elaborate.
>
> Failures in systems distributed across multiple machines in multiple datacenters can rarely be described in terms of binary uptime guarantees (e.g., either everything is up or everything is down). Instead, certain nodes may be unavailable at certain times, but given appropriate read and write parameters (and their implicit tradeoffs), these service interruptions may remain transparent.
>
> Cassandra provides a variety of tools to allow you to tune these, two of the most important of which are the consistency level for reads and writes and your replication factor. I'm sure you're familiar with these, but I mention them because thinking hard about the tradeoffs you're willing to make in terms of consistency and replication may heavily impact your operational experience if availability is of utmost importance.
>
> Of course, the single-node operational story is very important as well. Ryan's "it depends" comment here takes on painful significance for me, as we've found that the manner in which read and write loads vary, their duration, and their intensity can produce very different operational profiles and failure modes. If relaxed consistency is acceptable for your reads and writes, you'll likely find querying with CL.ONE to be more "available" than QUORUM or ALL, at the cost of reduced consistency. Similarly, if it is economical for you to provision extra nodes for a higher replication factor, you will increase your ability to continue reading and writing in the event of single- or multiple-node failures.
>
> One of the prime challenges we've faced is reducing the frequency and intensity of full garbage collections in the JVM, which tend to result in single-node unavailability. Thanks to help from Jonathan Ellis and Peter Schuller (along with a fair amount of elbow grease on our part), we've worked through several of these issues and have arrived at a steady state that leaves the ring happy even under load. We've not found GC tuning to bring night-and-day differences beyond resolving the stop-the-world collections, but the difference is noticeable.
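To make the consistency/replication tradeoff described above concrete, here is a rough sketch using a Thrift-era Python client (pycassa); the keyspace "app", column family "users", server list, row key, and an assumed replication factor of 3 are all illustrative, not details from this thread:

    # Sketch only: per-request consistency levels with pycassa.
    # Assumes a keyspace "app" created with replication_factor=3 and a
    # column family "users"; hosts and keys are placeholders.
    from pycassa.pool import ConnectionPool
    from pycassa.columnfamily import ColumnFamily
    from pycassa import ConsistencyLevel

    pool = ConnectionPool('app', server_list=['10.0.0.1:9160', '10.0.0.2:9160'])
    users = ColumnFamily(pool, 'users')

    # QUORUM write: 2 of the 3 replicas must acknowledge, so the write
    # succeeds even with a single replica down.
    users.insert('user:42', {'email': 'les@example.com'},
                 write_consistency_level=ConsistencyLevel.QUORUM)

    # ONE read: any single replica may answer, favoring availability
    # (the read can succeed with two of three replicas down) at the
    # cost of possibly returning stale data.
    row = users.get('user:42', read_consistency_level=ConsistencyLevel.ONE)

With RF=3, pairing QUORUM writes with QUORUM reads keeps reads consistent while tolerating one down replica; dropping either side to ONE buys availability at the cost of consistency.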
> Occasionally, these GC issues will result from Cassandra's own behavior; documented APIs such as querying for the count of all columns associated with a key will materialize the row across all nodes being queried. Once, when issuing a "count" query at CL.QUORUM for a key that had around 300k columns, we knocked three nodes out of our ring by triggering a stop-the-world collection that lasted about 30 seconds, so watch out for things like that.
>
> Some of the other tuning knobs available to you involve tradeoffs such as when to flush memtables or trigger compactions, both of which are somewhat intensive operations that can strain a cluster under heavy read or write load, but which are equally necessary for the cluster to remain in operation. If you find yourself pushing hard against these tradeoffs and attempting to navigate a path between icebergs, it's very likely that the best answer to the problem is "more, or more powerful, hardware."
>
> But a lot of this is tacit knowledge, which often comes through a bit of pain but is hopefully operationally transparent to your users -- the kind of thing you discover once the system is live and your monitoring is providing continuous feedback about the ring's health. This is where Sasha's point becomes so critical -- having advanced early-warning systems in place, watching monitoring and graphs closely even when everything's fine, and beginning to understand how the ring *likes* to operate and what it tends to do will give you a huge leg up on reliability and allow you to react to issues in the ring before they have operational impact.
>
> You mention that you've been building HA systems for a long time -- indeed, far longer than I have, so I'm sure you're also aware that good, solid "up/down" binaries are hard to come by, that none of this is easy, and that while some pointers are available (the defaults are actually quite good), it's essentially impossible to offer "the best production defaults," because they vary wildly based on your hardware, ring configuration, read/write load, and query patterns.
>
> To that end, you might find it more productive to begin with the defaults as you develop your system and let the ring tell you how it's feeling as you begin load testing. Once you have stressed it to the point of failure, you'll see how it failed and either be able to isolate the cause and begin planning to handle that failure mode, or better yet, understand your maximum capacity limits given your current hardware and fire off a purchase order the second you see spikes nearing 80% of the total measured capacity in production (or apply the lessons you've learned in capacity planning as appropriate, of course).
>
> Cassandra's a great system, but you may find that it requires a fair amount of active operational involvement and monitoring -- like any distributed system -- to maintain in a highly reliable fashion. Each of those nines implies extra time and operational cost, hopefully within the boundaries of the revenue stream the system is expected to support.
>
> Pardon the long e-mail and for waxing a bit philosophical. I hope this provides some food for thought.
>
> - Scott
>
> ---
>
> C. Scott Andreas
> Engineer, Urban Airship, Inc.
> http://www.urbanairship.com
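Related to the 300k-column count example above, here is a rough sketch of one way to keep such a count from materializing the whole row at once, again assuming pycassa; the column family "events", the row key, and the buffer size are illustrative:

    # Sketch only: count columns in a very wide row by streaming it in pages
    # with ColumnFamily.xget(), rather than issuing a single count that
    # forces each replica to materialize the entire row in memory.
    from pycassa.pool import ConnectionPool
    from pycassa.columnfamily import ColumnFamily
    from pycassa import ConsistencyLevel

    pool = ConnectionPool('app', server_list=['10.0.0.1:9160'])
    events = ColumnFamily(pool, 'events')
    events.read_consistency_level = ConsistencyLevel.ONE  # favor availability

    total = 0
    # xget() is a generator that pages over the row, buffer_size columns
    # per request, so no single query touches all ~300k columns at once.
    for _name, _value in events.xget('user:42', buffer_size=1024):
        total += 1

    print(total)

The count happens client-side and costs many small requests instead of one huge one, trading some latency for not pushing replicas into a stop-the-world collection.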