Hi Les, I wanted to offer a couple of thoughts on where to start, along with some strategies for approaching development and deployment with reliability in mind.
One way we've found to think more productively about the reliability of our data tier is to shift our focus away from "uptime" or "x nines" and toward error rates. Ryan mentioned that "it depends," and while brief, that's actually a very correct comment. Perhaps I can help elaborate. Failures in systems distributed across multiple nodes in multiple datacenters can rarely be described in terms of binary uptime guarantees (e.g., either everything is up or everything is down). Instead, certain nodes may be unavailable at certain times, but given appropriate read and write parameters (and their implicit tradeoffs), these service interruptions may remain transparent.

Cassandra provides a variety of tools for tuning these tradeoffs, two of the most important of which are the consistency level for reads and writes and your replication factor. I'm sure you're familiar with these, but I mention them because thinking hard about the tradeoffs you're willing to make in terms of consistency and replication can heavily shape your operational experience if availability is of utmost importance. Of course, the single-node operational story is very important as well. Ryan's "it depends" comment takes on painful significance for me here, as we've found that the way read and write loads vary, along with their duration and intensity, can produce very different operational profiles and failure modes. If relaxed consistency is acceptable for your reads and writes, you'll likely find querying at CL.ONE to be more "available" than QUORUM or ALL, at the cost of reduced consistency; I've sketched out what I mean below. Similarly, if it's economical for you to provision extra nodes for a higher replication factor, you'll increase your ability to continue reading and writing in the event of single- or multi-node failures.

One of the prime challenges we've faced is reducing the frequency and intensity of full garbage collections in the JVM, which tend to result in single-node unavailability. Thanks to help from Jonathan Ellis and Peter Schuller (along with a fair amount of elbow grease on our part), we've worked through several of these issues and arrived at a steady state that keeps the ring happy even under load. We haven't found GC tuning to bring night-and-day differences beyond resolving the stop-the-world collections, but the difference is noticeable. Occasionally these issues result from Cassandra's own behavior: documented APIs such as querying for the count of all columns associated with a key will materialize the row on every node being queried. Once, when issuing a "count" query at CL.QUORUM for a key that had around 300k columns, we knocked three nodes out of our ring by triggering a stop-the-world collection that lasted about 30 seconds, so watch out for things like that.

Some of the other tuning knobs available to you involve tradeoffs such as when to flush memtables or trigger compactions, both of which are fairly intensive operations that can strain a cluster under heavy read or write load, but which are equally necessary for the cluster to remain in operation. If you find yourself pushing hard against these tradeoffs and trying to navigate a path between icebergs, it's very likely that the best answer to the problem is "more, or more powerful, hardware." But a lot of this is tacit knowledge, the kind that often comes through a bit of pain but is hopefully operationally transparent to your users.
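To make the consistency/availability tradeoff a bit more concrete, here's a minimal sketch assuming a pycassa-style client. The keyspace, column family, key, and host below are placeholders rather than anything from our setup, and exact import paths and parameter names may differ between client versions:

    import pycassa

    # Placeholder keyspace / column family / host, for illustration only.
    pool = pycassa.ConnectionPool('MyKeyspace', server_list=['10.0.0.1:9160'])
    users = pycassa.ColumnFamily(pool, 'Users')

    # A write at CL.ONE succeeds as soon as a single replica acknowledges it,
    # which keeps the write path "available" through single-node hiccups.
    users.insert('les', {'email': 'les@example.com'},
                 write_consistency_level=pycassa.ConsistencyLevel.ONE)

    # A read at CL.QUORUM needs a majority of replicas (2 of 3 with RF=3),
    # trading some availability for stronger consistency.
    row = users.get('les',
                    read_consistency_level=pycassa.ConsistencyLevel.QUORUM)

The point is just that the consistency level is a per-request decision, so you can run relaxed where that's acceptable and pay the quorum cost only where you actually need it.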
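And to illustrate the "count" pitfall I mentioned: a call like the one below looks cheap, but each replica answering it has to materialize the row in order to count its columns, which is exactly what bit us on the ~300k-column key. (Again, this assumes a pycassa-style client and placeholder names.)

    # Each replica consulted must read the entire row into memory to count
    # its columns; at QUORUM that's a majority of replicas doing so at once.
    n = users.get_count('some-very-wide-row-key',
                        read_consistency_level=pycassa.ConsistencyLevel.QUORUM)

If you find yourself needing counts over very wide rows regularly, it's usually gentler on the ring to maintain a running count as you write, or at least to count over bounded column slices rather than the whole row.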
Much of it is the sort of thing you discover only once the system is live in operation and your monitoring is providing continuous feedback about the ring's health. This is where Sasha's point becomes so critical -- having advanced early-warning systems in place, watching monitoring and graphs closely even when everything's fine, and beginning to understand how the ring likes to operate and what it tends to do will give you a huge leg up on reliability and allow you to react to issues in the ring before they have operational impact.

You mention that you've been building HA systems for a long time -- indeed, far longer than I have, so I'm sure you're also aware that good, solid "up/down" binaries are hard to come by, that none of this is easy, and that while some pointers are available (the defaults are actually quite good), it's essentially impossible to offer "the best production defaults" because they vary wildly based on your hardware, ring configuration, read/write load, and query patterns. To that end, you might find it more productive to begin with the defaults as you develop your system, and let the ring tell you how it's feeling as you begin load testing. Once you've stressed it to the point of failure, you'll see how it failed and either be able to isolate the cause and begin planning for that failure mode, or better yet, understand your maximum capacity given your current hardware and fire off a purchase order the second you see production spikes nearing 80% of that measured capacity (or apply the lessons you've learned in capacity planning as appropriate, of course).

Cassandra's a great system, but you may find that it requires a fair amount of active operational involvement and monitoring -- like any distributed system -- to maintain in a highly reliable fashion. Each of those nines implies extra time and operational cost, hopefully within the boundaries of the revenue stream the system is expected to support.

Pardon the long e-mail and for waxing a bit philosophical. I hope this provides some food for thought.

- Scott

---
C. Scott Andreas
Engineer, Urban Airship, Inc.
http://www.urbanairship.com

On Jun 22, 2011, at 4:16 PM, Les Hazlewood wrote:

> On Wed, Jun 22, 2011 at 4:11 PM, Peter Lin <wool...@gmail.com> wrote:
> you have to use multiple data centers to really deliver 4 or 5 9's of service
>
> We do, hence my question, as well as my choice of Cassandra :)
>
> Best,
>
> Les