Hi Les,

I wanted to offer a couple of thoughts on where to start and strategies for
approaching development and deployment with reliability in mind.

One way we've found to think more productively about the reliability of our
data tier is to shift the focus away from "uptime or x nines" and toward
"error rates." Ryan mentioned that "it depends," and while brief, that's an
accurate observation. Perhaps I can elaborate.

Failures in systems distributed across many nodes in multiple datacenters
can rarely be described in terms of binary uptime guarantees (e.g., either 
everything is up or everything is down). Instead, certain nodes may be 
unavailable at certain times, but given appropriate read and write parameters 
(and their implicit tradeoffs), these service interruptions may remain 
transparent.

Cassandra provides a variety of knobs for tuning these; two of the most
important are the consistency level for reads and writes and your replication
factor. I'm sure you're familiar with these, but I mention them
because thinking hard about the tradeoffs you're willing to make in terms of 
consistency and replication may heavily impact your operational experience if 
availability is of utmost importance.
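
For concreteness, here's roughly what dialing those two knobs looks like from
application code. This is just a sketch: it assumes the DataStax Java driver
(3.x-style API), a made-up contact point, and hypothetical keyspace, table,
and datacenter names, so adjust for whatever client you're actually using.

    import com.datastax.driver.core.Cluster;
    import com.datastax.driver.core.ConsistencyLevel;
    import com.datastax.driver.core.Session;
    import com.datastax.driver.core.SimpleStatement;

    public class TuningKnobs {
        public static void main(String[] args) {
            try (Cluster cluster = Cluster.builder()
                    .addContactPoint("10.0.0.1")   // hypothetical seed node
                    .build();
                 Session session = cluster.connect()) {

                // Replication factor: chosen once per keyspace, per datacenter.
                // The datacenter names here must match your snitch's names.
                session.execute(
                    "CREATE KEYSPACE IF NOT EXISTS app WITH replication = " +
                    "{'class': 'NetworkTopologyStrategy', 'dc1': 3, 'dc2': 3}");
                session.execute(
                    "CREATE TABLE IF NOT EXISTS app.settings " +
                    "(name text PRIMARY KEY, value text)");

                // Consistency level: chosen per request, trading consistency
                // for availability on that particular read or write.
                SimpleStatement read = new SimpleStatement(
                    "SELECT value FROM app.settings WHERE name = 'motd'");
                read.setConsistencyLevel(ConsistencyLevel.ONE);       // favors availability
                // read.setConsistencyLevel(ConsistencyLevel.QUORUM); // favors consistency
                session.execute(read);
            }
        }
    }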

Of course, the single-node operational story is very important as well. Ryan's 
"it depends" comment here takes on painful significance for myself, as we've 
found that the manner in which read and write loads vary, their duration, and 
intensity can have very different operational profiles and failure modes. If 
relaxed consistency is acceptable for your reads and writes, you'll likely find 
querying with CL.ONE to be more "available" than QUORUM or ALL, at the cost of
reduced consistency. Similarly, if it is economical for you to provision extra 
nodes for a higher replication factor, you will increase your ability to 
continue reading and writing in the event of single- or multiple-node failures.
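
To make that arithmetic concrete (quorum here meaning floor(RF/2) + 1
replicas), a rough tally of what each level tolerates for a given key:

    RF = 3:  QUORUM needs 2 replicas -> tolerates 1 replica down
             ONE    needs 1 replica  -> tolerates 2 replicas down (reads may be stale)
             ALL    needs 3 replicas -> any replica down fails the request
    RF = 5:  QUORUM needs 3 replicas -> tolerates 2 replicas down

And if the levels you read and write at overlap (e.g., QUORUM writes plus
QUORUM reads, since 2 + 2 > 3), a successful read is guaranteed to see the
most recent successful write.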

One of the prime challenges we've faced is reducing the frequency and intensity 
of full garbage collections in the JVM, which tend to result in single-node 
unavailability. Thanks to help from Jonathan Ellis and Peter Schuller (along 
with a fair amount of elbow grease ourselves), we've worked through several of 
these issues and have arrived at a steady state that leaves the ring happy even 
under load. We've not found GC tuning to bring night-and-day differences
outside of resolving the stop-the-world (STW) collections, but the difference
is noticeable.
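
If you want to see what the collector is doing to you, verbose GC logging is
cheap and invaluable. As an illustration only (these are pre-Java-9 HotSpot
flags, and the right set varies by JVM version), something along these lines
in cassandra-env.sh:

    JVM_OPTS="$JVM_OPTS -Xloggc:/var/log/cassandra/gc.log"
    JVM_OPTS="$JVM_OPTS -XX:+PrintGCDetails"
    JVM_OPTS="$JVM_OPTS -XX:+PrintGCDateStamps"
    JVM_OPTS="$JVM_OPTS -XX:+PrintTenuringDistribution"
    JVM_OPTS="$JVM_OPTS -XX:+PrintGCApplicationStoppedTime"

The last flag in particular makes "how long was the node actually paused" an
easy question to answer after the fact.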

Occasionally, these issues will result from Cassandra's behavior itself; 
documented APIs such as the query for the count of all columns associated with
a key will materialize the entire row on each node queried. Once, when issuing
a "count" query for a key that had around 300k columns at CL.QUORUM, we knocked 
three nodes out of our ring by triggering a stop-the-world collection that 
lasted about 30 seconds, so watch out for things like that.
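
If you genuinely need the count, one pattern that avoids materializing the
whole row in one shot is to page through it and count client-side. A minimal
sketch, again assuming the DataStax Java driver and a hypothetical wide_rows
table (a counter column maintained at write time is cheaper still if an
approximate figure will do):

    import com.datastax.driver.core.Cluster;
    import com.datastax.driver.core.ConsistencyLevel;
    import com.datastax.driver.core.Row;
    import com.datastax.driver.core.Session;
    import com.datastax.driver.core.SimpleStatement;

    public class SafeCount {
        public static void main(String[] args) {
            try (Cluster cluster = Cluster.builder().addContactPoint("10.0.0.1").build();
                 Session session = cluster.connect("app")) {
                SimpleStatement stmt = new SimpleStatement(
                    "SELECT col FROM wide_rows WHERE key = ?", "some-key");
                stmt.setConsistencyLevel(ConsistencyLevel.ONE);
                stmt.setFetchSize(1000);   // pull the row back 1,000 columns at a time

                long count = 0;
                for (Row ignored : session.execute(stmt)) {
                    count++;               // the driver fetches the next page as needed
                }
                System.out.println("columns: " + count);
            }
        }
    }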

Some of the other tuning knobs available to you involve tradeoffs such as when 
to flush memtables or to trigger compactions, both of which are somewhat 
intensive operations that can strain a cluster under heavy read or write load, 
but which are equally necessary for the cluster to remain in operation. If you 
find yourself pushing hard against these tradeoffs and attempting to navigate a 
path between icebergs, it's very likely that the best answer to the problem is 
"more or more powerful hardware."

But a lot of this is tacit knowledge, which often comes through a bit of pain
but is hopefully operationally transparent to your users: things that you only
discover once the system is live and your monitoring is providing continuous
feedback about the ring's health. This is where Sasha's point
becomes so critical -- having advanced early-warning systems in place, watching 
monitoring and graphs closely even when everything's fine, and beginning to 
understand how the ring likes to operate and what it tends to do will give you a huge
leg up on your reliability and allow you to react to issues in the ring before 
they present operational impact.
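
Much of what's worth graphing is exposed over JMX, and the same numbers are
easy to eyeball by hand while you're getting a feel for the ring. A few
starting points (command names current as of recent releases):

    nodetool ring               # which nodes are up and what they own
    nodetool tpstats            # pending/blocked thread pool tasks (backpressure)
    nodetool cfstats            # per-column-family read/write latencies
    nodetool compactionstats    # pending and in-flight compactions
    nodetool netstats           # streaming and hinted handoff activity

Pending tasks in tpstats creeping upward during normal traffic is often the
first quiet sign that a node is falling behind.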

You mention that you've been building HA systems for a long time -- indeed, far 
longer than I have, so I'm sure that you're also aware that good, solid 
"up/down" binaries are hard to come by, that none of this is easy, and that 
while some pointers are available (the defaults are actually quite good), it's 
essentially impossible to offer "the best production defaults" because they 
vary wildly based on your hardware, ring configuration, and read/write load and 
query patterns.

To that end, you might find it more productive to begin with the defaults as 
you develop your system, and let the ring tell you how it's feeling as you 
begin load testing. Once you have stressed it to the point of failure, you'll 
see how it failed and either be able to isolate the cause and begin planning to 
handle that mode, or better yet, understand your maximum capacity limits given 
your current hardware and fire off a purchase order the second you see spikes 
nearing 80% of the total measured capacity in production (or apply lessons 
you've learned in capacity planning as appropriate, of course).
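
The stress tool that ships with Cassandra is a reasonable way to find that
failure point before production traffic does. Invocation syntax varies by
release, but something in this spirit (numbers are arbitrary):

    # write-heavy pass, then a read pass against the same data
    cassandra-stress write n=1000000 -rate threads=200
    cassandra-stress read  n=1000000 -rate threads=200

Watch the same graphs during the run that you intend to watch in production;
the point is less the absolute numbers than seeing which resource gives out
first.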

Cassandra's a great system, but you may find that it requires a fair amount of 
active operational involvement and monitoring -- like any distributed system -- 
to maintain in a highly reliable fashion. Each of those nines implies extra
time and operational cost, hopefully within the boundaries of the revenue 
stream the system is expected to support.

Pardon the long e-mail and the bit of philosophical waxing. I hope this provides
some food for thought.

- Scott

---

C. Scott Andreas
Engineer, Urban Airship, Inc.
http://www.urbanairship.com

On Jun 22, 2011, at 4:16 PM, Les Hazlewood wrote:

> On Wed, Jun 22, 2011 at 4:11 PM, Peter Lin <wool...@gmail.com> wrote:
> you have to use multiple data centers to really deliver 4 or 5 9's of service
> 
> We do, hence my question, as well as my choice of Cassandra :)
> 
> Best,
> 
> Les

