I'm planning to use Cassandra as a product's core data store, and it is imperative that it never goes down or loses data, even in the event of a data center failure. This uptime requirement ("five nines": 99.999% uptime), combined with the need for WAN capabilities, is largely what led me to choose Cassandra over other NoSQL products, given its history and its from-the-ground-up design for exactly these operational properties.
However, in a recent thread, a user reported that all four of his Cassandra instances went down within a short window of each other because the OS killed the Java processes due to memory starvation. Another user replied that running 0.8 and scheduling nodetool repair on each node regularly via a cron job (once a day?) seems to work for him.

Naturally this was disconcerting to read, given our need for a highly available product; we'd be royally screwed if this ever happened to us. But given Cassandra's history and its current production use, I'm aware that this level of HA/uptime is being achieved today, and I believe it is certainly achievable.

So, is there a collective set of guidelines or best practices to ensure this problem (or unavailability due to OOM) can be easily managed? I'm thinking of things like memory settings, initial GC recommendations, cron recommendations, ulimit settings, etc. that could be bundled up as a best-practices "Production Kickstart". Could anyone share their nuggets of wisdom, or point me to resources where this may already exist?
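To make the question concrete, here is a rough sketch of the kind of settings I have in mind. The specific values, paths, and schedule below are my own guesses, not recommendations, and I'd welcome corrections:

    # 1. Cron: nightly anti-entropy repair, staggered so only one node
    #    repairs at a time (node 2 would run at hour 2, node 3 at hour 3, ...).
    #    Crontab entry on node 1:
    0 1 * * * /usr/bin/nodetool -h localhost repair >> /var/log/cassandra/repair.log 2>&1

    # 2. Memory/GC: explicit heap bounds (and CMS) in conf/cassandra-env.sh
    #    so the JVM never grows past what the box can actually give it
    #    (sizes here are placeholders, not tuned values):
    MAX_HEAP_SIZE="8G"
    HEAP_NEWSIZE="800M"
    JVM_OPTS="$JVM_OPTS -XX:+UseConcMarkSweepGC"

    # 3. ulimits for the cassandra user in /etc/security/limits.conf,
    #    so locked memory and file handles aren't what takes a node down:
    cassandra - memlock unlimited
    cassandra - nofile 100000

Is this roughly the right shape of thing, and what would experienced operators change?

Thanks!
Best regards,
Les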