Re: 99.999% uptime - Operations Best Practices?

Edward Capriolo Wed, 22 Jun 2011 19:04:07 -0700

On Wed, Jun 22, 2011 at 8:31 PM, Les Hazlewood <l...@katasoft.com> wrote:

> Hi Thoku,
>
> You were able to more concisely represent my intentions (and their
> reasoning) in this thread than I was able to do so myself.  Thanks!
>
> On Wed, Jun 22, 2011 at 5:14 PM, Thoku Hansen <tho...@gmail.com> wrote:
>
>> I think that Les's question was reasonable. Why *not* ask the community
>> for the 'gotchas'?
>>
>> Whether the info is already documented or not, it could be an opportunity
>> to improve the documentation based on users' perception.
>>
>> The "you just have to learn" responses are fair also, but that reminds me
>> of the days when running Oracle was a black art, and accumulated wisdom made
>> DBAs irreplaceable.
>>
>
> Yes, this was my initial concern.  I know that Cassandra is still young,
> and I expect this to be the norm for a while, but I was hoping to make that
> process a bit easier (for me and anyone else reading this thread in the
> future).
>
>> Some recommendations *are* documented, but they are dispersed / stale /
>> contradictory / or counter-intuitive.
>>
>> Others have not been documented in the wiki nor in DataStax's doco, and
>> are instead learned anecdotally or The Hard Way.
>>
>> For example, whether documented or not, some of the 'gotchas' that I
>> encountered when I first started working with Cassandra were:
>>
>> * Don't use OpenJDK. Prefer the Sun JDK. (Wiki says this
>> <http://wiki.apache.org/cassandra/GettingStarted>, Jira says that
>> <https://issues.apache.org/jira/browse/CASSANDRA-2441>).
>> * It's not viable to run without JNA installed.
>> * Disable swap memory.
>> * Need to run nodetool repair on a regular basis.
>>
>> I'm looking forward to Edward Capriolo's Cassandra book
>> <https://www.packtpub.com/cassandra-apache-high-performance-cookbook/book>,
>> which Les will probably find helpful.
>>
>
> Thanks for linking to this.  I'm pre-ordering right away.
>
> And thanks for the pointers, they are exactly the kind of enumerated things
> I was looking to elicit.  These are the kinds of things that are hard to
> track down in a single place.  I think it'd be nice for the community to
> contribute this stuff to a single page ('best practices', 'checklist',
> whatever you want to call it).  It would certainly make things easier when
> getting started.
>
> Thanks again,
>
> Les
>

Since I got a plug on the book I will chip in again to the thread :)

Some things that were mentioned already:

Definitely install JNA. Without JNA the snapshot command has to fork to hard
link the sstables, and I have seen clients back off from this. Also, the
performance-focused Cassandra devs always try to squeeze out performance by
utilizing more native features.

OpenJDK vs Sun. I agree. Almost always try to do what 'most others' do in
production; this way you get surprised less.

Other stuff:

RAID. You might want to go RAID 1+0 if you are aiming for uptime. RAID 0 has
better performance, but a single failed disk takes the node down. When you
lose a node your capacity is diminished, and rebuilding and rejoining a node
involves more manpower, more steps, and more chances for human error.

Collect statistics on the normal system items: CPU, disk (size and
utilization), and memory. Then collect the Cassandra JMX counters and
understand how they interact. For example, record ReadCount and WriteCount
per column family, then try to determine how this affects disk utilization.
You can use this for capacity planning. Then try using a key/row cache and
evaluate again. Check the hit rate graph for your new cache. How did this
affect your disk? You want to head off anything that can be a performance
killer, like traffic patterns changing or data growing significantly.
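
A rough sketch of the before/after cache evaluation described above, in
Python. It assumes you already sample ReadCount and the cache hit/request
counters yourself (how you pull them over JMX is left out). The function
names and parameters are made up for illustration, and it treats every cache
hit as a read that avoids the disk, which is closer to true for the row cache
than for the key cache.

```python
def hit_rate(hits, requests):
    """Fraction of cache requests served from the cache (0.0 if no requests)."""
    return hits / requests if requests else 0.0


def disk_reads_per_sec(read_count_start, read_count_end, interval_secs,
                       cache_hit_rate):
    """Estimate reads per second that miss the cache and go to disk.

    read_count_start / read_count_end are two samples of a column family's
    ReadCount taken interval_secs apart.
    """
    reads_per_sec = (read_count_end - read_count_start) / interval_secs
    return reads_per_sec * (1.0 - cache_hit_rate)
```

Run it once with a hit rate of 0.0 (no cache) and once with the measured hit
rate, then compare both numbers against what the disk utilization graphs
actually show.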

Do not be short on hardware. I do not want to say "overbuy", but if uptime is
important, have spare drives and servers and have room to grow.

Balance that ring :)
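
For reference, a minimal Python sketch of picking evenly spaced initial
tokens. It assumes RandomPartitioner (token space 0 to 2**127), a single
datacenter, and nodes of equal capacity; balanced_tokens is a made-up helper
name, not a Cassandra tool.

```python
RANDOM_PARTITIONER_RANGE = 2 ** 127  # assumed token space for RandomPartitioner


def balanced_tokens(node_count):
    """Return one evenly spaced initial token per node."""
    if node_count < 1:
        raise ValueError("node_count must be at least 1")
    return [i * RANDOM_PARTITIONER_RANGE // node_count
            for i in range(node_count)]


if __name__ == "__main__":
    for node, token in enumerate(balanced_tokens(4)):
        print("node %d: initial_token = %d" % (node, token))
```

Compare the output against the tokens nodetool ring reports. If the ownership
is far from even, the ring needs rebalancing.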

I have not read the original thread concerning the problem you mentioned.
One way to avoid OOM is large amounts of RAM :) On a more serious note, most
OOMs are caused by setting caches or memtables too large. If the OOM was
caused by a software bug, the Cassandra devs are on the ball and move fast.
I still suggest not jumping into a release right away. I know it's hard to
live without counters or CQL since new things are super cool. But if you
want all those 9s you're going to have to stay disciplined. Unless a release
has a fix for a problem you think you have, stay a minor or revision back,
or at least wait some time before upgrading to it, and do some internal
confidence testing before pulling the trigger on an update.

Almost all use cases demand that repair be run regularly due to the nature of
distributed deletes.
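
As a sanity check on "regularly": the usual guidance is that every node gets
repaired at least once within gc_grace_seconds, otherwise deleted data can
reappear. The Python sketch below only checks that arithmetic. The 10 day
default is an assumption (verify it against your own column family
settings), and repair_interval_ok is a hypothetical helper.

```python
SECONDS_PER_DAY = 86400
# Assumed default gc_grace_seconds (10 days); check your schema.
DEFAULT_GC_GRACE_SECONDS = 864000


def repair_interval_ok(interval_days,
                       gc_grace_seconds=DEFAULT_GC_GRACE_SECONDS):
    """True if repairing every interval_days stays inside gc_grace_seconds."""
    return interval_days * SECONDS_PER_DAY < gc_grace_seconds
```

A weekly repair passes with the assumed default; a fortnightly one does not,
so either repair more often or raise gc_grace_seconds.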

Other good tips: subscribe to all the mailing lists, and hang out in the IRC
channels cassandra, cassandra-dev, cassandra-ops. You get an osmosis
learning effect and you learn to fix or head off issues you never had.
