Re: 99.999% uptime - Operations Best Practices?

Les Hazlewood Wed, 22 Jun 2011 20:58:04 -0700

Edward,

Thank you so much for this reply - this is great stuff, and I really
appreciate it.


You'll be happy to know that I've already pre-ordered your book.  I'm
looking forward to it! (When is the ship date?)

Best regards,

Les

On Wed, Jun 22, 2011 at 7:03 PM, Edward Capriolo <edlinuxg...@gmail.com> wrote:

>
>
> On Wed, Jun 22, 2011 at 8:31 PM, Les Hazlewood <l...@katasoft.com> wrote:
>
>> Hi Thoku,
>>
>> You represented my intentions (and their reasoning) in this thread more
>> concisely than I was able to myself.  Thanks!
>>
>> On Wed, Jun 22, 2011 at 5:14 PM, Thoku Hansen <tho...@gmail.com> wrote:
>>
>>> I think that Les's question was reasonable. Why *not* ask the community
>>> for the 'gotchas'?
>>>
>>> Whether the info is already documented or not, it could be an opportunity
>>> to improve the documentation based on users' perception.
>>>
>>> The "you just have to learn" responses are fair also, but that reminds me
>>> of the days when running Oracle was a black art, and accumulated wisdom made
>>> DBAs irreplaceable.
>>>
>>
>> Yes, this was my initial concern.  I know that Cassandra is still young,
>> and I expect this to be the norm for a while, but I was hoping to make that
>> process a bit easier (for me and anyone else reading this thread in the
>> future).
>>
>>> Some recommendations *are* documented, but they are dispersed / stale /
>>> contradictory / or counter-intuitive.
>>>
>>> Others have not been documented in the wiki nor in DataStax's doco, and
>>> are instead learned anecdotally or The Hard Way.
>>>
>>> For example, whether documented or not, some of the 'gotchas' that I
>>> encountered when I first started working with Cassandra were:
>>>
>>> * Don't use OpenJDK. Prefer the Sun JDK. (Wiki says this
>>> <http://wiki.apache.org/cassandra/GettingStarted>, Jira says that
>>> <https://issues.apache.org/jira/browse/CASSANDRA-2441>.)
>>> * It's not viable to run without JNA installed.
>>> * Disable swap memory.
>>> * Need to run nodetool repair on a regular basis.
>>>
>>> I'm looking forward to Edward Capriolo's Cassandra book
>>> <https://www.packtpub.com/cassandra-apache-high-performance-cookbook/book>,
>>> which Les will probably find helpful.
>>>
>>
>> Thanks for linking to this.  I'm pre-ordering right away.
>>
>> And thanks for the pointers, they are exactly the kind of enumerated
>> things I was looking to elicit.  These are the kinds of things that are hard
>> to track down in a single place.  I think it'd be nice for the community to
>> contribute this stuff to a single page ('best practices', 'checklist',
>> whatever you want to call it).  It would certainly make things easier when
>> getting started.
>>
>> Thanks again,
>>
>> Les
>>
>
> Since I got a plug on the book I will chip in again to the thread :)
>
> Some things that were mentioned already:
>
> Definitely install JNA (without JNA the snapshot command has to fork to
> hard-link the sstables, and I have seen clients back off from this). Also,
> the performance-focused Cassandra devs always try to squeeze out performance
> by utilizing more native features.
>
> OpenJDK vs Sun. I agree. Almost always try to do what 'most others' do in
> production; this way you get surprised less.
>
> Other stuff:
>
> RAID. You might want to go RAID 1+0 if you are aiming for uptime. RAID 0
> has better performance, but if you lose a node your capacity is diminished,
> and rebuilding and rejoining a node involves more manpower, more steps, and
> more chances for human error.
>
> Collect statistics on the normal system items: CPU, disk (size and
> utilization), memory. Then collect the JMX Cassandra counters and understand
> how they interact. For example, record ReadCount and WriteCount per column
> family, then try to determine how this affects disk utilization. You can
> use this for capacity planning. Then try using a key/row cache. Evaluate
> again. Check the hit rate graph for your new cache. How did this affect your
> disk? You want to head off anything that can be a performance killer, like
> traffic patterns changing or data growing significantly.
>
> Do not be short on hardware. I do not want to say "overbuy", but if uptime
> is important, have spare drives and servers and have room to grow.
>
> Balance that ring :)
>
> I have not read the original thread concerning the problem you mentioned.
> One way to avoid OOM is large amounts of RAM :) On a more serious note, most
> OOMs are caused by setting caches or memtables too large. If the OOM was
> caused by a software bug, the Cassandra devs are on the ball and move fast.
> I still suggest not jumping into a release right away. I know it's hard to
> live without counters or CQL since new things are super cool. But if you
> want all those 9s, you're going to have to stay disciplined. Unless a release
> has a fix for a problem you think you have, stay a minor or revision back,
> or at least wait some time before upgrading to it, and do some internal
> confidence testing before pulling the trigger on an update.
>
> Almost all use cases demand that repair be run regularly due to the nature
> of distributed deletes.
>
> Other good tips: subscribe to all the mailing lists, and hang out in the
> IRC channels cassandra, cassandra-dev, cassandra-ops. You get an osmosis
> learning effect and you learn to fix or head off issues you never had.
>
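
As an illustration of the counter-collection advice quoted above, here is a
minimal sketch of turning two polls of cumulative counters into per-second
rates and a cache hit rate. ReadCount and WriteCount are the counters named in
the thread; the Sample class, its field names, and the cache hit/request
counters are hypothetical, and how the values are actually read from JMX is
left out entirely.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Sample:
    """One poll of cumulative counters for a column family (hypothetical shape)."""
    timestamp: float      # seconds since epoch when the poll was taken
    read_count: int       # cumulative ReadCount
    write_count: int      # cumulative WriteCount
    cache_hits: int       # cumulative cache hits (assumed counter)
    cache_requests: int   # cumulative cache requests (assumed counter)


def rates(prev: Sample, curr: Sample) -> dict:
    """Derive per-second rates and cache hit rate between two polls."""
    elapsed = curr.timestamp - prev.timestamp
    if elapsed <= 0:
        raise ValueError("samples must be in increasing time order")

    requests = curr.cache_requests - prev.cache_requests
    hit_rate: Optional[float] = None
    if requests > 0:
        hit_rate = (curr.cache_hits - prev.cache_hits) / requests

    return {
        "reads_per_sec": (curr.read_count - prev.read_count) / elapsed,
        "writes_per_sec": (curr.write_count - prev.write_count) / elapsed,
        "cache_hit_rate": hit_rate,  # None when the cache saw no requests
    }


# Example with made-up numbers: two polls taken 60 seconds apart.
before = Sample(timestamp=0.0, read_count=1000, write_count=500,
                cache_hits=100, cache_requests=200)
after = Sample(timestamp=60.0, read_count=7000, write_count=1100,
               cache_hits=550, cache_requests=800)
print(rates(before, after))
```

Graphing these rates next to disk utilization, before and after enabling a
cache, is one way to do the "evaluate again" step described in the reply.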
