I second Peter's point: big servers are not always the best.

My experience (using spinning disks) is that 200 to 300 GB of live data load 
per node (including replicated data) is a sweet spot. Above this, the time taken 
for compaction, repair, off-node backups, node moves etc. starts to become a pain. 
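
As a rough illustration of why data-per-node matters, here's a back-of-envelope sketch (plain Python; the 30 MB/s effective streaming rate is an assumption for a busy cluster on spinning disks, not a measurement):

# Back-of-envelope: how long does it take to move one node's data
# (bootstrap, move, replacement)? The effective rate is an assumed
# figure -- measure your own cluster.

def hours_to_stream(data_gb, effective_mb_per_s=30):
    """Hours to stream data_gb of data at the assumed sustained rate."""
    return (data_gb * 1024.0 / effective_mb_per_s) / 3600

for data_gb in (250, 1000, 2000):
    print("%5d GB -> ~%.1f hours" % (data_gb, hours_to_stream(data_gb)))

# ~250 GB is a few hours; 1-2 TB per node turns every repair, move or
# replacement into a day-plus operation.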

Also, suffering catastrophic failure of 1 node in 100 is a better situation 
than 1 node in 16. 
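
The arithmetic behind that is simple but worth spelling out (illustrative Python only):

# Illustrative only: what share of the cluster does one dead node represent?
for cluster_size in (16, 100):
    print("1 of %3d nodes down -> %.1f%% of the cluster's capacity lost"
          % (cluster_size, 100.0 / cluster_size))

# 1 of 16 nodes is ~6.2% of your capacity (and its neighbours carry the
# extra load and any rebuild streaming); 1 of 100 is ~1%.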

Finally, when you have more servers with lower-performance disks, you also 
get more memory and more CPU cores in total. 

(I'm obviously ignoring the ops side here; automate with Chef or 
http://www.datastax.com/products/opscenter .) 

Regarding failure modes, I wrote this last year. It's about single-DC deployments, but 
you can probably work it out for multi-DC: 
http://thelastpickle.com/2011/06/13/Down-For-Me/
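
If you want to work the numbers yourself, the core of it is just quorum sizes per replica set; here's a minimal sketch (plain Python, illustrative, single-DC view, not taken from the article):

# Minimal sketch of the quorum arithmetic: a read or write at QUORUM
# needs floor(RF / 2) + 1 replicas of the row to respond.

def quorum(rf):
    return rf // 2 + 1

def tolerable_failures(rf):
    """Replicas of a given row that can be down with QUORUM still succeeding."""
    return rf - quorum(rf)

for rf in (2, 3, 5):
    print("RF=%d: QUORUM=%d, can lose %d replica(s) of any row"
          % (rf, quorum(rf), tolerable_failures(rf)))

# With the common RF=3, QUORUM=2: one replica of a row can be down (or
# stale while hints and read repair catch it up) and QUORUM reads and
# writes still succeed and still see the latest write.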

Hope that helps.

-----------------
Aaron Morton
Freelance Developer
@aaronmorton
http://www.thelastpickle.com

On 22/01/2012, at 1:18 PM, Thorsten von Eicken wrote:

> Good point. One thing I'm wondering about Cassandra is what happens when
> there is a massive failure. For example, if 1/3 of the nodes go down or
> become unreachable. This could happen in EC2 if an AZ has a failure, or
> in a datacenter if a whole rack or UPS goes dark. I'm not so concerned
> about the time where the nodes are down. If I understand replication,
> consistency, ring, and such I can architect things such that what must
> continue running does continue.
> 
> What I'm concerned about is when these nodes all come back up or
> reconnect. I have a hard time figuring out what exactly happens other
> than the fact that hinted handoffs get processed. Are the restarted
> nodes handling reads during that time? If so, they could serve up
> massive amounts of stale data, no? Do they then all start a repair, or
> is this something that needs to be run manually? If many do a repair at
> the same time, do I effectively end up with a down cluster due to the
> repair load? If no node was lost, is a repair required or are the hinted
> handoffs sufficient?
> 
> Is there a manual or wiki section that discusses some of this and I just
> missed it?
> 
> On 1/21/2012 2:25 PM, Peter Schuller wrote:
>>> Thanks for the responses! We'll definitely go for powerful servers to
>>> reduce the total count. Beyond a dozen servers there really doesn't seem
>>> to be much point in trying to increase count anymore for
>> Just be aware that if "big" servers imply *lots* of data (especially
>> in relation to memory size), it's not necessarily the best trade-off.
>> Consider the time it takes to do repairs, streaming, node start-up,
>> etc.
>> 
>> If it's only about CPU resources then bigger nodes probably make more
>> sense if the h/w is cost effective.
>> 
