Aaron makes a good point. In my opinion, the happiest customers are the ones that choose nodes on the smaller side, and more of them.

Regarding the working set, I am referring to the OS page cache. On Linux, with JNA enabled, Cassandra makes very effective use of memory-mapped files, and that is where I would expect most of your working set to reside.
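
To make the mechanism concrete, here is a minimal Java sketch of a memory-mapped read (this is not Cassandra's actual code, and the file path is made up). The kernel faults pages into the OS page cache on first access; repeat reads of a hot page come from RAM, never touching disk:

    import java.nio.MappedByteBuffer;
    import java.nio.channels.FileChannel;
    import java.nio.file.Paths;
    import java.nio.file.StandardOpenOption;

    public class MmapRead {
        public static void main(String[] args) throws Exception {
            // Path is illustrative; point it at any data file.
            try (FileChannel ch = FileChannel.open(
                    Paths.get("/var/lib/cassandra/data/ks/cf-Data.db"),
                    StandardOpenOption.READ)) {
                // Map the first 4 KB read-only. The kernel pulls the page
                // into the OS cache on first access and serves repeat
                // reads from memory.
                MappedByteBuffer buf = ch.map(
                        FileChannel.MapMode.READ_ONLY, 0,
                        Math.min(ch.size(), 4096));
                byte b = buf.get(0); // first access may fault to disk
                b = buf.get(0);      // this one comes from the page cache
                System.out.println("read byte: " + b);
            }
        }
    }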

The smaller the data set on each node, the higher the proportion of CPU cycles, disk I/O, network bandwidth, and memory you can dedicate to serving that data within your use case.
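
To put rough numbers on that (illustrative figures, not benchmarks): a node with 16 GB of RAM available to the page cache can keep about 6% of a 250 GB data set hot, but only about 0.3% of a 5 TB data set. Likewise, streaming 100 GB to a replacement node at 100 MB/s takes under 20 minutes; streaming 10 TB at the same rate takes over a day.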

Ben

On 6/7/11 2:15 PM, aaron morton wrote:
I'd also say: consider what happens during maintenance and failure scenarios.
Moving tens of TB around takes a lot longer than hundreds of GB.

Cheers

-----------------
Aaron Morton
Freelance Cassandra Developer
@aaronmorton
http://www.thelastpickle.com

On 8 Jun 2011, at 06:40, AJ wrote:

Thanks to everyone who responded thus far.


On 6/7/2011 10:16 AM, Benjamin Coverston wrote:
<snip>
That's not to say there aren't workloads where having many TB per node works,
but if you're planning to read the data you're writing, you do want to
ensure that your working set fits in memory.

Thank you, Ben. Can you elaborate on the above point? Are you referring to
the OS's working set or the Cassandra caches? Why exactly do I need to
ensure this?

I am also wondering whether there is any reason to segregate my small,
frequently read and written data set (such as usage statistics) from my
bulk, mostly read-only data set (static content) into separate CFs, if the
schema allows it. Would this be of any benefit?

--
Ben Coverston
Director of Operations
DataStax -- The Apache Cassandra Company
http://www.datastax.com/
