Hi all, I am trying to understand the relationship between data set / SSTable size and Cassandra heap usage.
Q1. Here is the memory calc from the wiki: for a rough rule of thumb, Cassandra's internal data structures will require about memtable_throughput_in_mb * 3 * number of hot CFs + 1G + internal caches. This formula does not depend on the data set size. Does this mean that, provided Cassandra has sufficient disk space to accommodate the growing data set, it can run in fixed memory during a bulk load? Am I also right that the memory impact of compacting ever-larger SSTables is capped by the in_memory_compaction_limit_in_mb parameter? (I've put a quick worked example of how I read the formula in a P.S. below.)

Q2. What would I need to monitor to predict, ahead of time, the need to double the number of nodes, assuming sufficient storage per node? Is there a simple rule of thumb saying that with a heap of size X a node can handle a total SSTable size of Y? I do realize that I/O and CPU play a role here, but could that be reduced to a factor: Y = f(X) * z, where z is 1 for a specified server configuration? I am assuming the random partitioner and a fixed number of write clients.

Q3. Does the formula account for deserialization during reads? And what does the 1G term represent?

Thank you very much,
Oleg
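P.S. To show how I'm reading the formula, here is a tiny back-of-the-envelope sketch in Python; the throughput setting, the number of hot CFs, and the cache figure are placeholder numbers I made up, not measurements from a real cluster:

# Rough heap estimate per the wiki rule of thumb:
#   memtable_throughput_in_mb * 3 * number of hot CFs + 1G + internal caches
memtable_throughput_in_mb = 128   # placeholder: per-CF memtable throughput setting
hot_column_families = 4           # placeholder: CFs being actively written to
internal_caches_mb = 512          # placeholder: key/row cache overhead

heap_estimate_mb = (memtable_throughput_in_mb * 3 * hot_column_families
                    + 1024                 # the fixed 1G term from the formula
                    + internal_caches_mb)

print("rough heap estimate: ~%d MB" % heap_estimate_mb)   # ~3072 MB with these numbers

If that reading is correct, the estimate stays flat as the on-disk data grows, which is really the heart of Q1.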