> I am trying to understand the relationship between data set/SSTable(s)
> size and Cassandra heap.
>
> http://wiki.apache.org/cassandra/LargeDataSetConsiderations
>
> For a rough rule of thumb, Cassandra's internal datastructures will
> require about memtable_throughput_in_mb * 3 * number of hot CFs + 1G +
> internal caches.
>
> This formula does not depend on the data set size. Does this mean that,
> provided Cassandra has sufficient disk space to accommodate a growing
> data set, it can run in fixed memory for bulk load?

No, for reasons that I hope are covered at the above URL. The calculation
you refer to has more to do with how you tweak your memtables for
performance, which is only loosely coupled to data size.

The cost of index sampling and bloom filters, however, is very directly
related to database size (see the wiki URL). It is essentially a
trade-off: where a typical b-tree database would simply start demanding
additional seeks as the index grows larger, Cassandra limits the seeks
but instead has stricter memory requirements. If you're only looking to
smack huge amounts of data into the database without ever reading it, or
reading it very rarely, that is sub-optimal from a memory perspective.

Note though that these memory requirements are "per row key", rather
than "per byte of data".

> Am I right that memory impact of compacting increasing SSTable sizes is
> capped by a parameter in_memory_compaction_limit_in_mb?

That limits the amount of memory allocated for individual row
compactions, yes, and will put a cap on the GC pressure generated, in
addition to allowing huge rows to be compacted independently of heap
size.

> Q2. What would I need to monitor to predict ahead the need to double
> the number of nodes assuming sufficient storage per node? Is there a
> simple rule of thumb saying that for a heap of size X a node can handle
> SSTable of size Y? I do realize that the i/o and CPU play a role here
> but could that be reduced to a factor: Y = f(X) * z where z is 1 for a
> specified server config. I am assuming random partitioner and a fixed
> number of write clients.

Disregarding memtable tweaking, which has more to do with throughput,
the most important factors in terms of scaling memory requirements with
respect to data size are the number of row keys and the length of the
average row.

I recommend just empirically inserting, say, 10 million rows with
realistic row keys and observing the size of the resulting index and
bloom filter files (a rough way to do that measurement is sketched
below). Take into account to what extent compaction will cause memory
usage to temporarily spike. Also take into account that if you plan on
having very large rows, the indexes will begin having more than one
entry per row (see column_index_size_in_kb in the configuration).

If your use-case is somehow truly extreme, in the sense of huge data
sets with little to no requirement on query efficiency, the "per row
key" costs can be cut down by adjusting index_interval in the
configuration to affect the cost of index sampling, and the target false
positive rate of the bloom filters could be adjusted (in source, not
conf) to cut down on that. But really, that would be an unusual thing to
do I think, and I wouldn't recommend touching it without careful
consideration and a deep understanding of your expected use-case.
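For what it's worth, something along these lines is what I mean by
observing the file sizes after a test load. This is only a minimal
sketch; it assumes a default-style data directory (the path below is
just an example) and that per-SSTable components end in -Data.db,
-Index.db and -Filter.db, so adjust for your installation and version:

    #!/usr/bin/env python
    # Rough sketch: sum up SSTable component sizes under a Cassandra
    # data directory so the data size can be compared against the index
    # and bloom filter sizes after a test load. The default path and
    # the file suffixes are assumptions; adjust them to match your
    # installation and Cassandra version.
    import os
    import sys
    from collections import defaultdict

    DATA_DIR = sys.argv[1] if len(sys.argv) > 1 else "/var/lib/cassandra/data"
    SUFFIXES = ("-Data.db", "-Index.db", "-Filter.db")

    totals = defaultdict(int)  # suffix -> total bytes on disk

    for dirpath, dirnames, filenames in os.walk(DATA_DIR):
        for name in filenames:
            for suffix in SUFFIXES:
                if name.endswith(suffix):
                    totals[suffix] += os.path.getsize(os.path.join(dirpath, name))

    for suffix in SUFFIXES:
        print("%-12s %10.1f MB" % (suffix, totals[suffix] / (1024.0 * 1024.0)))

Keep in mind that, as I understand it, the bloom filters are held in
memory at more or less their on-disk size, while only a sample of the
index (roughly every index_interval-th entry) is kept in memory, so the
on-disk index size is an upper bound rather than the in-heap cost.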
> Q3. Does the formula account for deserialization during reads? What
> does 1G represent?

I don't know the background of that particular wiki statement, but my
guess is that the 1G is just sort of a general gut-feel "good to have"
base memory size rather than something very specifically calculated.

--
/ Peter Schuller
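P.S. In case it is useful, here is the sort of back-of-the-envelope
arithmetic I have in mind for the "per row key" costs. Every constant in
it (bits per bloom filter key, per-sample overhead, the index_interval
of 128) is an illustrative assumption only, so check the numbers against
your own version and configuration rather than trusting them:

    # Back-of-the-envelope estimate of per-row-key heap overhead from
    # bloom filters and index sampling. All constants are illustrative
    # assumptions, not authoritative numbers.
    def per_row_key_overhead_mb(num_rows,
                                avg_key_bytes=20,        # assumed average row key size
                                bloom_bits_per_key=15,   # assumed bloom filter sizing
                                index_interval=128,      # sampling interval (conf)
                                sample_overhead=32):     # assumed per-sample JVM overhead
        # Bloom filters are kept in memory in full.
        bloom_bytes = num_rows * bloom_bits_per_key / 8.0
        # One in-memory index sample per index_interval rows, each holding
        # roughly the row key plus offset/object overhead.
        sample_bytes = (num_rows / float(index_interval)) * (avg_key_bytes + sample_overhead)
        return (bloom_bytes + sample_bytes) / (1024.0 * 1024.0)

    if __name__ == "__main__":
        for rows in (10 * 1000 * 1000, 100 * 1000 * 1000, 1000 * 1000 * 1000):
            print("%12d rows -> ~%.0f MB" % (rows, per_row_key_overhead_mb(rows)))

The point of the exercise is just that the cost scales with the number
of row keys rather than with bytes of data, which is why wide-row and
narrow-row designs behave very differently memory-wise for the same
total data size.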