> I am trying to understand the relationship between data set/SSTable(s) size 
> and
> Cassandra heap.

http://wiki.apache.org/cassandra/LargeDataSetConsiderations

> For a rough rule of thumb, Cassandra's internal datastructures will require
> about  memtable_throughput_in_mb * 3 * number of hot CFs + 1G + internal 
> caches.
>
> This formula does not depend on the data set size. Does this mean that 
> provided
> Cassandra has sufficient disk space to accommodate growing data set,  it can 
> run
> in fixed memory for bulk load?

No, for reasons that I hope are covered at the above URL. The
calculation you refer to has more to do with how you tweak your
memtables for performance, which is only loosely coupled to data size.
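That said, just to make the arithmetic of the quoted rule of thumb
concrete, here is a rough sketch in Python; every number in it is a
made-up example value, not a recommendation:

    # Wiki rule of thumb, with assumed example values.
    memtable_throughput_in_mb = 128   # assumed per-CF memtable threshold
    hot_cfs = 4                       # assumed number of actively written CFs
    internal_caches_mb = 512          # assumed key/row cache budget

    heap_mb = (memtable_throughput_in_mb * 3 * hot_cfs
               + 1024 + internal_caches_mb)
    print(heap_mb, "MB")              # 3072 MB for these example values

Note that nothing in that formula grows with the amount of data on
disk, which is exactly the limitation discussed below.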

The cost of index sampling and bloom filters is very directly related
to database size however (see the wiki URL). It is essentially a
trade-off: where a typical b-tree database would simply start
demanding additional seeks as the index grows larger, Cassandra
limits the seeks but instead has stricter memory requirements.
If you're only looking to smack huge amounts of data into the database
without ever reading it back, or reading it very rarely, that is
sub-optimal from a memory perspective.

Note though that these are memory requirements "per row key", rather
than "per byte of data".

> Am I right that the memory impact of compacting
> increasing SSTable sizes is capped by the parameter
> in_memory_compaction_limit_in_mb?

That limits the amount of memory allocated to individual row
compactions, yes. It puts a cap on the GC pressure generated, and it
allows huge rows to be compacted independently of heap size.

> Q2. What would I need to monitor to predict ahead the need to double the 
> number
> of nodes assuming sufficient storage per node? Is there a simple rule of thumb
> saying that for a heap of size X a node can handle SSTable of size Y? I do
> realize that the i/o and CPU play a role here but could that be reduced to a
> factor: Y = f(X) * z where z is 1 for a specified server config. I am assuming
> random partitioner and a fixed number of write clients.

Disregarding memtable tweaking, which has more to do with throughput,
the most important factors in scaling memory requirements w.r.t. data
size are the number of row keys and the length of the average row.

I recommend just empirically inserting, say, 10 million rows with
realistic row keys and observing the size of the resulting index and
bloom filter files. Take into account to what extent compaction will
cause memory usage to spike temporarily.
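For example, something along these lines; the file names, the target
row count and the spike factor are all assumptions you would replace
with your own observations:

    import os

    # Hypothetical SSTable component files from the 10M-row test.
    index_bytes  = os.path.getsize("Keyspace1-Standard1-1-Index.db")
    filter_bytes = os.path.getsize("Keyspace1-Standard1-1-Filter.db")

    observed_rows  = 10_000_000
    target_rows    = 200_000_000       # assumed rows per node being sized for
    scale          = target_rows / observed_rows

    index_interval = 128               # only ~1/index_interval of the index
                                       # is sampled into memory
    spike_factor   = 2.0               # assumed headroom for compaction spikes

    projected_mb = ((index_bytes / index_interval + filter_bytes)
                    * scale * spike_factor / 2**20)
    print(round(projected_mb), "MB rough budget for samples + bloom filters")

The in-memory samples carry Java object overhead on top of the raw
bytes, so treat this strictly as a lower-bound ballpark.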

Also take into account that if you plan on having very large rows, the
indexes will begin having more than one entry per row (see
column_index_size_in_kb in the configuration).
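As a rough illustration (the row size is made up; 64 KB is the
default column_index_size_in_kb):

    import math

    # Approximately one index entry per column_index_size_in_kb of
    # serialized row data once a row exceeds that size.
    row_size_kb = 10 * 1024            # assumed: a 10 MB wide row
    column_index_size_in_kb = 64       # default setting
    entries = math.ceil(row_size_kb / column_index_size_in_kb)
    print(entries, "index entries for this one row instead of 1")

So a data set of a few very wide rows is not as cheap index-wise as
the row count alone would suggest.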

If your use-case is somehow truly extreme, in the sense of huge data
sets with little to no requirement on query efficiency, the "per row
key" costs can be cut down: index_interval in the configuration can be
raised to reduce the cost of index sampling, and the target false
positive rate of the bloom filters could be adjusted (in source, not
in the configuration) to cut down on that cost. But really, that would
be an unusual thing to do, I think, and I wouldn't recommend touching
it without careful consideration and a deep understanding of your
expected use-case.
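For a sense of the scale involved (the row count is assumed; 128 is
the shipped default for index_interval):

    # Sampled index entries held in memory as index_interval changes.
    row_keys = 500_000_000             # assumed rows on the node
    for index_interval in (128, 256, 512):
        samples = row_keys // index_interval
        print(index_interval, "->", samples, "sampled entries")

Doubling index_interval halves the sample count, at the cost of
scanning more of the on-disk index per lookup.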

> Q3. Does the formula account for deserialization during reads? What does 1G
> represent?

I don't know the background of that particular wiki statement, but my
guess is that 1G is just sort of a general gut feel "good to have"
base memory size rather than something very specifically calculated.

-- 
/ Peter Schuller
