>> I don't understand how you reached that conclusion.
>
> On my nodes most memory is consumed by bloom filters. Also 1.0 creates
The point is that just because that's the problem you have, it doesn't
mean the default is wrong, since it quite clearly depends on the
use-case. If your relative number of rows is low compared to the cost of
sustaining a read-heavy workload, the trade-off is different.

> Cassandra does not measure memory used by index sampling yet. I
> suspect that it will be memory hungry too and can be safely lowered by
> default; I see very little difference changing index sampling from 64
> to 512.

Bloom filters and index sampling are the two major contributors to
memory use that scale with the number of rows (and thus typically with
data size). This is known. Index sampling can indeed be significant (the
first sketch below gives a rough sense of the numbers). The default is
128 though, not 64. Here again it's a matter of trade-offs; 512 may have
worked for you, but that doesn't mean it's an appropriate default (I am
not arguing for 128 either, I am just saying that it's more complex than
observing that in your particular case you didn't see a problem with
512). Part of the trade-off is the additional CPU spent streaming and
deserializing a larger amount of data per average sstable index read;
part of it is the effect on I/O: a sparser index sampling can result in
a higher number of seeks per index lookup (the second sketch below
models this).

> Basic problem with cassandra daily administration which I am currently
> solving is that memory consumption grows with your dataset size. I
> don't really like this design - you put more data in and the cluster
> can OOM. This makes cassandra not an optimal solution for use in data
> archiving. It will get better after tunable bloom filters are
> committed.

That is a good reason for both to be configurable IMO (the third sketch
below shows why a fixed per-row memory cost implies a per-node row
ceiling).

-- 
/ Peter Schuller (@scode, http://worldmodscode.wordpress.com)
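For the bloom filter side, a back-of-envelope sketch using the standard
optimal-k sizing formula, m/n = -ln(p) / (ln 2)^2 bits per key. The
false-positive rates and the one-billion-row node here are illustrative
assumptions, not Cassandra defaults:

    import math

    def bloom_bits_per_key(p):
        # Standard optimal-k bloom filter sizing: m/n = -ln(p) / (ln 2)^2
        # bits per element for a target false-positive rate p.
        return -math.log(p) / (math.log(2) ** 2)

    rows = 1_000_000_000  # illustrative: one billion rows on a node
    for p in (0.01, 0.001, 0.0001):
        bits = bloom_bits_per_key(p)
        gib = rows * bits / 8 / 2**30
        print(f"p={p}: {bits:.1f} bits/key -> ~{gib:.2f} GiB")

Whatever p you pick, the memory is linear in row count, which is exactly
the scaling property under discussion.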
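For index sampling, a similarly hedged model. The 32-byte per-sample
overhead is a made-up placeholder (the real cost depends on key size,
JVM object overhead and Cassandra version); the intervals are the ones
mentioned in this thread plus the actual default:

    rows = 1_000_000_000   # illustrative row count per node
    entry_overhead = 32    # assumed bytes per in-memory sample (hypothetical)

    for interval in (64, 128, 512):
        mem_gib = rows / interval * entry_overhead / 2**30
        # Sparser sampling saves memory, but a lookup must scan up to
        # `interval` on-disk index entries from the nearest sample, so
        # the expected read/seek cost per lookup grows with the interval.
        print(f"index_interval={interval}: ~{mem_gib:.2f} GiB of samples, "
              f"up to {interval} entries scanned per lookup")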
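Finally, on the "put more data in and the cluster can OOM" point: if
every row carries a fixed in-memory cost, a fixed heap implies a hard
row ceiling per node. A toy calculation with the same made-up constants
as above:

    import math

    heap_budget = 4 * 2**30  # assume 4 GiB of heap for row-scaled structures
    p = 0.0001               # assumed bloom filter false-positive rate
    interval = 128           # index sampling default
    entry_overhead = 32      # assumed bytes per index sample

    per_row = (-math.log(p) / math.log(2) ** 2) / 8 + entry_overhead / interval
    print(f"~{per_row:.2f} bytes/row -> "
          f"ceiling of ~{heap_budget / per_row:,.0f} rows")

Making p and the interval configurable moves that ceiling around; it
does not remove the linear scaling itself.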