>> I don't understand how you reached that conclusion.
>
> On my nodes most memory is consumed by bloom filters. Also 1.0 creates
The point is that just because that's the problem you have, it doesn't
mean the default is wrong, since it quite clearly depends on the
use-case. If your relative number of rows is low compared to the cost of
sustaining a read-heavy workload, the trade-off is different.

> Cassandra does not measure memory used by index sampling yet. I
> suspect that it will be memory hungry too and can be safely lowered by
> default; I see very little difference changing index sampling from 64
> to 512.

Bloom filters and index sampling are the two major contributors to
memory use that scale with the number of rows (and thus typically with
data size). This is known. Index sampling can indeed be significant (the
first sketch below gives a rough sense of the numbers). The default is
128 though, not 64. Here again it's a matter of trade-offs; 512 may have
worked for you, but that doesn't mean it's an appropriate default (I am
not arguing for 128 either, I am just saying that it's more complex than
observing that in your particular case you didn't see a problem with
512). Part of the trade-off is the additional CPU spent streaming and
deserializing a larger amount of data per average sstable index read;
part of it is the effect on I/O: a sparser index sampling can result in
a higher number of seeks per index lookup (the second sketch below
models this).

> Basic problem with cassandra daily administration which I am currently
> solving is that memory consumption grows with your dataset size. I
> don't really like this design - you put more data in and the cluster
> can OOM. This makes cassandra not an optimal solution for use in data
> archiving. It will get better after tunable bloom filters are
> committed.

That is a good reason for both to be configurable IMO (the third sketch
below shows why a fixed per-row memory cost implies a per-node row
ceiling).

-- 
/ Peter Schuller (@scode, http://worldmodscode.wordpress.com)
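For the bloom filter side, a back-of-envelope sketch using the standard
optimal-k sizing formula, m/n = -ln(p) / (ln 2)^2 bits per key. The
false-positive rates and the one-billion-row node here are illustrative
assumptions, not Cassandra defaults:

    import math

    def bloom_bits_per_key(p):
        # Standard optimal-k bloom filter sizing: m/n = -ln(p) / (ln 2)^2
        # bits per element for a target false-positive rate p.
        return -math.log(p) / (math.log(2) ** 2)

    rows = 1_000_000_000  # illustrative: one billion rows on a node
    for p in (0.01, 0.001, 0.0001):
        bits = bloom_bits_per_key(p)
        gib = rows * bits / 8 / 2**30
        print(f"p={p}: {bits:.1f} bits/key -> ~{gib:.2f} GiB")

Whatever p you pick, the memory is linear in row count, which is exactly
the scaling property under discussion.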
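For index sampling, a similarly hedged model. The 32-byte per-sample
overhead is a made-up placeholder (the real cost depends on key size,
JVM object overhead and Cassandra version); the intervals are the ones
mentioned in this thread plus the actual default:

    rows = 1_000_000_000   # illustrative row count per node
    entry_overhead = 32    # assumed bytes per in-memory sample (hypothetical)

    for interval in (64, 128, 512):
        mem_gib = rows / interval * entry_overhead / 2**30
        # Sparser sampling saves memory, but a lookup must scan up to
        # `interval` on-disk index entries from the nearest sample, so
        # the expected read/seek cost per lookup grows with the interval.
        print(f"index_interval={interval}: ~{mem_gib:.2f} GiB of samples, "
              f"up to {interval} entries scanned per lookup")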
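Finally, on the "put more data in and the cluster can OOM" point: if
every row carries a fixed in-memory cost, a fixed heap implies a hard
row ceiling per node. A toy calculation with the same made-up constants
as above:

    import math

    heap_budget = 4 * 2**30  # assume 4 GiB of heap for row-scaled structures
    p = 0.0001               # assumed bloom filter false-positive rate
    interval = 128           # index sampling default
    entry_overhead = 32      # assumed bytes per index sample

    per_row = (-math.log(p) / math.log(2) ** 2) / 8 + entry_overhead / interval
    print(f"~{per_row:.2f} bytes/row -> "
          f"ceiling of ~{heap_budget / per_row:,.0f} rows")

Making p and the interval configurable moves that ceiling around; it
does not remove the linear scaling itself.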