> This particular cf has up to ~10 billion rows over 3 nodes. Each row is

With default settings, 143 million keys roughly gives you 2^31 bits of
bloom filter. Or put another way, you get about 1 GB of bloom filters
per 570 million keys, if I'm not mistaken. If you have 10 billion rows,
that should be roughly 20 gigs, plus any overhead caused by rows
appearing in multiple sstables.

Are you doing RF=1? That would explain how you're fitting this into 3
nodes with a heap size of 12 gb. If not, I'm probably making a mistake
in my calculation :)

> very small, <1k. Data from this cf is only read via hadoop jobs in batch
> reads of 16k rows at a time.

[snip]

> It's my understanding then for this use case that bloom filters are of
> little importance and that i can

Depends. I'm not familiar enough with how the hadoop integration works,
so someone else will have to comment, but if your hadoop jobs are just
performing normal reads of keys via thrift and the keys they are
grabbing are not in token order, those reads would be effectively
random, and bloom filters should still be highly relevant to the number
of I/O operations you need to perform.

> - upgrade to 1.0.7
> - set fp_ratio=0.99
> - set index_interval=1024
>
> This should alleviate much of the memory problems.
> Is this correct?

Provided that you do indeed not need the BFs, then yeah. For the
record, I have not yet personally tried the fp_ratio setting, but it
certainly should significantly decrease memory use.

-- 
/ Peter Schuller (@scode, http://worldmodscode.wordpress.com)
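
P.S. For what it's worth, here is the arithmetic behind the size
estimate written out as a small Python sketch. The ~15 bits per key
figure is an assumption on my part: it is simply what 2^31 bits for 143
million keys works out to, and the exact number depends on the filter's
target false positive rate.

    # Rough bloom filter memory estimate, assuming ~15 bits per key.
    BITS_PER_KEY = 2 ** 31 / 143e6   # ~15 bits/key, derived from the numbers above

    def bloom_filter_gib(num_keys, bits_per_key=BITS_PER_KEY):
        """Estimated bloom filter size in GiB for num_keys unique row keys."""
        return num_keys * bits_per_key / 8 / 2 ** 30

    print(bloom_filter_gib(570e6))   # ~1.0  -> about 1 GB per 570 million keys
    print(bloom_filter_gib(10e9))    # ~17.5 -> "roughly 20 gigs" for 10 billion rows,
                                     #          before counting rows that appear in
                                     #          more than one sstable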