> This particular cf has up to ~10 billion rows over 3 nodes. Each row is

With default settings, 143 million keys gives you roughly 2^31 bits of
bloom filter. Put another way, that's about 1 GB of bloom filter per
570 million keys, if I'm not mistaken. With 10 billion rows, that comes
to roughly 20 GB, plus whatever overhead is caused by rows appearing in
multiple sstables.
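
(Back-of-the-envelope sketch of that arithmetic, assuming the textbook
Bloom filter sizing formula and the old default false positive chance
of ~0.000744, i.e. ~15 bits per key; Cassandra rounds to whole buckets
per key, so treat the numbers as approximate:)

    import math

    def bloom_bits_per_key(fp_chance):
        # Textbook sizing: m/n = -ln(p) / (ln 2)^2
        return -math.log(fp_chance) / (math.log(2) ** 2)

    keys = 10_000_000_000   # ~10 billion rows in the cf
    default_fp = 0.000744   # ~15 bits per key with default settings

    total_bits = keys * bloom_bits_per_key(default_fp)
    print("%.1f GiB of bloom filter" % (total_bits / 8 / 2**30))
    # -> roughly 17.5 GiB, i.e. "roughly 20 gigs" before any overhead
    #    from rows appearing in multiple sstables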

Are you doing RF=1? That would explain how you're fitting this into 3
nodes with a 12 GB heap (at RF=1 the ~20 GB of bloom filter splits into
roughly 7 GB per node, whereas at RF=3 each node would carry the full
~20 GB). If not, I'm probably making a mistake in my calculation :)

> very small, <1k. Data from this cf is only read via hadoop jobs in batch
> reads of 16k rows at a time.
[snip]
> It's my understanding then for this use case that bloom filters are of
> little importance and that i can

Depends. I'm not familiar enough with how the Hadoop integration works,
so someone else will have to comment, but if your Hadoop jobs are just
performing normal key reads via Thrift, and the keys they grab are not
in token order, those reads are effectively random, and bloom filters
are still highly relevant to the number of I/O operations you need to
perform (they are what lets a read skip sstables that cannot contain
the key).

>  - upgrade to 1.0.7
>  - set fp_ratio=0.99
>  - set index_interval=1024
>
> This should alleviate much of the memory problems.
> Is this correct?

Provided that you do indeed not need the bloom filters, then yes. For
the record, I have not personally tried the fp_ratio setting yet, but
it certainly should significantly decrease memory use.
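
For reference, and off the top of my head so do double-check against
the documentation for your version: index_interval is a node-wide
setting in cassandra.yaml (default 128, picked up on restart), while
the false positive chance is a per-column-family attribute whose
spelling I remember as bloom_filter_fp_chance, settable from
cassandra-cli ("YourCF" below is just a placeholder). In cassandra.yaml
on each node:

    index_interval: 1024

and from cassandra-cli, per column family:

    update column family YourCF with bloom_filter_fp_chance = 0.99;

Bloom filters are built per sstable, so the new fp chance only applies
to sstables written after the change (flushes and compactions); old
sstables keep their existing filters until rewritten.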

-- 
/ Peter Schuller (@scode, http://worldmodscode.wordpress.com)
