Just in case someone uses the equations on that page, there is a small mathematical mistake: the exponent is missing a negative sign. The false-positive rate should be (1 - exp(-kn/m))^k, where k is the number of hash functions, n the number of inserted keys, and m the number of bits in the filter.
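For anyone who wants a quick numerical check, here is a minimal, self-contained sketch of the corrected formula in Java (the class name and the sample values k=4, n=1000, m=10000 are mine, just for illustration):

    // Sanity check of the corrected false-positive formula:
    // p = (1 - exp(-k*n/m))^k, for k hash functions, n inserted
    // keys and m bits in the filter.
    public class BloomFalsePositive {
        public static void main(String[] args) {
            int k = 4;      // hash functions
            int n = 1000;   // keys inserted
            int m = 10000;  // bits in the filter
            double p = Math.pow(1 - Math.exp(-(double) k * n / m), k);
            System.out.println("false positive rate = " + p); // ~0.0118
        }
    }

Without the negative sign, 1 - exp(kn/m) would be negative, which already shows the published version can't be a probability.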
Mohamed

On Mon, Sep 20, 2010 at 3:04 PM, Peter Schuller
<peter.schul...@infidyne.com> wrote:
> > Actually, the points you make are things I have overlooked and actually
> > make me feel more comfortable about how cassandra will perform for my
> > use cases. I'm interested, in my case, to find out what the bloom filter
> > false-positive rate is. Hopefully, a stat is kept on this.
>
> Assuming lack of implementation bugs and a good enough hash algorithm,
> the false positive rate of bloom filters is mathematically
> determined. See:
>
> http://pages.cs.wisc.edu/~cao/papers/summary-cache/node8.html
>
> And in cassandra:
>
> java/org/apache/cassandra/utils/BloomCalculations.java
> java/org/apache/cassandra/utils/BloomFilter.java
>
> (I don't know without checking (no time right now) whether the false
> positive rate is actually tracked or not.)
>
> > As long as
> > ALL of the bloom filters are in memory, the hit should be minimal for a
>
> Bloom filters are by design in memory at all times (they are the worst
> possible case you can imagine in terms of random access, so it would
> never make sense to keep them on disk even partially).
>
> (This assumes the JVM isn't being otherwise swapped out, which is
> another issue.)
>
> > Good point on the row cache. I had actually misread the comments in the
> > yaml, mistaking "do not use on ColumnFamilies with LARGE ROWS" as "do not
> > use on ColumnFamilies with a LARGE NUMBER OF ROWS". I don't know if this
> > will improve performance much, since I don't understand yet whether this
> > eliminates the need to check for the data in the SStables. If it doesn't,
> > then what is the point of the row cache, since the data is also in an
> > in-memory memtable?
>
> It does eliminate the need to go down to sstables. It also survives
> compactions (so it doesn't go cold when sstables are replaced).
>
> Reasons not to use the row cache with large rows include:
>
> * In general it's a waste of memory better given to the OS page cache,
>   unless possibly you're continually reading entire rows rather than
>   subsets of rows.
>
> * For truly large rows you may have immediate issues with the size of
>   the data being cached; e.g. attempting to cache a 2 GB row is not the
>   best idea in terms of heap space consumption; you'll likely OOM or
>   trigger fallbacks to full GC, etc.
>
> * Having a larger key cache may often be more productive.
>
> > That aside, splitting the memtable in 2 could make checking the bloom
> > filters unnecessary in most cases for me, but I'm not sure it's worth
> > the effort.
>
> Write-through row caching seems like a more direct approach to me
> personally, offhand. Also, to the extent that you're worried about
> false positive rates, larger bloom filters may still be an option (not
> currently configurable; would require source changes).
>
> --
> / Peter Schuller
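PS: the same formula also shows why Peter's last point (larger bloom filters) helps. As I understand it, BloomCalculations.java picks the number of hash functions for a given number of buckets per element; with the usual near-optimal choice k ≈ (m/n) ln 2, the false-positive rate falls roughly as 0.6185^(m/n). A rough sketch (the numbers come from the formula, not from measuring Cassandra, and the helper below is hypothetical, not Cassandra code):

    // Hypothetical helper, not Cassandra code: false-positive rate
    // for a filter with b bits per element, using the near-optimal
    // k = round(b * ln 2) hash functions.
    public class BloomSizing {
        static double rate(double bitsPerElement) {
            int k = (int) Math.round(bitsPerElement * Math.log(2));
            return Math.pow(1 - Math.exp(-k / bitsPerElement), k);
        }

        public static void main(String[] args) {
            // Prints roughly 0.092, 0.0082, 0.0007, 0.00007.
            for (int b = 5; b <= 20; b += 5) {
                System.out.printf("%d bits/element -> %.5f%n", b, rate(b));
            }
        }
    }

So going from 10 to 15 bits per key already cuts the rate from roughly 0.8% to under 0.1%, which is why a larger filter (if it were configurable) would be a real option.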