> On 29 Feb 2020, at 07:46, Claude Warren <cla...@xenei.com> wrote:
>
> Alex, would you take a look at pull request 131 [1]? It adds a new hasher
> implementation and makes the HashFunctionValidator available for public use.
>
> [1] https://github.com/apache/commons-collections/pull/131
OK. I’ll take a look.

I’ve been thinking about the counting Bloom filter and the backing storage. In summary:

1. The backing storage should be a fixed size array.
2. Merging a Hasher should count duplicate indices, not flatten them all to a single count.

For background I’ve used the standard formulas to estimate the number of indices that will be non-zero in a Bloom filter. The Wikipedia page gives this formula for the expected proportion of bits set to 0 (E(q)) if you have inserted i elements into a filter of size m using k hash functions:

E(q) = (1 - 1/m)^(ki)

So a rough guess of the proportion of indices (bits) used by a filter is 1 - E(q). Here is a table for Bloom filters with different intended capacities (n) and collision probabilities (p), showing the proportion of bits that will be set when 1%, 10% and 100% of the capacity of the filter has been met (a small sketch that reproduces these numbers follows the memory results below):

n      p      m       k   i      E(q)    1-E(q)
1000   1E-04  19171   13  10     0.9932  0.0068
1000   1E-04  19171   13  100    0.9344  0.0656
1000   1E-04  19171   13  1000   0.5076  0.4924
1000   1E-05  23963   17  10     0.9929  0.0071
1000   1E-05  23963   17  100    0.9315  0.0685
1000   1E-05  23963   17  1000   0.4919  0.5081
1000   1E-06  28756   20  10     0.9931  0.0069
1000   1E-06  28756   20  100    0.9328  0.0672
1000   1E-06  28756   20  1000   0.4988  0.5012
10000  1E-06  287552  20  100    0.9931  0.0069
10000  1E-06  287552  20  1000   0.9328  0.0672
10000  1E-06  287552  20  10000  0.4988  0.5012

The point is that if you create a Bloom filter and fill it to 10% of the intended capacity, the number of indices used will be about 6-7% of the filter bits.

So how to store the counts? Currently the counting Bloom filter uses a TreeMap<Integer, Integer>. I tried:

TreeMap<Integer, Integer>
HashMap<Integer, Integer>
TreeSet<MutableCount>
HashSet<MutableCount>
int[]

The MutableCount is a custom object that stores the bit index, uses it for the hash code, and has a mutable integer count field. It allows the count to be incremented/decremented while the object is in the set:

static final class MutableCount implements Comparable<MutableCount> {
    final int key;
    int count;

    MutableCount(int key) {
        this.key = key;
    }
    // hashCode() and equals() use only the key;
    // compareTo orders by the key for use in a TreeSet
}

This is adapted from the Bag<T> collection, which stores an item count with a MutableInteger. Here the mutable count is part of the same object T inserted into the Set, so you can find the object and change the count without having to put it back into the set. This is more efficient than replacing the immutable Integer stored in a Map.

I’ve estimated memory usage using an idea based on this article from JavaWorld: Java Tip 130: Do you know your data size? [1]. Basically you:

- create an object and throw it away, so that all classes are initialised
- free memory (run garbage collection) and record the current memory usage
- create a lot of your objects (held in an array)
- measure memory usage again
- memory per object = (after - before) / count

(A rough sketch of this procedure also follows the results below.) Here is some example output for n bits set in a filter of size m:

13107 / 262144 (0.050) : TreeMap<Integer, Integer>  =  733947 bytes
26214 / 262144 (0.100) : TreeMap<Integer, Integer>  = 1467866 bytes
13107 / 262144 (0.050) : TreeSet<MutableCount>      =  838928 bytes
26214 / 262144 (0.100) : TreeSet<MutableCount>      = 1677776 bytes
13107 / 262144 (0.050) : HashMap<Integer, Integer>  = 1677712 bytes
26214 / 262144 (0.100) : HashMap<Integer, Integer>  = 2306739 bytes
13107 / 262144 (0.050) : HashSet<MutableCount>      = 1782664 bytes
26214 / 262144 (0.100) : HashSet<MutableCount>      = 2516656 bytes
    0 / 262144 (0.000) : int[]                      = 1048608 bytes
    0 / 262144 (0.000) : short[]                    =  524320 bytes

The estimate is accurate to within 0.0001% for the arrays, so the method is working. The HashMap was created with its capacity set to the expected capacity of the filter (m).
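For reference, here is a minimal standalone sketch that reproduces the E(q) and 1-E(q) columns for the first few rows of the table above. This is just an illustration of the formula, not code from the PR:

public class ExpectedBitsDemo {
    public static void main(String[] args) {
        // Rows from the first table: m = filter bits, k = hash functions,
        // i = number of inserted elements.
        int[][] rows = { {19171, 13, 10}, {19171, 13, 100}, {19171, 13, 1000} };
        for (int[] r : rows) {
            int m = r[0], k = r[1], i = r[2];
            // E(q) = (1 - 1/m)^(ki), the expected proportion of zero bits
            double eq = Math.pow(1.0 - 1.0 / m, (double) k * i);
            System.out.printf("m=%d k=%d i=%d E(q)=%.4f set=%.4f%n",
                              m, k, i, eq, 1.0 - eq);
        }
    }
}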
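And a rough sketch of the measurement procedure itself (the class and method names here are mine, and the JavaWorld tip has a more careful version):

public class MemoryEstimator {
    public static void main(String[] args) throws Exception {
        final int count = 100;
        // Create one instance and discard it so class loading is done.
        createInstance();

        Object[] hold = new Object[count];
        long before = usedMemory();
        for (int j = 0; j < count; j++) {
            hold[j] = createInstance();
        }
        long after = usedMemory();
        System.out.println("bytes per instance = " + (after - before) / count);
    }

    // Example: a TreeMap with 13107 <index, count> entries spread over
    // 262144 bits, reusing the same Integer for the count as in the test.
    private static Object createInstance() {
        java.util.TreeMap<Integer, Integer> map = new java.util.TreeMap<>();
        Integer one = 1;
        for (int index = 0; index < 13107; index++) {
            map.put(index * 20, one);
        }
        return map;
    }

    private static long usedMemory() throws InterruptedException {
        // Ask for garbage collection a few times and let it settle.
        for (int j = 0; j < 4; j++) {
            System.gc();
            Thread.sleep(50);
        }
        Runtime rt = Runtime.getRuntime();
        return rt.totalMemory() - rt.freeMemory();
    }
}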
I’ve chosen these sizes because at 5% full a HashSet is less memory efficient than using a fixed size array, and at 10% full the TreeSet is also less efficient.

Note that the java.util.TreeSet/HashSet versions just wrap a Map and insert a dummy object for all keys in the Map. So here a Set is not as efficient as a Map, because in the Map test I always inserted the same Integer object representing 1. This would be the same as using a Set with an Integer key, but here the Set had to contain the MutableCount, which has an extra int field and is larger than an Integer.

These data lead me to think that a counting Bloom filter should just use a fixed size backing array:

- If created using the same Shape as a standard Bloom filter it uses a fixed size. This has a high memory cost when the filter is empty, but once it exceeds about 10% of the intended capacity it is more efficient than a dynamic backing store.

- All operations with another filter containing n indices will run in O(n) time. Each individual index count in the filter will have O(1) access/update. Performance will be limited by how well the entire array fits in the memory cache.

The drawback is that a counting Bloom filter with a very low number of inserted items will be memory inefficient. But under what circumstance would such a filter be used with a short-term lifecycle? If it is simply to be merged into another filter then this can be done using a merge with a Hasher. If counts are to be merged then perhaps we provide a method to merge counts using the same data structure returned by the CountingBloomFilter getCounts() method, e.g. using a stream of <index,count> pairs:

Stream<int[]> getCounts();
void add(Stream<int[]> counts);

The issue here is that the Shape and HashFunctionIdentity of the origin of the merge cannot be validated. So just leave it out and use the merge with a Hasher.

That brings me to the next issue with the counting Bloom filter implementation. Currently when it merges with a Hasher it puts all the indices into a Set, and so will only increment the count by 1 for each index identified by the Hasher. This appears to miss the entire point of the counting Bloom filter. If I hash an object to generate k indices I would hope to get k distinct indices. But the hash may not be perfect, and I may get anywhere from 1 to k distinct indices, with some duplicated. This is part of the signature of that object with the given hash, so surely a counting Bloom filter should accommodate it. If my Hasher generates the same index 20 times, this should result in the count of that index incrementing by 20. The result is that an object added directly to a counting Bloom filter using a Hasher can produce a different state than if it were added to a standard Bloom filter and that filter then added to the counting Bloom filter. (A small sketch of this duplicate-counting merge is at the end of this message, below the quoted text.)

Opinions on this?

Alex

[1] http://www.javaworld.com/javaworld/javatips/jw-javatip130.html

> On Tue, Feb 25, 2020 at 12:35 AM Alex Herbert <alex.d.herb...@gmail.com>
> wrote:
>
>> I have created a PR that contains most of the changes discussed in this
>> thread (see [1]).
>>
>> Please review the changes and comment here or on GitHub.
>>
>> I have left the CountingBloomFilter alone. I will reimplement the
>> add/subtract functionality as discussed into another PR.
>>
>> Alex
>>
>> [1] https://github.com/apache/commons-collections/pull/133
>
> --
> I like: Like Like - The likeliest place on the web
> <http://like-like.xenei.com>
> LinkedIn: http://www.linkedin.com/in/claudewarren
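To make the two summary points concrete, here is the sketch referred to above: a minimal, untested illustration of a counting filter backed by a fixed-size int[] whose merge increments a count for every index produced by a hasher, duplicates included. The names are mine for illustration, not the actual commons-collections API:

import java.util.PrimitiveIterator;

// Sketch: fixed-size backing array + duplicate-counting merge.
public class CountingSketch {
    private final int[] counts;

    CountingSketch(int numberOfBits) {
        // Fixed-size storage sized from the Shape, as proposed above.
        counts = new int[numberOfBits];
    }

    // 'indices' stands in for the iterator a Hasher produces for a Shape;
    // it may repeat an index.
    void merge(PrimitiveIterator.OfInt indices) {
        while (indices.hasNext()) {
            // Do NOT collect into a Set first: a duplicate index from the
            // hash is part of the object's signature, so count it each time.
            counts[indices.nextInt()]++;
        }
    }

    public static void main(String[] args) {
        CountingSketch filter = new CountingSketch(8);
        // A hasher that produces index 3 twice:
        filter.merge(java.util.Arrays.stream(new int[] {1, 3, 3, 5}).iterator());
        System.out.println(java.util.Arrays.toString(filter.counts));
        // prints [0, 1, 0, 2, 0, 1, 0, 0] -> index 3 was counted twice
    }
}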