> On 1 Mar 2020, at 09:28, Claude Warren <cla...@xenei.com> wrote:
>
> The idea of a backing array is fine and the only problem I see with it is
> in very large filters (on the order of 10^8 bits and larger) but document
> the size calculation and let the developer worry about it.
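Agreed that the size calculation should be documented. As a reference point,
here is a quick sketch of the textbook sizing formulas in plain Java (this is
illustrative only, not the library API; the n and p values are taken from the
first calculator example below):

    public class BloomFilterSize {
        public static void main(String[] args) {
            long n = 149_363_281L;  // expected number of items
            double p = 0.001;       // desired false positive probability
            double ln2 = Math.log(2);
            // Textbook approximations:
            //   m = ceil(-n * ln(p) / (ln 2)^2), k = round((m / n) * ln 2)
            long m = (long) Math.ceil(-n * Math.log(p) / (ln2 * ln2));
            long k = Math.round((double) m / n * ln2);
            System.out.printf("m = %d bits, k = %d, int[] counts = %.1f GiB%n",
                    m, k, m * 4.0 / (1L << 30));
        }
    }

Running that reproduces the first calculator line below (m ~ 2^31, k = 10) and
shows why an int[] of counts at that size is ~8 GiB.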
Let us look at the use case where we max out the array. Using the Bloom
filter calculator:

n = 149,363,281, p = 0.001000025 (1 in 1000), m = 2147483647 (256MiB), k = 10
n = 74,681,641, p = 0.000001 (1 in 999950), m = 2147483647 (256MiB), k = 20
n = 49,787,761, p = 0.000000001 (1 in 999924899), m = 2147483647 (256MiB), k = 30

So you will be able to put somewhere in the order of 10^8 or 10^7 items into
the filter. I would say that anyone putting more than that into the filter
has an unusual use case. The CountingBloomFilter can throw an exception if m
is too large and will throw an OutOfMemoryError if you cannot allocate an
array large enough.

One clear point here is that you cannot use a short as a 16-bit count would
easily overflow. So you have to use an integer array for the counts. A
maximum length int[] is roughly 8GB.

What would another implementation cost in terms of memory? The
TreeMap<Integer, Integer> was the most space efficient. In the previous
e-mail the saturation of the Bloom filter's bits was approximately 50% when
at the intended capacity. So we have to estimate the size of a TreeMap
containing Integer.MAX_VALUE/2 indices ~ 2^30. The memory test shows the
TreeMap memory scales linearly with entries:

32768 / 65536 (0.500) : TreeMap<Integer, Integer> = 1834061 bytes
65536 / 131072 (0.500) : TreeMap<Integer, Integer> = 3669080 bytes
131072 / 262144 (0.500) : TreeMap<Integer, Integer> = 7339090 bytes

So what is the memory for a TreeMap with 2^30 indices? I make it about:

(2^30 / 131,072) * 7,339,090 bytes ~ 6e10 bytes = 55.99 GiB

I would say that this amount of RAM is unusual. It is definitely not as
efficient as using an array. So very large counting Bloom filters are going
to require some thought as to the hardware they run on. This may not be the
case in 10 years' time.

I would say that we try an int[] backing array for the storage
implementation and document its limitations. A different implementation
could be provided in future if required. This could be done by making
CountingBloomFilter an interface that extends BloomFilter with the methods:

subtract(BloomFilter filter)
subtract(Hasher filter)

These will negate the effect of the corresponding merge(BloomFilter)
operation.

Do we also need access to the counts and add/subtract of another
CountingBloomFilter?:

add(CountingBloomFilter filter);
subtract(CountingBloomFilter filter);
Iterator<int[]> getCounts();
int getSize(); // Number of items added

The CountingBloomFilter is then an interface that defines how to reverse the
merge of some bits into the filter.

My concern is the inefficiency of the creation of objects in any method that
provides access to the counts (e.g. above using an iterator as for
Hasher.getBits()). I presume this method would be to allow some type of
storage/serialisation of the filter akin to the long[] getBits() method of
BloomFilter. So it may be better done using a method:

int getCount(int index);

The caller can then use long[] getBits() to get the indices set in the
filter and then for each non-zero bit index call getCount(index). Or just
not include the method as the counts are only of concern when storing the
filter. This functionality is cleaner pushed into an implementation.
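For illustration, serialisation could then be a sketch like this
(hypothetical method; it assumes only the proposed int getCount(int) and the
existing long[] getBits()):

    // Sketch: emit <index, count> pairs using only getBits() and the
    // proposed getCount(int); no per-entry object is created by the filter.
    static void writeCounts(CountingBloomFilter filter) {
        long[] bits = filter.getBits();
        for (int i = 0; i < bits.length; i++) {
            long word = bits[i];
            while (word != 0) {
                int index = i * Long.SIZE + Long.numberOfTrailingZeros(word);
                System.out.println(index + " " + filter.getCount(index));
                word &= word - 1; // clear the lowest set bit
            }
        }
    }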
In a previous post we discussed whether to throw an exception on
overflow/underflow or raise an invalid flag. Using the invalid flag idea the
interface would be:

interface CountingBloomFilter {
    int add(CountingBloomFilter filter);
    int subtract(BloomFilter filter);
    int subtract(Hasher filter);
    int subtract(CountingBloomFilter filter);
    int getStatus();

    // Maybe
    int getSize();
    int getCount(int index);
}

The status would be negative if any operation overflowed/underflowed, or
zero if OK. The current status is returned by the add/subtract methods.

However I note that overflow may not be a concern. The number of items you
would have to add to a filter to overflow a count implies a filter with a
number of bits that is unrealistic to store in memory:

n = 2147483647, p = 0.001000025 (1 in 1000), m = 30875634182 (3.59GiB), k = 10

If you want to add 2 billion items (and overflow an integer count) then your
filter would be so big it would break the rest of the API that uses a 32-bit
int for the bit index. Thus only underflow is a realistic concern. This
could be documented as handled in an implementation specific manner (i.e.
throw or ignore). The API is then simplified to:

interface CountingBloomFilter {
    boolean add(CountingBloomFilter filter);
    boolean subtract(BloomFilter filter);
    boolean subtract(Hasher filter);
    boolean subtract(CountingBloomFilter filter);
    int getStatus();

    // Maybe
    int getSize();
    int getCount(int index);
}

The boolean is used to state that add/subtract did not over/underflow.
Implementations can throw if they require it. The question then becomes what
getSize() represents if an add/subtract did not execute cleanly. Under this
scheme it would be the number of (add - subtract) operations. The status
flag would be used to indicate whether the size is valid, or any of the
counts from getCount().

The simpler API is to not allow access to counts/size or adding/subtracting
counts:

interface CountingBloomFilter {
    boolean subtract(BloomFilter filter);
    boolean subtract(Hasher filter);
    int getStatus();
    // Or something like ...
    boolean isValid();
}

This filter is then only concerned with reversing the merge of Bloom
filters, with a valid status flag to indicate that the current state is
consistent (i.e. all filters have been cleanly merged/subtracted).
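As a rough sketch of the simpler API (hypothetical names; a fixed int[]
backing array as proposed above; the clamp-on-underflow policy is just one
possible choice):

    // Sketch only: subtract latches an invalid state on underflow.
    class ArrayCountingBloomFilter {
        private final int[] counts;
        private boolean valid = true;

        ArrayCountingBloomFilter(int numberOfBits) {
            counts = new int[numberOfBits];
        }

        // indices would be derived from the BloomFilter/Hasher argument
        boolean subtract(int[] indices) {
            boolean ok = true;
            for (int i : indices) {
                if (--counts[i] < 0) {
                    counts[i] = 0; // clamp; throwing is the other option
                    ok = false;
                }
            }
            valid &= ok;
            return ok;
        }

        boolean isValid() {
            return valid;
        }
    }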
WDYT?

> As for the merge question. merge is a standard bloom filter operation. It
> is well defined in the literature. Merging a bloom filter into a counting
> bloom filter means incrementing the bit counts. I think that merge/remove
> should continue to operate as though the parameter were a standard bloom
> filter.

OK. So the count is to represent the number of filters that had a bit set at
that index. This makes it clearer.

> We had spoken of implementing an adding/deleting method pair that would
> operate on CountingBloomFilters and would add/subtract the counts (e.g.
> add(CountingBloomFilter) and subtract(CountingBloomFilter)).
>
> I disagree with your proposal for the merge(Hasher) implementation, and I
> am not certain that an add(Hasher) makes sense. First consider that the
> Hasher returns the bits that are to be enabled in the Bloom filter, so
> collisions are expected. In the normal course of events a Hasher is used
> to create a normal Bloom filter where all the duplicates are removed. That
> filter is then merged into a CountingBloomFilter. So in some sense the
> Hasher and the normal Bloom filter are the same. So I would expect the
> merge of the Hasher and the merge of the normal Bloom filter created from
> that hasher into a CountingBloomFilter to yield the same result. If you
> wanted to add an add(Hasher)/delete(Hasher) pair of functions to a
> CountingBloomFilter you could implement with duplicate counting, but I am
> not certain of the validity of such a count and I fear that it muddies the
> waters with respect to what the CountingBloomFilter is counting.

Agreed.

> Claude
>
> On Sat, Feb 29, 2020 at 2:13 PM Alex Herbert <alex.d.herb...@gmail.com>
> wrote:
>
>>
>>> On 29 Feb 2020, at 07:46, Claude Warren <cla...@xenei.com> wrote:
>>>
>>> Alex would you take a look at pull request 131 [1]. It adds a new
>>> hasher implementation and makes the HashFunctionValidator available for
>>> public use.
>>>
>>> [1] https://github.com/apache/commons-collections/pull/131
>>
>> OK. I’ll take a look.
>>
>> I’ve been thinking about the counting Bloom filter and the backing
>> storage. In summary:
>>
>> 1. The backing storage should be a fixed array.
>> 2. Merging a Hasher should count duplicate indices, not flatten them all
>> to a single count.
>>
>> For background I’ve used the standard formulas to estimate the number of
>> indices that will be non-zero in a Bloom filter. The wikipedia page
>> gives this formula for the expected proportion of bits set to 0 (E(q))
>> if you have inserted i elements into a filter of size m using k hash
>> functions:
>>
>> E(q) = (1 - 1/m)^(k*i)
>>
>> So a rough guess of the proportion of indices (bits) used by a filter is
>> 1 - E(q).
>>
>> Here is a table of Bloom filters with different collision probabilities
>> and the proportion of bits that will be set when 1%, 10%, 100% of the
>> capacity of the filter has been met:
>>
>> n      p      m       k   i      E(q)    bits
>> 1000   1E-04  19171   13  10     0.9932  0.0068
>> 1000   1E-04  19171   13  100    0.9344  0.0656
>> 1000   1E-04  19171   13  1000   0.5076  0.4924
>> 1000   1E-05  23963   17  10     0.9929  0.0071
>> 1000   1E-05  23963   17  100    0.9315  0.0685
>> 1000   1E-05  23963   17  1000   0.4919  0.5081
>> 1000   1E-06  28756   20  10     0.9931  0.0069
>> 1000   1E-06  28756   20  100    0.9328  0.0672
>> 1000   1E-06  28756   20  1000   0.4988  0.5012
>> 10000  1E-06  287552  20  100    0.9931  0.0069
>> 10000  1E-06  287552  20  1000   0.9328  0.0672
>> 10000  1E-06  287552  20  10000  0.4988  0.5012
>>
>> The point is that if you create a Bloom filter and fill it to 10% of the
>> intended capacity the number of indices used will be about 6-7% of the
>> filter bits.
>>
>> So how to store the counts? Currently the counting bloom filter uses a
>> TreeMap<Integer, Integer>. I tried:
>>
>> TreeMap<Integer, Integer>
>> HashMap<Integer, Integer>
>> TreeSet<MutableCount>
>> HashSet<MutableCount>
>> int[]
>>
>> The MutableCount is a custom object that stores the bit index, uses it
>> for a hash code, and has a mutable integer count field. It allows the
>> count to be incremented/decremented if the object is in the set:
>>
>> static final class MutableCount implements Comparable<MutableCount> {
>>     final int key;
>>     int count;
>>     // etc
>> }
>>
>> This is adapted from the Bag<T> collection which stores an item count
>> with a MutableInteger. Here the mutable count is part of the same object
>> T inserted into the Set. So you can find the object, change the count
>> and not have to put it back into the set. This is more efficient than
>> changing the Integer stored in a Map.
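As an aside, the mutate-in-place lookup for the TreeSet case would be
something like this sketch (hypothetical; it assumes MutableCount has an
(int) constructor and compares on key):

    import java.util.TreeSet;

    // TreeSet has no get(), but floor() returns the stored instance when
    // an equal element is present, so the count is mutated in place.
    static void increment(TreeSet<MutableCount> counts, int index) {
        MutableCount probe = new MutableCount(index);
        MutableCount existing = counts.floor(probe);
        if (existing != null && existing.key == index) {
            existing.count++;       // no remove/re-add required
        } else {
            probe.count = 1;
            counts.add(probe);
        }
    }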
>>
>> I’ve estimated memory usage using an idea based on this article from
>> JavaWorld: Java Tip 130: Do you know your data size? [1].
>>
>> Basically you:
>>
>> - create an object and throw it away. All classes are then initialised.
>> - Then you free memory (run garbage collection) and get the current
>>   memory usage
>> - Then create a lot of your object (held in an array)
>> - Then measure memory usage again
>> - memory = (after - before) / count
>>
>> Here is some example output for n bits set in size m:
>>
>> 13107 / 262144 (0.050) : TreeMap<Integer, Integer> = 733947 bytes
>> 26214 / 262144 (0.100) : TreeMap<Integer, Integer> = 1467866 bytes
>> 13107 / 262144 (0.050) : TreeSet<MutableCount> = 838928 bytes
>> 26214 / 262144 (0.100) : TreeSet<MutableCount> = 1677776 bytes
>> 13107 / 262144 (0.050) : HashMap<Integer, Integer> = 1677712 bytes
>> 26214 / 262144 (0.100) : HashMap<Integer, Integer> = 2306739 bytes
>> 13107 / 262144 (0.050) : HashSet<MutableCount> = 1782664 bytes
>> 26214 / 262144 (0.100) : HashSet<MutableCount> = 2516656 bytes
>> 0 / 262144 (0.000) : int[] = 1048608 bytes
>> 0 / 262144 (0.000) : short[] = 524320 bytes
>>
>> The estimate is accurate to 0.0001% for the arrays so the method is
>> working. The HashMap was created with the capacity set to the expected
>> capacity of the filter (m).
>>
>> I’ve chosen these sizes because at 5% full a HashSet is less memory
>> efficient than using a fixed size array, and at 10% the TreeSet is also
>> less efficient.
>>
>> Note that the java.util.Tree/HashSet versions just wrap a Map and
>> insert a dummy object for all keys in the Map. So here a Set is not as
>> efficient as a Map because in the Map test I always inserted the same
>> Integer object representing 1. This would be the same as using a Set
>> with an Integer key but here the Set had to contain the MutableCount
>> which has an extra int field and is larger than an Integer.
>>
>> These data lead me to think that a counting Bloom filter should just
>> use a fixed size backing array:
>>
>> - If created using the same Shape as a standard Bloom filter it uses a
>>   fixed size. This has a high memory cost when the filter is empty but
>>   when it exceeds 10% of the intended capacity it is more efficient than
>>   a dynamic backing storage.
>> - All operations will operate in order(n) time for an operation with
>>   another filter with n indices. Each individual index count in the
>>   filter will have order(1) time for access/update. Performance will be
>>   limited by the memory cache of the entire array.
>>
>> The issue is that a counting Bloom filter with a very low number of
>> inserted items will be memory inefficient. But under what circumstance
>> will such a filter be used in a short-term lifecycle? If it is simply
>> to merge into another filter then this can be done using a merge with a
>> Hasher. If counts are to be merged then perhaps we provide a method to
>> merge counts using the same data structure returned by the
>> CountingBloomFilter getCounts() method, e.g. using a stream of
>> <index,count> pairs:
>>
>> Stream<int[]> getCounts();
>> void add(Stream<int[]> counts);
>>
>> The issue here is the Shape and HashFunctionIdentity of the origin of
>> the merge cannot be validated. So just leave it out and use the merge
>> with a Hasher.
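For illustration, a sketch of those two methods over an int[] backing array
(hypothetical code; it assumes the counts field from the earlier sketch):

    import java.util.stream.IntStream;
    import java.util.stream.Stream;

    // Sketch: counts exposed/merged as {index, count} pairs.
    Stream<int[]> getCounts() {
        return IntStream.range(0, counts.length)
                        .filter(i -> counts[i] != 0)
                        .mapToObj(i -> new int[] {i, counts[i]});
    }

    void add(Stream<int[]> countPairs) {
        // Note: no Shape/HashFunctionIdentity validation is possible here,
        // which is the drawback discussed above
        countPairs.forEach(pair -> counts[pair[0]] += pair[1]);
    }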
>>
>> Thus the next issue with the counting Bloom filter implementation.
>> Currently when it merges with a Hasher it puts all the indices into a
>> Set and so will only increment the count by 1 for each index identified
>> by the Hasher. This appears to miss the entire point of the counting
>> Bloom filter. If I hash an object to generate k indices I would hope
>> that I do get k indices. But the hash may not be perfect and I may get
>> [1, k] indices with some duplications. This is part of the signature of
>> that object with the given hash. So surely a counting Bloom filter
>> should accommodate this. If my Hasher generates the same index 20 times
>> this should result in the count of that index incrementing by 20.
>>
>> The result is that if an object is added directly to a counting Bloom
>> filter using a Hasher it will have a different result than if it is
>> added to a standard Bloom filter and then that filter is added to the
>> counting Bloom filter.
>>
>> Opinions on this?
>>
>> Alex
>>
>> [1] http://www.javaworld.com/javaworld/javatips/jw-javatip130.html
>>
>>> On Tue, Feb 25, 2020 at 12:35 AM Alex Herbert <alex.d.herb...@gmail.com>
>>> wrote:
>>>
>>>> I have created a PR that contains most of the changes discussed in
>>>> this thread (see [1]).
>>>>
>>>> Please review the changes and comment here or on GitHub.
>>>>
>>>> I have left the CountingBloomFilter alone. I will reimplement the
>>>> add/subtract functionality as discussed into another PR.
>>>>
>>>> Alex
>>>>
>>>> [1] https://github.com/apache/commons-collections/pull/133
>>>
>>> --
>>> I like: Like Like - The likeliest place on the web
>>> <http://like-like.xenei.com>
>>> LinkedIn: http://www.linkedin.com/in/claudewarren
>
> --
> I like: Like Like - The likeliest place on the web
> <http://like-like.xenei.com>
> LinkedIn: http://www.linkedin.com/in/claudewarren