I think the CountingBloomFilter interface needs to extend BloomFilter, but perhaps I am confused.
I would expect CountingBloomFilter to have:

    interface CountingBloomFilter extends BloomFilter {
        // these 2 methods are the reverse of merge()
        void remove( BloomFilter filter );
        void remove( Hasher hasher );

        // these 2 methods are the addition/subtraction of counts
        void add( CountingBloomFilter filter );
        void subtract( CountingBloomFilter filter );

        // 2 methods to retrieve data
        Stream<long[]> getCounts();
        boolean isValid();
    }

Claude

On Sun, Mar 1, 2020 at 2:48 PM Alex Herbert <alex.d.herb...@gmail.com> wrote:
>
>
> > On 1 Mar 2020, at 09:28, Claude Warren <cla...@xenei.com> wrote:
> >
> > The idea of a backing array is fine and the only problem I see with it is
> > in very large filters (on the order of 10^8 bits and larger), but document
> > the size calculation and let the developer worry about it.
>
> Let us look at the use case where we max out the array. Using the Bloom
> filter calculator:
>
> n = 149,363,281
> p = 0.001000025 (1 in 1000)
> m = 2147483647 (256MiB)
> k = 10
>
> n = 74,681,641
> p = 0.000001 (1 in 999950)
> m = 2147483647 (256MiB)
> k = 20
>
> n = 49,787,761
> p = 0.000000001 (1 in 999924899)
> m = 2147483647 (256MiB)
> k = 30
>
> So you will be able to put somewhere in the order of 10^7 to 10^8 items
> into the filter. I would say that anyone putting more than that into the
> filter has an unusual use case. The CountingBloomFilter can throw an
> exception if m is too large, and will throw an OutOfMemoryError if you
> cannot allocate an array large enough.
>
> One clear point here is that you cannot use a short, as a 16-bit count
> would easily overflow. So you have to use an integer array for the counts.
>
> A maximum length int[] is roughly 8GB.
>
> What would another implementation cost in terms of memory? The
> TreeMap<Integer, Integer> was the most space efficient. In the previous
> e-mail the saturation of the Bloom filter's bits was approximately 50% when
> at the intended capacity. So we have to estimate the size of a TreeMap
> containing Integer.MAX_VALUE/2 indices ~ 2^30. The memory test shows that
> TreeMap memory scales linearly with entries:
>
> 32768 / 65536 (0.500) : TreeMap<Integer, Integer> = 1834061 bytes
> 65536 / 131072 (0.500) : TreeMap<Integer, Integer> = 3669080 bytes
> 131072 / 262144 (0.500) : TreeMap<Integer, Integer> = 7339090 bytes
>
> So what is the memory for a TreeMap with 2^30 indices? I make it about:
>
> (2^30 / 131,072) * 7,339,090 bytes ~ 6e10 bytes = 55.99 GiB
>
> I would say that this amount of RAM is unusual. It is definitely not as
> efficient as using an array. So very large counting Bloom filters are going
> to require some thought as to the hardware they run on. This may not be the
> case in 10 years time.
>
> I would say that we try an int[] backing array for the storage
> implementation and document its limitations. A different implementation
> could be provided in future if required.
>
> This could be done by making CountingBloomFilter an interface that extends
> BloomFilter with the methods:
>
> subtract(BloomFilter filter)
> subtract(Hasher filter)
>
> These will negate the effect of the corresponding merge(BloomFilter)
> operation.
>
> Do we also need access to the counts and add/subtract of another
> CountingBloomFilter?:
>
> add(CountingBloomFilter filter);
> subtract(CountingBloomFilter filter);
>
> Iterator<int[]> getCounts();
> int getSize(); // Number of items added
>
> The CountingBloomFilter is then an interface that defines how to reverse
> the merge of some bits into the filter.
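To put those two storage costs side by side, a back-of-envelope snippet
(purely illustrative, not part of any proposed API):

    // Raw storage for an int[] of counts at the maximum bit index,
    // versus the TreeMap estimate extrapolated from the table above.
    long m = Integer.MAX_VALUE;                   // 2147483647 bit indices
    long intArrayBytes = m * 4L;                  // ~8 GiB of count storage
    long treeMapBytes = (1L << 30) / 131072L * 7339090L;
    // treeMapBytes ~ 6e10 bytes, i.e. ~56 GiB at 50% saturation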
>
> My concern is the inefficiency of creating objects in any method
> that provides access to the counts (e.g. above, using an iterator as for
> Hasher.getBits()). I presume this method would be to allow some type of
> storage/serialisation of the filter akin to the long[] getBits() method of
> BloomFilter. So it may be better done using a method:
>
> int getCount(int index);
>
> The caller can then use long[] getBits() to get the indices set in the
> filter and then, for each non-zero bit index, call getCount(index). Or just
> not include the method, as the counts are only of concern when storing the
> filter. This functionality is cleaner pushed into an implementation.
>
> In a previous post we discussed whether to throw an exception on
> overflow/underflow or raise an invalid flag. Using the invalid flag idea
> the interface would be:
>
> interface CountingBloomFilter {
>     int add(CountingBloomFilter filter);
>     int subtract(BloomFilter filter);
>     int subtract(Hasher filter);
>     int subtract(CountingBloomFilter filter);
>     int getStatus();
>
>     // Maybe
>     int getSize();
>     int getCount(int index);
> }
>
> The status would be negative if any operation overflowed/underflowed, or
> zero if OK. The current status is returned by the add/subtract methods.
>
> However, I note that overflow may not be a concern. Creating overflow
> would require adding so many items that the filter would need a number of
> bits that is unrealistic to store in memory:
>
> n = 2147483647
> p = 0.001000025 (1 in 1000)
> m = 30875634182 (3.59GiB)
> k = 10
>
> If you want to add 2 billion items (and overflow an integer count) then
> your filter would be so big it would break the rest of the API, which uses
> a 32-bit int for the bit index.
>
> Thus only underflow is a realistic concern. This could be documented as
> handled in an implementation-specific manner (i.e. throw or ignore). The
> API is then simplified to:
>
> interface CountingBloomFilter {
>     boolean add(CountingBloomFilter filter);
>     boolean subtract(BloomFilter filter);
>     boolean subtract(Hasher filter);
>     boolean subtract(CountingBloomFilter filter);
>     int getStatus();
>
>     // Maybe
>     int getSize();
>     int getCount(int index);
> }
>
> The boolean is used to state that add/subtract did not over/underflow.
> Implementations can throw if they require it.
>
> The question then becomes what getSize() represents if an add/subtract
> did not execute cleanly. Under this scheme it would be the number of
> (add - subtract) operations. The status flag would be used to indicate
> whether the size, or any of the counts from getCount(), is valid. The
> simpler API is to not allow access to counts/size or adding/subtracting
> counts:
>
> interface CountingBloomFilter {
>     boolean subtract(BloomFilter filter);
>     boolean subtract(Hasher filter);
>     int getStatus();
>     // Or something like ...
>     boolean isValid();
> }
>
> This filter is then only concerned with reversing the merge of Bloom
> filters, with a valid status flag to indicate that the current state is
> consistent (i.e. all filters have been cleanly merged/subtracted).
>
> WDYT?
>
> > As for the merge question: merge is a standard Bloom filter operation. It
> > is well defined in the literature. Merging a Bloom filter into a counting
> > Bloom filter means incrementing the bit counts. I think that merge/remove
> > should continue to operate as though the parameter were a standard Bloom
> > filter.
>
> OK. So the count is to represent the number of filters that had a bit set
> at that index. This makes it clearer.
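A minimal sketch of how an int[]-backed implementation might reverse a merge
under the invalid-flag scheme above (the method shape and names are assumed
for illustration, not library code):

    // Decrement the count at each index of a previously merged filter.
    // A negative count means this subtraction does not correspond to an
    // earlier merge, so the result is flagged invalid (an implementation
    // could equally choose to throw here).
    static boolean subtract(final int[] counts, final int[] indices) {
        boolean valid = true;
        for (final int index : indices) {
            if (--counts[index] < 0) {
                valid = false;
            }
        }
        return valid;
    }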
>
> > We had spoken of implementing an add/delete method pair that would
> > operate on CountingBloomFilters and would add/subtract the counts (e.g.
> > add(CountingBloomFilter) and subtract(CountingBloomFilter)).
> >
> > I disagree with your proposal for the merge(Hasher) implementation, and I
> > am not certain that an add(Hasher) makes sense. First consider that the
> > Hasher returns the bits that are to be enabled in the Bloom filter, so
> > collisions are expected. In the normal course of events a Hasher is used
> > to create a normal Bloom filter where all the duplicates are removed. That
> > filter is then merged into a CountingBloomFilter. So in some sense the
> > Hasher and the normal Bloom filter are the same, and I would expect the
> > merge of the Hasher and the merge of the normal Bloom filter created from
> > that hasher into a CountingBloomFilter to yield the same result. If you
> > wanted to add an add(Hasher)/delete(Hasher) pair of functions to a
> > CountingBloomFilter you could implement them with duplicate counting, but
> > I am not certain of the validity of such a count and I fear that it
> > muddies the waters with respect to what the CountingBloomFilter is
> > counting.
>
> Agreed.
>
> > Claude
> >
> > On Sat, Feb 29, 2020 at 2:13 PM Alex Herbert <alex.d.herb...@gmail.com>
> > wrote:
> >
> >>> On 29 Feb 2020, at 07:46, Claude Warren <cla...@xenei.com> wrote:
> >>>
> >>> Alex, would you take a look at pull request 131 [1]? It adds a new
> >>> hasher implementation and makes the HashFunctionValidator available
> >>> for public use.
> >>>
> >>> https://github.com/apache/commons-collections/pull/131
> >>
> >> OK. I'll take a look.
> >>
> >> I've been thinking about the counting Bloom filter and the backing
> >> storage. In summary:
> >>
> >> 1. The backing storage should be a fixed array.
> >> 2. Merging a Hasher should count duplicate indices, not flatten them all
> >> to a single count.
> >>
> >> For background I've used the standard formulas to estimate the number of
> >> indices that will be non-zero in a Bloom filter. The wikipedia page gives
> >> this formula for the expected proportion of bits set to 0, E(q), if you
> >> have inserted i elements into a filter of size m using k hash functions:
> >>
> >> E(q) = (1 - 1/m)^(ki)
> >>
> >> So a rough guess of the proportion of indices (bits) used by a filter is
> >> 1 - E(q).
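As a quick sanity check of the formula, one row of the table that follows
can be reproduced in a few lines:

    // n=1000, p=1e-4 gives m=19171 bits and k=13 hash functions; after
    // i=100 insertions the expected proportion of zero bits is ~0.9344.
    double m = 19171, k = 13, i = 100;
    double eq = Math.pow(1.0 - 1.0 / m, k * i);
    System.out.printf("E(q) = %.4f, bits set = %.4f%n", eq, 1.0 - eq);
    // prints: E(q) = 0.9344, bits set = 0.0656 (matching the table row)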
> >>
> >> Here is a table of Bloom filters with different collision probabilities
> >> and the proportion of bits that will be set when 1%, 10%, and 100% of
> >> the capacity of the filter has been met:
> >>
> >> n      p      m       k   i      E(q)    bits
> >> 1000   1E-04  19171   13  10     0.9932  0.0068
> >> 1000   1E-04  19171   13  100    0.9344  0.0656
> >> 1000   1E-04  19171   13  1000   0.5076  0.4924
> >> 1000   1E-05  23963   17  10     0.9929  0.0071
> >> 1000   1E-05  23963   17  100    0.9315  0.0685
> >> 1000   1E-05  23963   17  1000   0.4919  0.5081
> >> 1000   1E-06  28756   20  10     0.9931  0.0069
> >> 1000   1E-06  28756   20  100    0.9328  0.0672
> >> 1000   1E-06  28756   20  1000   0.4988  0.5012
> >> 10000  1E-06  287552  20  100    0.9931  0.0069
> >> 10000  1E-06  287552  20  1000   0.9328  0.0672
> >> 10000  1E-06  287552  20  10000  0.4988  0.5012
> >>
> >> The point is that if you create a Bloom filter and fill it to 10% of the
> >> intended capacity, the number of indices used will be about 6-7% of the
> >> filter bits.
> >>
> >> So how to store the counts? Currently the counting Bloom filter uses a
> >> TreeMap<Integer, Integer>. I tried:
> >>
> >> TreeMap<Integer, Integer>
> >> HashMap<Integer, Integer>
> >> TreeSet<MutableCount>
> >> HashSet<MutableCount>
> >> int[]
> >>
> >> The MutableCount is a custom object that stores the bit index, uses it
> >> for a hash code, and then has a mutable integer count field. It allows
> >> the count to be incremented/decremented if the object is in the set:
> >>
> >> static final class MutableCount implements Comparable<MutableCount> {
> >>     final int key;
> >>     int count;
> >>     // etc
> >> }
> >>
> >> This is adapted from the Bag<T> collection which stores an item count
> >> with a MutableInteger. Here the mutable count is part of the same object
> >> T inserted into the Set. So you can find the object, change the count
> >> and not have to put it back into the set. This is more efficient than
> >> changing the Integer stored in a Map.
> >>
> >> I've estimated memory usage using an idea based on this article from
> >> JavaWorld: Java Tip 130: Do you know your data size? [1].
> >>
> >> Basically you:
> >>
> >> - create an object and throw it away, so all classes are initialised
> >> - free memory (run garbage collection) and get the current memory usage
> >> - create a lot of your object (held in an array)
> >> - measure memory usage again
> >> - memory = (after - before) / count
> >>
> >> Here is some example output for n bits set in size m:
> >>
> >> 13107 / 262144 (0.050) : TreeMap<Integer, Integer> =  733947 bytes
> >> 26214 / 262144 (0.100) : TreeMap<Integer, Integer> = 1467866 bytes
> >> 13107 / 262144 (0.050) : TreeSet<MutableCount>     =  838928 bytes
> >> 26214 / 262144 (0.100) : TreeSet<MutableCount>     = 1677776 bytes
> >> 13107 / 262144 (0.050) : HashMap<Integer, Integer> = 1677712 bytes
> >> 26214 / 262144 (0.100) : HashMap<Integer, Integer> = 2306739 bytes
> >> 13107 / 262144 (0.050) : HashSet<MutableCount>     = 1782664 bytes
> >> 26214 / 262144 (0.100) : HashSet<MutableCount>     = 2516656 bytes
> >>     0 / 262144 (0.000) : int[]                     = 1048608 bytes
> >>     0 / 262144 (0.000) : short[]                   =  524320 bytes
> >>
> >> The estimate is accurate to 0.0001% for the arrays so the method is
> >> working. The HashMap was created with the capacity set to the expected
> >> capacity of the filter (m).
> >>
> >> I've chosen these sizes because at 5% full a HashSet is less memory
> >> efficient than using a fixed size array, and at 10% the TreeSet is also
> >> less efficient.
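The measurement loop described above is roughly the following (a sketch
that assumes System.gc() actually reclaims memory; the JavaWorld tip adds
more safeguards than shown here):

    import java.util.TreeMap;

    public class MemoryEstimate {
        public static void main(String[] args) {
            Runtime rt = Runtime.getRuntime();
            new TreeMap<Integer, Integer>();      // initialise classes first
            System.gc();
            long before = rt.totalMemory() - rt.freeMemory();
            int count = 10000;
            Object[] objects = new Object[count]; // hold strong references
            for (int j = 0; j < count; j++) {
                objects[j] = new TreeMap<Integer, Integer>();
            }
            System.gc();
            long after = rt.totalMemory() - rt.freeMemory();
            System.out.println((after - before) / count + " bytes per instance");
        }
    }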
> >>
> >> Note that the java.util.Tree/HashSet versions just wrap a Map and insert
> >> a dummy object for all keys in the Map. So here a Set is not as efficient
> >> as a Map, because in the Map test I always inserted the same Integer
> >> object representing 1. This would be the same as using a Set with an
> >> Integer key, but here the Set had to contain the MutableCount which has
> >> an extra int field and is larger than an Integer.
> >>
> >> These data lead me to think that a counting Bloom filter should just use
> >> a fixed size backing array:
> >>
> >> - If created using the same Shape as a standard Bloom filter it uses a
> >> fixed size. This has a high memory cost when the filter is empty, but
> >> when it exceeds 10% of the intended capacity it is more efficient than a
> >> dynamic backing storage.
> >> - All operations will run in order(n) time for an operation with another
> >> filter with n indices. Each individual index count in the filter will
> >> have order(1) time for access/update. Performance will be limited by the
> >> memory cache of the entire array.
> >>
> >> The issue is that a counting Bloom filter with a very low number of
> >> inserted items will be memory inefficient. But under what circumstance
> >> will such a filter be used in a short-term lifecycle? If it is simply to
> >> merge into another filter then this can be done using a merge with a
> >> Hasher. If counts are to be merged then perhaps we provide a method to
> >> merge counts using the same data structure returned by the
> >> CountingBloomFilter getCounts() method, e.g. using a stream of
> >> <index,count> pairs:
> >>
> >> Stream<int[]> getCounts();
> >> void add(Stream<int[]> counts);
> >>
> >> The issue here is that the Shape and HashFunctionIdentity of the origin
> >> of the merge cannot be validated. So just leave it out and use the merge
> >> with a Hasher.
> >>
> >> On to the next issue with the counting Bloom filter implementation.
> >> Currently when it merges with a Hasher it puts all the indices into a
> >> Set and so will only increment the count by 1 for each index identified
> >> by the Hasher. This appears to miss the entire point of the counting
> >> Bloom filter. If I hash an object to generate k indices I would hope
> >> that I do get k indices. But the hash may not be perfect and I may get
> >> [1, k] indices with some duplications. This is part of the signature of
> >> that object with the given hash. So surely a counting Bloom filter
> >> should accommodate this. If my Hasher generates the same index 20 times
> >> this should result in the count of that index incrementing by 20.
> >>
> >> The result is that if an object is added directly to a counting Bloom
> >> filter using a Hasher it will have a different result than if added to a
> >> standard Bloom filter and then that filter added to the counting Bloom
> >> filter.
> >>
> >> Opinions on this?
> >>
> >> Alex
> >>
> >> [1] http://www.javaworld.com/javaworld/javatips/jw-javatip130.html
> >>
> >>> On Tue, Feb 25, 2020 at 12:35 AM Alex Herbert <alex.d.herb...@gmail.com>
> >>> wrote:
> >>>
> >>>> I have created a PR that contains most of the changes discussed in
> >>>> this thread (see [1]).
> >>>>
> >>>> Please review the changes and comment here or on GitHub.
> >>>>
> >>>> I have left the CountingBloomFilter alone. I will reimplement the
> >>>> add/subtract functionality as discussed in another PR.
> >>>>
> >>>> Alex
> >>>>
> >>>> [1] https://github.com/apache/commons-collections/pull/133
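To make the two merge behaviours debated in the thread concrete, a sketch
(method names and the IntStream source of hasher indices are assumed for
illustration):

    import java.util.stream.IntStream;

    class MergeSketch {
        // Counting duplicates: the same index produced twice by the hasher
        // increments its count by 2, preserving all k generated hashes.
        static void mergeCountingDuplicates(int[] counts, IntStream indices) {
            indices.forEach(i -> counts[i]++);
        }

        // Current behaviour: duplicates are flattened into a set, so each
        // distinct index increments by exactly 1, as a standard Bloom
        // filter built from the same hasher would record it.
        static void mergeFlattened(int[] counts, IntStream indices) {
            indices.distinct().forEach(i -> counts[i]++);
        }
    }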
--
I like: Like Like - The likeliest place on the web
<http://like-like.xenei.com>
LinkedIn: http://www.linkedin.com/in/claudewarren