Hey, I am Manas, a 2nd-year computer engineering student. This is my first time applying to GSoC; could someone help me with a project idea?
On Mon, Mar 2, 2020 at 6:42 PM Alex Herbert <alex.d.herb...@gmail.com> wrote:
>
> On 02/03/2020 11:32, Claude Warren wrote:
> > my thought on changing the BloomFilter.merge() to return a boolean is
> > along the lines you had: return true on a successful merge (even if there
> > are no changes in the enabled bits). And yes, for most cases the standard
> > Bloom filter will return true, but the change is really to support
> > extensions to the Bloom filter, so I think it is reasonable.
> >
> > As for getCounts(): suppose we split it into two methods:
> >
> > // return the indexes that are enabled. This is equivalent to
> > // CountingBloomFilter.getHasher().getBits( CountingBloomFilter.getShape() );
> > Iterator<Integer> getIndexes();
> > // get the count for the specific index.
> > int getCount( int index );
> >
> > With these methods it becomes possible to construct an iterator of int[]
> > or Map.Entry<Integer,Integer> or whatever else the developer wants.
>
> Having to call getCount() for every index is likely to be slower than all
> the garbage collection involved in iterating a disposable <index, count>
> pair. For array-backed storage the access will be order(1), but other
> storage may be a lot slower. This method ideally should traverse the
> storage once and make each <index, count> pair available, ideally without
> creating each <index, count> pair as a disposable object.
>
> Following on from previously, another disadvantage of the consumer
> approach is the lack of fast fail. You can remedy this by adding a boolean
> return from the consumer to indicate whether you want to consume more items:
>
> interface BitCountConsumer {
>     boolean accept(int index, int count);
> }
>
> A CountingBloomFilter implementation then has something like:
>
> void getCounts(BitCountConsumer consumer) {
>     while (hasNext()) {
>         next();
>         if (!consumer.accept(index(), count())) {
>             break;
>         }
>     }
> }
>
> Looking at the getHasher().getBits(...) idea I noticed that StaticHasher
> is constructed using:
>
> Iterator<Integer> iter
>
> not:
>
> PrimitiveIterator.OfInt
>
> I'd prefer the primitive specialisation in this constructor. It would
> break the HasherBloomFilter merge function, but that can be fixed.
> However, it may be redundant. The StaticHasher has some oddities with the
> available constructors:
>
> public StaticHasher(final StaticHasher hasher, final Shape shape)
>
> - why have a shape parameter? The constructor checks the shape is the same
> as that of the hasher. So why not just a plain copy constructor:
>
> public StaticHasher(final StaticHasher hasher)
>
> This constructor is only used in the unit tests. Given the StaticHasher is
> immutable, a copy constructor seems odd. I think it should be dropped.
>
> Here is a constructor that is missing, IMO:
>
> public StaticHasher(final Set<Integer> indices, final Shape shape)
>
> - Could be used to generically create a static hasher. The indices are
> ensured to be unique by the Set but should be checked to be within the
> shape.
>
> With these constructors, the public StaticHasher(final Iterator<Integer>
> iter, final Shape shape) is only used by the BitSetBloomFilter:
>
> public StaticHasher getHasher() {
>     return new StaticHasher(bitSet.stream().iterator(), getShape());
> }
>
> Becomes:
>
> public StaticHasher getHasher() {
>     return new StaticHasher(
>         bitSet.stream().boxed().collect(Collectors.toCollection(TreeSet::new)),
>         getShape());
> }
>
> This is a very roundabout way to construct a StaticHasher.
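To see why it is roundabout, here is a small illustrative sketch (not library code): BitSet.stream() already yields the set bits in ascending order, and its iterator is a primitive int iterator, so boxing into a TreeSet only re-boxes and re-sorts values that are already unique and sorted.

    import java.util.BitSet;
    import java.util.PrimitiveIterator;

    // Walk the set bits of a BitSet directly via the primitive iterator.
    static void printSetBits(BitSet bitSet) {
        PrimitiveIterator.OfInt indices = bitSet.stream().iterator();
        while (indices.hasNext()) {
            System.out.println(indices.nextInt()); // ascending, no boxing needed
        }
    }
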
Perhaps a > constructor accepting a BitSet would be useful? > > So the API only requires: > > public StaticHasher(final Hasher hasher, final Shape shape) > public StaticHasher(final Set<Integer> indices, final Shape shape) > // Maybe > public StaticHasher(final BitSet indices, final Shape shape) > > > I wondered about a merge function for the StaticHasher: > > public StaticHasher merge(final StaticHasher other) > > The merge function can make some assumptions about the two arrays to > merge as the class is final and the values are verified to be sorted and > within the Shape. However there is not a use for it other than in the > merge of two HasherBloomFilters, which is not a recommended operation > anyway. > > > > Claude > > > > On Mon, Mar 2, 2020 at 10:48 AM Alex Herbert <alex.d.herb...@gmail.com> > > wrote: > > > >> On 02/03/2020 09:38, Claude Warren wrote: > >>> It is not too late to update the BloomFIlter interface to have the > merge > >>> return a boolean. The incorrect Shape would still throw an exception, > so > >>> the return value would only come into play if the bits could not be > set. > >>> > >>> thoughts? > >> I don't see the harm in it. But what would the return value be for? > >> > >> For a standard collection it would be if the collection was changed by > >> the operation: > >> > >> Collection.add/remove return "true if this collection changed as a > >> result of the call" > >> > >> So here is the equivalent: > >> > >> return "true if this filter was changed as a result of the call" > >> > >> This is computationally slow to track. It also is confusing if the > >> filter was successfully merged but no bits were changed to then return > >> false because the filter was actually incorporated. So it would go along > >> the lines that we discussed for the counting Bloom filter: > >> > >> return "true if this filter was successfully merged as a result of the > >> call" > >> > >> For most cases in the current library it would be true when an exception > >> is not thrown. However the merge of the counting Bloom filter may have > >> reason to return false, e.g. overflow. > >> > >>> On Mon, Mar 2, 2020 at 7:56 AM Claude Warren <cla...@xenei.com> wrote: > >>> > >>>> for the remove(), add(), and subtract() methods I agree that void is > not > >>>> correct and it should be boolean and be the same as the value you > would > >> get > >>>> from calling isValid(). > >>>> > >>>> You are correct the getCounts() should return an iterator of some type > >> on > >>>> int[], I don't know why I thought long[]. I am happy with a plain > >>>> Iterator<int[]> as the return. > >> For the getCounts() method I am still looking for a way around having to > >> create an <index, count> pair for everything in the filter. An > >> alternative to an iterator is to use the consumer idea. Given there is > >> no primitive specialisation of BiConsumer<T, U> in JDK 8 functions we > >> define our own: > >> > >> interface BitCountConsumer { > >> void accept(int index, int count); > >> } > >> > >> The CountingBloomFilter then has: > >> > >> void forEachCount(BitCountConsumer consumer); > >> // Or > >> void getCounts(BitCountConsumer consumer); > >> > >> > >> You can then directly pass the counts from the backing storage to the > >> destination. 
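To make "directly pass the counts from the backing storage to the destination" concrete, a minimal sketch using the void-returning consumer just described; the getCounts(BitCountConsumer) method and the receiving int[] are assumptions for illustration, not an agreed API:

    // Consumer and a minimal source abstraction as proposed above (illustrative).
    interface BitCountConsumer {
        void accept(int index, int count);
    }

    interface CountSource {
        void getCounts(BitCountConsumer consumer);
    }

    // Add the source's counts straight into a destination count array; no
    // intermediate <index, count> objects are created along the way.
    static void addCounts(CountSource source, int[] destinationCounts) {
        source.getCounts((index, count) -> destinationCounts[index] += count);
    }
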
> >>
> >> Advantages:
> >> - No object creation
> >> Disadvantages:
> >> - The counts cannot be streamed
> >>
> >> An alternative is to provide an Iterator of long with the index and count
> >> packed into a long, with methods to extract them:
> >>
> >> PrimitiveIterator.OfLong getCounts();
> >>
> >> static int getIndex(long indexCountPair) {
> >>     return (int) (indexCountPair >>> 32);
> >> }
> >>
> >> static int getCount(long indexCountPair) {
> >>     return (int) indexCountPair;
> >> }
> >>
> >> This will almost certainly be a cause of bugs/issues for users.
> >>
> >> I believe that the counts will be used for 2 main use cases:
> >>
> >> 1. Storage
> >>
> >> 2. Adding to another counting Bloom filter
> >>
> >> Both cases are likely to be done serially and not in parallel. So
> >> providing a consumer-based API to receive the counts would work.
> >>
> >> WDYT?
> >>
> >>>> Claude
> >>>>
> >>>> On Mon, Mar 2, 2020 at 1:02 AM Alex Herbert <alex.d.herb...@gmail.com>
> >>>> wrote:
> >>>>
> >>>>>> On 1 Mar 2020, at 15:39, Claude Warren <cla...@xenei.com> wrote:
> >>>>>>
> >>>>>> I think the CountingBloomFilter interface needs to extend BloomFilter.
> >>>>> I said that but did not write it, sorry.
> >>>>>
> >>>>>> I think I am confused.
> >>>>>>
> >>>>>> I would expect CountingBloomFilter to have
> >>>>>>
> >>>>>> interface CountingBloomFilter extends BloomFilter {
> >>>>>>     // these 2 methods are the reverse of merge()
> >>>>>>     void remove( BloomFilter );
> >>>>>>     void remove( Hasher );
> >>>>> Fine. Same intention but different method names. But why void? It forces
> >>>>> you to check if the remove was valid with a second call. On the plus side
> >>>>> it matches the void merge(…) methods and in most cases a user would not
> >>>>> care to check anyway. If they are controlling the filter then leave it to
> >>>>> them to make sure they do not remove something they did not add.
> >>>>>
> >>>>>>     // these 2 methods are the addition/subtraction of counts
> >>>>>>     void add( CountingBloomFilter );
> >>>>>>     void subtract( CountingBloomFilter );
> >>>>> Fine. But same comment as above with the void return.
> >>>>>
> >>>>>>     // 2 methods to retrieve data
> >>>>>>     Stream<long[]> getCounts();
> >>>>> I don’t like this use of long[]. In my previous post I argued that if you
> >>>>> were ever to want to store more than max integer items in a filter then
> >>>>> the Bloom filter would have more bit indices than max integer. So we never
> >>>>> have to support long counts. A filter that exceeds max integer for a count
> >>>>> is highly likely to be saturated and of no use as a filter anyway.
> >>>>>
> >>>>> For most backing implementations the object type of the stream will be
> >>>>> different, so you will have to write a Spliterator<T> implementation or
> >>>>> wrap some iterator anyway. So why not return the Spliterator:
> >>>>>
> >>>>> Spliterator<int[]> getCounts();
> >>>>>
> >>>>> Since the backing implementation will likely not store int[] pairs, this
> >>>>> will have a lot of object creation and garbage collection overhead to go
> >>>>> through the counts. This does not seem to be a big concern here if the
> >>>>> purpose is the same as for the BloomFilter's long[] getBits(), i.e. to
> >>>>> get a canonical representation for storage.
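For reference, a minimal sketch of the packed-long idea above. Note the iterator type is java.util.PrimitiveIterator.OfLong, and in a Java interface a method can be default or static but not both, so the helpers below are plain static; the class and its name are illustrative only, not a proposed final API:

    // Pack a bit index and its count into one long so a PrimitiveIterator.OfLong
    // can expose <index, count> pairs without boxing.
    final class IndexCountPair {
        private IndexCountPair() {}

        static long pack(int index, int count) {
            return ((long) index << 32) | (count & 0xffffffffL);
        }

        static int getIndex(long indexCountPair) {
            // The high 32 bits hold the index.
            return (int) (indexCountPair >>> 32);
        }

        static int getCount(long indexCountPair) {
            // The low 32 bits hold the count.
            return (int) indexCountPair;
        }
    }

As noted, callers that forget to unpack the value, or mix up index and count, get silently wrong results, which is the main argument against this encoding.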
> >>>>> > >>>>> Note: The Spliterator may not have a known size (number of non zero > >> bits) > >>>>> at creation, for example if the counts are stored in a fixed size > >> array. > >>>>> Thus efficient parallel traversal by binary tree splitting is limited > >> by > >>>>> how evenly the counts are distributed. For a backing implementation > >> using a > >>>>> collection then the size should be known. In this case a Spliterator > >> would > >>>>> be of more use than a plain Iterator. You can convert one to the > other > >>>>> using: > >>>>> > >>>>> java.util.Spliterators: > >>>>> public static<T> Iterator<T> iterator(Spliterator<? extends T> > >>>>> spliterator) > >>>>> public static <T> Spliterator<T> spliterator(Iterator<? extends T> > >>>>> iterator, > >>>>> long size, > >>>>> int > characteristics) > >>>>> > >>>>> So which to choose for the API? > >>>>> > >>>>> The Hasher currently uses an Iterator: > >>>>> > >>>>> PrimitiveIterator.OfInt getBits(Shape shape); > >>>>> > >>>>> In the case of a StaticHasher this could return a spliterator. But > the > >>>>> DynamicHasher needs a reworking of the internal Iterator class. It > >> could be > >>>>> a Spliterator to use the new IntConsumer API but in most (all) cases > >>>>> splitting would not be possible for dynamic hashing as the parts are > >>>>> produced in order. It is likely that they will be consumed > >> sequentially too. > >>>>> I would suggest that Spliterator is the more modern implementation, > >>>>> despite not always being applicable to parallelisation in a stream. > >>>>> Currently the iterator from the Hasher is used in forEachRemaining() > >> and > >>>>> while loop is approximately equal measure. The while loops are for a > >> fast > >>>>> exit and would be uglier if rewritten for a > >>>>> Spliterator.tryAdvance(IntConsumer) syntax. > >>>>> > >>>>> There is a use of the IteratorChain in HasherBloomFilter that would > >> need > >>>>> a rethink for spliterators. > >>>>> > >>>>> The path of least resistance is to use Iterator<int[]> for the API of > >>>>> CountingBloomFilter to be consistent with Hasher’s use of Iterator. > >>>>> > >>>>> WDYT? > >>>>> > >>>>> > >>>>>> boolean isValid() > >>>>> Fine. Allows some level of feedback that the counts are corrupt. > >>>>> > >>>>>> } > >>>>>> > >>>>>> Claude > >>>>>> > >>>>>> > >>>>>> On Sun, Mar 1, 2020 at 2:48 PM Alex Herbert < > alex.d.herb...@gmail.com > >>>>> <mailto:alex.d.herb...@gmail.com>> > >>>>>> wrote: > >>>>>> > >>>>>>>> On 1 Mar 2020, at 09:28, Claude Warren <cla...@xenei.com> wrote: > >>>>>>>> > >>>>>>>> The idea of a backing array is fine and the only problem I see > with > >>>>> it is > >>>>>>>> in very large filters (on the order of 10^8 bits and larger) but > >>>>> document > >>>>>>>> the size calculation and let the developer worry about it. > >>>>>>> Let us look at the use case where we max out the array. Using the > >> Bloom > >>>>>>> filter calculator: > >>>>>>> > >>>>>>> n = 149,363,281 > >>>>>>> p = 0.001000025 (1 in 1000) > >>>>>>> m = 2147483647 (256MiB) > >>>>>>> k = 10 > >>>>>>> > >>>>>>> n = 74,681,641 > >>>>>>> p = 0.000001 (1 in 999950) > >>>>>>> m = 2147483647 (256MiB) > >>>>>>> k = 20 > >>>>>>> > >>>>>>> n = 49,787,761 > >>>>>>> p = 0.000000001 (1 in 999924899) > >>>>>>> m = 2147483647 (256MiB) > >>>>>>> k = 30 > >>>>>>> > >>>>>>> So you will be able to put somewhere in the order of 10^8 or 10^7 > >> items > >>>>>>> into the filter. I would say that anyone putting more than that > into > >>>>> the > >>>>>>> filter has an unusual use case. 
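Those capacity figures can be reproduced from the usual approximation p = (1 - e^(-kn/m))^k, rearranged to n = -(m / k) * ln(1 - p^(1/k)); a small illustrative helper (not library code), with the calculator rows above shown as comments:

    // Maximum number of items n that a filter of m bits with k hash functions can
    // hold at false-positive probability p.
    static long maxItems(long m, int k, double p) {
        return (long) Math.floor(-((double) m / k) * Math.log(1.0 - Math.pow(p, 1.0 / k)));
    }

    // maxItems(2147483647L, 10, 0.001) ~ 1.49e8 (matches n = 149,363,281 above)
    // maxItems(2147483647L, 20, 1e-6)  ~ 7.47e7 (matches n = 74,681,641 above)
    // maxItems(2147483647L, 30, 1e-9)  ~ 4.98e7 (matches n = 49,787,761 above)
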
The CountingBloomFilter can throw > an > >>>>>>> exception if m is too large and will throw an OutOfMemoryError if > you > >>>>>>> cannot allocate an array large enough. > >>>>>>> > >>>>>>> One clear point here is that you cannot use a short as a 16-bit > count > >>>>>>> would easily overflow. So you have to use an integer array for the > >>>>> counts. > >>>>>>> A maximum length int[] is roughly 8GB. > >>>>>>> > >>>>>>> What would another implementation cost in terms of memory? The > >>>>>>> TreeMap<Integer, Integer> was the most space efficient. In the > >> previous > >>>>>>> e-mail the saturation of a Bloom filter bits was approximately 50% > >>>>> when at > >>>>>>> the intended capacity. So we have to estimate the size of a TreeMap > >>>>>>> containing Integer.MAX_VALUE/2 indices ~ 2^30. The memory test > shows > >>>>> the > >>>>>>> TreeMap memory scales linearly with entries: > >>>>>>> > >>>>>>> 32768 / 65536 (0.500) : TreeMap<Integer, Integer> = 1834061 > >>>>> bytes > >>>>>>> 65536 / 131072 (0.500) : TreeMap<Integer, Integer> = 3669080 > >>>>> bytes > >>>>>>> 131072 / 262144 (0.500) : TreeMap<Integer, Integer> = 7339090 > >>>>> bytes > >>>>>>> So what is the memory for a TreeMap with 2^30 indices. I make it > >> about: > >>>>>>> (2^30 / 131,072) * 7,339,090 bytes ~ 6e10 bytes = 55.99 GB > >>>>>>> > >>>>>>> I would say that this amount of RAM is unusual. It is definitely > not > >> as > >>>>>>> efficient as using an array. So very large counting Bloom filters > are > >>>>> going > >>>>>>> to require some thought as to the hardware they run on. This may > not > >>>>> be the > >>>>>>> case in 10 years time. > >>>>>>> > >>>>>>> I would say that we try an int[] backing array for the storage > >>>>>>> implementation and document it’s limitations. A different > >>>>> implementation > >>>>>>> could be provided in future if required. > >>>>>>> > >>>>>>> This could be done by making CountingBloomFilter an interface that > >>>>> extends > >>>>>>> BloomFilter with the methods: > >>>>>>> > >>>>>>> subtract(BloomFilter filter) > >>>>>>> subtract(Hasher filter) > >>>>>>> > >>>>>>> These will negate the effect of the corresponding > merge(BloomFilter) > >>>>>>> operation. > >>>>>>> > >>>>>>> Do we also need access to the counts and add/subtract of another > >>>>>>> CountingBloomFilter?: > >>>>>>> > >>>>>>> add(CountingBloomFilter filter); > >>>>>>> subtract(CountingBloomFilter filter); > >>>>>>> > >>>>>>> Iterator<int[]> getCounts(); > >>>>>>> int getSize(); // Number of items added > >>>>>>> > >>>>>>> The CountingBloomFilter is then an interface that defines how to > >>>>> reverse > >>>>>>> the merge of some bits into the filter. > >>>>>>> > >>>>>>> My concern is the inefficiency of the creation of objects in any > >> method > >>>>>>> that provides access to the counts (e.g. above using an iterator as > >> for > >>>>>>> Hasher.getBits()). I presume this method would be to allow some > type > >> of > >>>>>>> storage/serialisation of the filter akin to the long[] getBits() > >>>>> method of > >>>>>>> BloomFilter. So it may be better done using a method: > >>>>>>> > >>>>>>> int getCount(int index); > >>>>>>> > >>>>>>> The caller can then use long[] getBits() to get the indices set in > >> the > >>>>>>> filter and then for each non-zero bit index call getCount(index). > Or > >>>>> just > >>>>>>> not include the method as the counts are only of concern when > storing > >>>>> the > >>>>>>> filter. This functionality is cleaner pushed into an > implementation. 
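As an illustration of the storage use case just mentioned, a sketch that serialises <index, count> pairs by walking long[] getBits() and calling the proposed int getCount(int index); the CountingBits interface below is a stand-in for the API being discussed, not an existing type:

    import java.io.DataOutputStream;
    import java.io.IOException;

    // Stand-in for the API under discussion: the bit words plus a count accessor.
    interface CountingBits {
        long[] getBits();         // as on BloomFilter
        int getCount(int index);  // proposed accessor
    }

    // Write each <index, count> pair in ascending index order, giving a
    // canonical form suitable for storage.
    static void writeCounts(CountingBits filter, DataOutputStream out) throws IOException {
        long[] bits = filter.getBits();
        for (int i = 0; i < bits.length; i++) {
            long word = bits[i];
            while (word != 0) {
                int index = i * Long.SIZE + Long.numberOfTrailingZeros(word);
                out.writeInt(index);
                out.writeInt(filter.getCount(index));
                word &= word - 1; // clear the lowest set bit
            }
        }
    }
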
> >>>>>>> > >>>>>>> In a previous post we discussed whether to throw an exception on > >>>>>>> overflow/underflow or raise in an invalid flag. Using the invalid > >> flag > >>>>> idea > >>>>>>> the interface would be: > >>>>>>> > >>>>>>> interface CountingBloomFilter { > >>>>>>> int add(CountingBloomFilter filter); > >>>>>>> int subtract(BloomFilter filter); > >>>>>>> int subtract(Hasher filter); > >>>>>>> int subtract(CountingBloomFilter filter); > >>>>>>> int getStatus(); > >>>>>>> > >>>>>>> // Maybe > >>>>>>> int getSize(); > >>>>>>> int getCount(int index); > >>>>>>> } > >>>>>>> > >>>>>>> The status would be negative if any operation > overflowed/underflowed, > >>>>> or > >>>>>>> zero if OK. The current status is returned by the add/subtract > >> methods. > >>>>>>> However I note that overflow may not be a concern. The number of > >> items > >>>>> to > >>>>>>> add to a filter to create overflow would be using a filter with a > >>>>> number of > >>>>>>> bits that is unrealistic to store in memory: > >>>>>>> > >>>>>>> n = 2147483647 > >>>>>>> p = 0.001000025 (1 in 1000) > >>>>>>> m = 30875634182 (3.59GiB) > >>>>>>> k = 10 > >>>>>>> > >>>>>>> If you want to add 2 billion items (and overflow an integer count) > >> then > >>>>>>> your filter would be so big it would break the rest of the API that > >>>>> uses a > >>>>>>> 32-bit int for the bit index. > >>>>>>> > >>>>>>> Thus only underflow is a realistic concern. This could be > documented > >> as > >>>>>>> handled in an implementation specific manner (i.e. throw or > ignore). > >>>>> The > >>>>>>> API is then simplified to: > >>>>>>> > >>>>>>> interface CountingBloomFilter { > >>>>>>> boolean add(CountingBloomFilter filter); > >>>>>>> boolean subtract(BloomFilter filter); > >>>>>>> boolean subtract(Hasher filter); > >>>>>>> boolean subtract(CountingBloomFilter filter); > >>>>>>> int getStatus(); > >>>>>>> > >>>>>>> // Maybe > >>>>>>> int getSize(); > >>>>>>> int getCount(int index); > >>>>>>> } > >>>>>>> > >>>>>>> The boolean is used to state that add/subtract did not > >> over/underflow. > >>>>>>> Implementations can throw if they require it. > >>>>>>> > >>>>>>> The question then becomes what does getSize() represent if an > >>>>> add/subtract > >>>>>>> did not execute cleanly. Under this scheme it would be the number > of > >>>>> (add - > >>>>>>> subtract) operations. The status flag would be used to indicate if > >> the > >>>>> size > >>>>>>> is valid, or any of the counts from getCount(). The simpler API is > to > >>>>> not > >>>>>>> allow access to counts/size or adding/subtracting counts: > >>>>>>> > >>>>>>> interface CountingBloomFilter { > >>>>>>> boolean subtract(BloomFilter filter); > >>>>>>> boolean subtract(Hasher filter); > >>>>>>> int getStatus(); > >>>>>>> // Or something like ... > >>>>>>> boolean isValid(); > >>>>>>> } > >>>>>>> > >>>>>>> This filter is then only concerned with reversing the merge of > Bloom > >>>>>>> filters with a valid status flag to indicate that the current state > >> is > >>>>>>> consistent (i.e. all filters have been cleanly merged/subtracted). > >>>>>>> > >>>>>>> WDYT? > >>>>>>> > >>>>>>>> As for the merge question. merge is a standard bloom filter > >>>>> operation. > >>>>>>> It > >>>>>>>> is well defined in the literature. Merging a bloom filter into a > >>>>>>> counting > >>>>>>>> bloom filter means incrementing the bit counts. I think that > >>>>>>> merge/remove > >>>>>>>> should continue to operate as though the parameter were a standard > >>>>> bloom > >>>>>>>> filter. 
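A sketch of that behaviour: merge and remove treat the argument as a plain set of bits, incrementing or decrementing one count per set bit, with underflow recorded in a validity flag rather than thrown. All names here are illustrative, not the library API:

    // Hypothetical int[]-backed counting filter: merge increments, remove decrements.
    class SimpleCountingFilter {
        private final int[] counts;
        private boolean valid = true;

        SimpleCountingFilter(int numberOfBits) {
            counts = new int[numberOfBits];
        }

        // Increment the count of every index set in the other filter's bit words.
        void merge(long[] bits) {
            forEachSetBit(bits, index -> counts[index]++);
        }

        // Decrement the counts; a decrement below zero marks the filter invalid.
        void remove(long[] bits) {
            forEachSetBit(bits, index -> {
                if (--counts[index] < 0) {
                    valid = false;
                }
            });
        }

        boolean isValid() {
            return valid;
        }

        private static void forEachSetBit(long[] bits, java.util.function.IntConsumer action) {
            for (int i = 0; i < bits.length; i++) {
                long word = bits[i];
                while (word != 0) {
                    action.accept(i * Long.SIZE + Long.numberOfTrailingZeros(word));
                    word &= word - 1; // clear the lowest set bit
                }
            }
        }
    }
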
> >>>>>>>> > >>>>>>> OK. So the count is to represent the number of filters that had a > bit > >>>>> set > >>>>>>> at that index. This makes it more clear. > >>>>>>> > >>>>>>>> We had spoken of implementing and adding/deleting method pair that > >>>>> would > >>>>>>>> operate on CountingBloomFilters and would add/subtract the counts. > >>>>> (e.g. > >>>>>>>> add(CountingBloomFilter) and subtract(CountingBloomFilter)) > >>>>>>>> > >>>>>>>> I disagree with your proposal for the merge(Hasher) > implementation, > >>>>> and I > >>>>>>>> am not certain that an add(Hasher) makes sense. First consider > that > >>>>> the > >>>>>>>> Hasher returns the bits that are to be enabled in the Bloom filter > >> so > >>>>>>>> collisions are expected. In the normal course of events a Hasher > is > >>>>> used > >>>>>>>> to create a normal Bloom filter where all the duplicates are > >> removed. > >>>>>>> That > >>>>>>>> filter is then merged into a CountingBloomFilter. So in some > sense > >>>>> the > >>>>>>>> Hasher and the normal Bloom filter are the same. So I would > expect > >>>>> the > >>>>>>>> merge of the Hasher and the merge of the normal Bloom filter > created > >>>>> from > >>>>>>>> that hasher into a CountingBloomFilter to yield the same result. > If > >>>>> you > >>>>>>>> wanted to add an add(Hasher)/delete(Hasher) pair of functions to a > >>>>>>>> CountingBloomFilter you could implement with duplicate counting, > but > >>>>> I am > >>>>>>>> not certain of the validity of such a count and I fear that it > >> muddies > >>>>>>> the > >>>>>>>> waters with respect to what the CountingBloomFilter is counting. > >>>>>>> Agreed. > >>>>>>> > >>>>>>>> Claude > >>>>>>>> > >>>>>>>> On Sat, Feb 29, 2020 at 2:13 PM Alex Herbert < > >>>>> alex.d.herb...@gmail.com > >>>>>>> <mailto:alex.d.herb...@gmail.com <mailto:alex.d.herb...@gmail.com > >>> > >>>>>>>> wrote: > >>>>>>>> > >>>>>>>>>> On 29 Feb 2020, at 07:46, Claude Warren <cla...@xenei.com > >> <mailto: > >>>>> cla...@xenei.com> <mailto: > >>>>>>> cla...@xenei.com <mailto:cla...@xenei.com>>> wrote: > >>>>>>>>>> Alex would you take a look at pull request 131 [1]. it adds a > new > >>>>>>> hasher > >>>>>>>>>> implementation and makes the HashFunctionValidator available for > >>>>> public > >>>>>>>>> use. > >>>>>>>>>> https://github.com/apache/commons-collections/pull/131 < > >>>>> https://github.com/apache/commons-collections/pull/131> < > >>>>>>> https://github.com/apache/commons-collections/pull/131 < > >>>>> https://github.com/apache/commons-collections/pull/131>> < > >>>>>>>>> https://github.com/apache/commons-collections/pull/131 < > >>>>> https://github.com/apache/commons-collections/pull/131> < > >>>>>>> https://github.com/apache/commons-collections/pull/131 < > >>>>> https://github.com/apache/commons-collections/pull/131>>> > >>>>>>>>> OK. I’ll take a look. > >>>>>>>>> > >>>>>>>>> I’ve been thinking about the counting Bloom filter and the > backing > >>>>>>>>> storage. In summary: > >>>>>>>>> > >>>>>>>>> 1. The backing storage should be a fixed array. > >>>>>>>>> 2. Merging a Hasher should count duplicate indices, not flatten > >> them > >>>>> all > >>>>>>>>> to a single count. > >>>>>>>>> > >>>>>>>>> For background I’ve used the standard formulas to estimate the > >>>>> number of > >>>>>>>>> indices that will be non-zero in a Bloom filter. 
The wikipedia > page > >>>>>>> gives > >>>>>>>>> this formula for the expected number of bits set to 0 (E(q)) if > you > >>>>> have > >>>>>>>>> inserted i elements into a filter of size m using k hash > functions: > >>>>>>>>> > >>>>>>>>> E(q) = (1 - 1/m)^ki" > >>>>>>>>> > >>>>>>>>> So a rough guess of the number of indices (bits) used by a filter > >> is > >>>>>>>>> 1-E(q). > >>>>>>>>> > >>>>>>>>> Here is a table of Bloom filters with different collision > >>>>> probabilities > >>>>>>>>> and the proportion of bits that will be set when 1%, 10%, 100% of > >> the > >>>>>>>>> capacity of the filter has been met: > >>>>>>>>> > >>>>>>>>> n p m k I E(q) bits > >>>>>>>>> 1000 1E-04 19171 13 10 0.9932 0.0068 > >>>>>>>>> 1000 1E-04 19171 13 100 0.9344 0.0656 > >>>>>>>>> 1000 1E-04 19171 13 1000 0.5076 0.4924 > >>>>>>>>> 1000 1E-05 23963 17 10 0.9929 0.0071 > >>>>>>>>> 1000 1E-05 23963 17 100 0.9315 0.0685 > >>>>>>>>> 1000 1E-05 23963 17 1000 0.4919 0.5081 > >>>>>>>>> 1000 1E-06 28756 20 10 0.9931 0.0069 > >>>>>>>>> 1000 1E-06 28756 20 100 0.9328 0.0672 > >>>>>>>>> 1000 1E-06 28756 20 1000 0.4988 0.5012 > >>>>>>>>> 10000 1E-06 287552 20 100 0.9931 0.0069 > >>>>>>>>> 10000 1E-06 287552 20 1000 0.9328 0.0672 > >>>>>>>>> 10000 1E-06 287552 20 10000 0.4988 0.5012 > >>>>>>>>> > >>>>>>>>> The point is that if you create a Bloom filter and fill it to 10% > >> of > >>>>> the > >>>>>>>>> intended capacity the number of indices used will be about 6-7% > of > >>>>> the > >>>>>>>>> filter bits. > >>>>>>>>> > >>>>>>>>> So how to store the counts? Currently the counting bloom filter > >> uses > >>>>> a > >>>>>>>>> TreeMap<Integer, Integer>. I tried: > >>>>>>>>> > >>>>>>>>> TreeMap<Integer, Integer> > >>>>>>>>> HashMap<Integer, Integer> > >>>>>>>>> TreeSet<MutableCount> > >>>>>>>>> HashSet<MutableCount> > >>>>>>>>> int[] > >>>>>>>>> > >>>>>>>>> The MutableCount is a custom object that stores the bit index and > >>>>> uses > >>>>>>> it > >>>>>>>>> for a hash code and then has a mutable integer count field. It > >> allows > >>>>>>> the > >>>>>>>>> count to be incremented/decremented if the object is in the set: > >>>>>>>>> > >>>>>>>>> static final class MutableCount implements > >>>>> Comparable<MutableCount> { > >>>>>>>>> final int key; > >>>>>>>>> int count; > >>>>>>>>> // etc > >>>>>>>>> } > >>>>>>>>> > >>>>>>>>> This is adapted from the Bag<T> collection which stores an item > >> count > >>>>>>> with > >>>>>>>>> a MutableInteger. Here the mutable count is part of the same > >> object T > >>>>>>>>> inserted into the Set. So you can find the object, change the > count > >>>>> and > >>>>>>> not > >>>>>>>>> have to put it back into the set. This is more efficient than > >>>>> changing > >>>>>>> the > >>>>>>>>> Integer stored in a Map. > >>>>>>>>> > >>>>>>>>> I’ve estimated memory usage using an idea based on this article > >> from > >>>>>>>>> JavaWorld: Java Tip 130: Do you know your data size? [1]. > >>>>>>>>> > >>>>>>>>> Basically you: > >>>>>>>>> > >>>>>>>>> - create an object and throw it away. All classes are then > >>>>> initialised. 
> >>>>>>>>> - Then you free memory (run garbage collection) and get the > current > >>>>>>> memory > >>>>>>>>> usage > >>>>>>>>> - Then create a lot of your object (held in an array) > >>>>>>>>> - Then measure memory usage again > >>>>>>>>> - memory = (after - before) / count > >>>>>>>>> > >>>>>>>>> Here is some example output for n bits set in size m: > >>>>>>>>> > >>>>>>>>> 13107 / 262144 (0.050) : TreeMap<Integer, Integer> = > 733947 > >>>>> bytes > >>>>>>>>> 26214 / 262144 (0.100) : TreeMap<Integer, Integer> = > 1467866 > >>>>> bytes > >>>>>>>>> 13107 / 262144 (0.050) : TreeSet<MutableCount> = > 838928 > >>>>> bytes > >>>>>>>>> 26214 / 262144 (0.100) : TreeSet<MutableCount> = > 1677776 > >>>>> bytes > >>>>>>>>> 13107 / 262144 (0.050) : HashMap<Integer, Integer> = > 1677712 > >>>>> bytes > >>>>>>>>> 26214 / 262144 (0.100) : HashMap<Integer, Integer> = > 2306739 > >>>>> bytes > >>>>>>>>> 13107 / 262144 (0.050) : HashSet<MutableCount> = > 1782664 > >>>>> bytes > >>>>>>>>> 26214 / 262144 (0.100) : HashSet<MutableCount> = > 2516656 > >>>>> bytes > >>>>>>>>> 0 / 262144 (0.000) : int[] = > 1048608 > >>>>> bytes > >>>>>>>>> 0 / 262144 (0.000) : short[] = > 524320 > >>>>> bytes > >>>>>>>>> The estimate is accurate to 0.0001% for the arrays so the method > is > >>>>>>>>> working. The HashMap was created with the capacity set to the > >>>>> expected > >>>>>>>>> capacity of the filter (m). > >>>>>>>>> > >>>>>>>>> I’ve chosen these sizes because at 5% full a HashSet is less > memory > >>>>>>>>> efficient than using a fixed size array, and at 10% the TreeSet > is > >>>>> also > >>>>>>>>> less efficient. > >>>>>>>>> > >>>>>>>>> Note that the java.util.Tree/HashSet versions just wrap a Map and > >>>>>>> insert a > >>>>>>>>> dummy object for all keys in the Map. So here a Set is not as > >>>>> efficient > >>>>>>> as > >>>>>>>>> a Map because in the Map test I always inserted the same Integer > >>>>> object > >>>>>>>>> representing 1. This would be the same as using a Set with an > >> Integer > >>>>>>> key > >>>>>>>>> but here the Set had to contain the MutableCount which has an > extra > >>>>> int > >>>>>>>>> field and is larger than an Integer. > >>>>>>>>> > >>>>>>>>> These data lead me to think that a counting Bloom filter should > >> just > >>>>>>> use a > >>>>>>>>> fixed size backing array: > >>>>>>>>> > >>>>>>>>> - If created using the same Shape as a standard Bloom filter it > >> uses > >>>>> a > >>>>>>>>> fixed size. This has high memory cost when the filter is empty > but > >>>>> when > >>>>>>> it > >>>>>>>>> exceeds 10% of the intended capacity it is more efficient than a > >>>>> dynamic > >>>>>>>>> backing storage. > >>>>>>>>> - All operations will operate in order(n) time for an operation > >> with > >>>>>>>>> another filter with n indices. Each individual index count in the > >>>>> filter > >>>>>>>>> will have order(1) time for access/update. Performance will be > >>>>> limited > >>>>>>> by > >>>>>>>>> the memory cache of the entire array. > >>>>>>>>> > >>>>>>>>> The issue is that a counting Bloom filter with a very low number > of > >>>>>>>>> inserted items will be memory inefficient. But under what > >>>>> circumstance > >>>>>>> will > >>>>>>>>> such a filter be used in a short-term lifecycle? If it is simply > to > >>>>>>> merge > >>>>>>>>> into another filter then this can be done using a merge with a > >>>>> Hasher. 
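For reference, a rough sketch of the measurement procedure described above (allocate many instances and difference the used heap, as in the JavaWorld tip cited below); the numbers are estimates that depend on the JVM, and the factory/class being measured is a stand-in:

    // Estimate the per-instance memory cost of objects produced by the factory.
    static long estimateBytesPerInstance(int count, java.util.function.Supplier<Object> factory) {
        Runtime rt = Runtime.getRuntime();
        Object[] hold = new Object[count];   // allocate the holder before measuring
        runGc(rt);
        long before = rt.totalMemory() - rt.freeMemory();
        for (int i = 0; i < count; i++) {
            hold[i] = factory.get();         // keep strong references
        }
        runGc(rt);
        long after = rt.totalMemory() - rt.freeMemory();
        if (hold[0] == null) {               // never true for a non-null factory;
            throw new AssertionError();      // prevents the allocations being optimised away
        }
        return (after - before) / count;
    }

    private static void runGc(Runtime rt) {
        for (int i = 0; i < 4; i++) {
            rt.gc();
            try {
                Thread.sleep(100);           // give the collector a moment to run
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }
    }
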
> >>>>>>>>> If counts are to be merged then perhaps we provide a method to merge
> >>>>>>>>> counts using the same data structure returned by the CountingBloomFilter
> >>>>>>>>> getCounts() method, e.g. using a stream of <index, count> pairs:
> >>>>>>>>>
> >>>>>>>>> Stream<int[]> getCounts();
> >>>>>>>>> void add(Stream<int[]> counts);
> >>>>>>>>>
> >>>>>>>>> The issue here is that the Shape and HashFunctionIdentity of the origin of
> >>>>>>>>> the merge cannot be validated. So just leave it out and use the merge with
> >>>>>>>>> a Hasher.
> >>>>>>>>>
> >>>>>>>>> Thus the next issue with the counting Bloom filter implementation.
> >>>>>>>>> Currently when it merges with a Hasher it puts all the indices into a Set
> >>>>>>>>> and so will only increment the count by 1 for each index identified by the
> >>>>>>>>> Hasher. This appears to miss the entire point of the counting Bloom filter.
> >>>>>>>>> If I hash an object to generate k indices I would hope that I do get k
> >>>>>>>>> indices. But the hash may not be perfect and I may get [1, k] indices with
> >>>>>>>>> some duplications. This is part of the signature of that object with the
> >>>>>>>>> given hash. So surely a counting Bloom filter should accommodate this. If
> >>>>>>>>> my Hasher generates the same index 20 times this should result in the
> >>>>>>>>> count of that index incrementing by 20.
> >>>>>>>>>
> >>>>>>>>> The result is that if an object is added directly to a counting Bloom
> >>>>>>>>> filter using a Hasher it will have a different result than if it is added
> >>>>>>>>> to a standard Bloom filter and that filter is then added to the counting
> >>>>>>>>> Bloom filter.
> >>>>>>>>>
> >>>>>>>>> Opinions on this?
> >>>>>>>>>
> >>>>>>>>> Alex
> >>>>>>>>>
> >>>>>>>>> [1] http://www.javaworld.com/javaworld/javatips/jw-javatip130.html
> >>>>>>>>>
> >>>>>>>>>> On Tue, Feb 25, 2020 at 12:35 AM Alex Herbert <alex.d.herb...@gmail.com>
> >>>>>>>>>> wrote:
> >>>>>>>>>>
> >>>>>>>>>>> I have created a PR that contains most of the changes discussed in this
> >>>>>>>>>>> thread (see [1]).
> >>>>>>>>>>>
> >>>>>>>>>>> Please review the changes and comment here or on GitHub.
> >>>>>>>>>>>
> >>>>>>>>>>> I have left the CountingBloomFilter alone. I will reimplement the
> >>>>>>>>>>> add/subtract functionality as discussed in another PR.
> >>>>>>>>>>>
> >>>>>>>>>>> Alex
> >>>>>>>>>>>
> >>>>>>>>>>> [1] https://github.com/apache/commons-collections/pull/133
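Returning to the duplicate-counting question raised above, a small sketch contrasting a merge that counts every index produced by a Hasher with one that flattens the indices through a Set first. The int[] of counts stands in for the filter's backing storage, and the PrimitiveIterator.OfInt mirrors the values produced by Hasher.getBits(Shape):

    import java.util.PrimitiveIterator;

    // Each index produced by the hasher increments its count, so an index that is
    // produced 20 times adds 20 to that count.
    static void mergeCountingDuplicates(int[] counts, PrimitiveIterator.OfInt indices) {
        while (indices.hasNext()) {
            counts[indices.nextInt()]++;
        }
    }

    // Flattening through a Set, by contrast, increments each distinct index once,
    // which is the current behaviour being questioned.
    static void mergeDistinct(int[] counts, PrimitiveIterator.OfInt indices) {
        java.util.Set<Integer> distinct = new java.util.TreeSet<>();
        while (indices.hasNext()) {
            distinct.add(indices.nextInt());
        }
        for (int index : distinct) {
            counts[index]++;
        }
    }
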