Take a look at
https://github.com/apache/commons-collections/pull/131/files#diff-8b2bf046dc35c88908eef196937173e1

This is a different Hasher with a smaller data footprint than the
DynamicHasher.  Like the DynamicHasher, it builds the values on the fly.
Can Spliterators be implemented on these easily?

I am torn between the interesting uses of a Spliterator and the well-known
usage pattern of an Iterator.  Perhaps we should implement both.  It
seems to me (without really looking) that we should be able to implement an
Iterator that uses the Spliterator, as in the sketch below.
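
For example, a minimal sketch using the JDK's built-in bridge (the demo
class and the sample indices are mine; the point is just
Spliterators.iterator()):

    import java.util.PrimitiveIterator;
    import java.util.Spliterator;
    import java.util.Spliterators;
    import java.util.stream.IntStream;

    public class SpliteratorAdapterDemo {
        public static void main(String[] args) {
            // Stand-in for a filter's Spliterator.OfInt of enabled bit indices.
            final Spliterator.OfInt indices = IntStream.of(3, 17, 42).spliterator();

            // BloomFilter.iterator() could then be a default method:
            //     default PrimitiveIterator.OfInt iterator() {
            //         return Spliterators.iterator(spliterator());
            //     }
            final PrimitiveIterator.OfInt it = Spliterators.iterator(indices);
            while (it.hasNext()) {
                System.out.println(it.nextInt());
            }
        }
    }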

Thoughts?

On Tue, Mar 3, 2020 at 11:54 AM Alex Herbert <alex.d.herb...@gmail.com>
wrote:

>
> On 02/03/2020 22:34, Claude Warren wrote:
> > So what we have then is:
> >
> > public interface BloomFilter {
> >
> >     int andCardinality(BloomFilter other);
> >
> >     int cardinality();
> >
> >     boolean contains(BloomFilter other);
> >
> >     boolean contains(Hasher hasher);
> >
> >     long[] getBits();
> >
> >     // Change
> >     PrimitiveIterator.OfInt iterator();
> >
> >     Shape getShape();
> >
> >     // Change
> >     boolean merge(BloomFilter other);
> >
> >     // Change
> >     boolean merge(Hasher hasher);
> >
> >     int orCardinality(BloomFilter other);
> >
> >     int xorCardinality(BloomFilter other);
> > }
> >
> > public interface BitCountConsumer {
> >     boolean accept(int index, int count);
> > }
> >
> > public interface CountingBloomFilter extends BloomFilter {
> >
> >     // Change
> >     boolean remove(BloomFilter other);
> >
> >     // Change
> >     boolean remove(Hasher hasher);
> >
> >     // Change
> >     boolean add(CountingBloomFilter other);
> >
> >     // Change
> >     boolean subtract(CountingBloomFilter other);
> >
> >     // Change
> >     void getCounts(BitCountConsumer consumer);
> > }
> >
> > public final class StaticHasher implements Hasher {
> >
> >     public StaticHasher(final Hasher hasher, final Shape shape) { ... }
> >
> >     public StaticHasher(final Iterator<Integer> iter, final Shape shape) { ... }
> >
> >     public StaticHasher(final Collection<Integer> collection, final Shape shape) { ... }
> >
> >     public Shape getShape() { ... }
> >
> >     public int size() { ... }
> >
> >     // Change
> >     public int max() { ... }
> > }
> >
> > public final class FixedHasher implements Hasher {
> >
> >     public FixedHasher(final Hasher hasher, final Shape shape) { ... }
> >
> >     public FixedHasher(final Iterator<Integer> iter, final Shape shape) { ... }
> >
> >     public FixedHasher(final Collection<Integer> collection, final Shape shape) { ... }
> >
> >     public Shape getShape() { ... }
> >
> >     public int size() { ... }
> >
> >     public int max() { ... }
> > }
> >
> > I have a pull request in to add a CachingHasher.
> >
> > While we are at it, I think we should change Hasher.getBits(Shape) to
> > Hasher.iterator(Shape) to match BloomFilter.iterator() and to limit
> > confusion between Hasher.getBits(Shape) (which returns an iterator) and
> > BloomFilter.getBits() (which returns a long[]).
> >
> > The interface changes look good to me.  I agree with your implementation
> > ideas.
> >
> > I do have one use case where I need the BloomFilter.iterator() results in
> > reverse order (I am estimating Log2 of the filter), but I can do that with
> > the methods here.  I don't think any of these changes significantly impact
> > the uses that I currently have.
> >
> > Shall we move forward?
> >
> > Claude
>
> I finished a first draft of the CountingBloomFilter. I still have the
> tests to add for the new API.
>
> When working with this I had to write an Iterator (to create a Hasher
> for the soon-to-be-changed API) which has very few lines of code, most
> of which are repeats due to the separation of hasNext() and nextInt()
> and the requirement to be able to call hasNext() repeatedly without
> calling nextInt(). This mandates storing the next value rather than a
> current position from which to search for a non-zero index:
>
>      // An inner class of the filter: 'counts' is the filter's int[] array.
>      class IndexIterator implements PrimitiveIterator.OfInt {
>          int next;
>
>          IndexIterator() {
>              advance(0);
>          }
>
>          private void advance(int from) {
>              next = from;
>              while (next < counts.length && counts[next] == 0) {
>                  next++;
>              }
>          }
>
>          @Override
>          public boolean hasNext() {
>              return next < counts.length;
>          }
>
>          @Override
>          public int nextInt() {
>              if (hasNext()) {
>                  final int result = next;
>                  advance(next + 1);
>                  return result;
>              }
>              throw new NoSuchElementException();
>          }
>      }
>
> This is noted in the Spliterator javadoc:
>
>   * <p>Spliterators, like {@code Iterator}s, are for traversing the elements
>   * of a source.  The {@code Spliterator} API was designed to support
>   * efficient parallel traversal in addition to sequential traversal, by
>   * supporting decomposition as well as single-element iteration.  In
>   * addition, the protocol for accessing elements via a Spliterator is
>   * designed to impose smaller per-element overhead than {@code Iterator},
>   * and to avoid the inherent race involved in having separate methods for
>   * {@code hasNext()} and {@code next()}.
>
> The same code written as a Spliterator has a much simpler tryAdvance
> method with only a single comparison of the current index to the upper
> limit per access of each element.
>
>      class IndexSpliterator implements Spliterator.OfInt {
>          int index;
>
>          @Override
>          public long estimateSize() {
>              return Long.MAX_VALUE;
>          }
>
>          @Override
>          public int characteristics() {
>              return Spliterator.SORTED | Spliterator.ORDERED |
>                     Spliterator.DISTINCT | Spliterator.NONNULL;
>          }
>
>          @Override
>          public OfInt trySplit() {
>              return null;
>          }
>
>          @Override
>          public boolean tryAdvance(IntConsumer action) {
>              while (index < counts.length) {
>                  final int i = index++;
>                  if (counts[i] != 0) {
>                      // Pass the index of the non-zero count, matching the
>                      // Iterator version above.
>                      action.accept(i);
>                      return true;
>                  }
>              }
>              return false;
>          }
>      }
>
> Or with splitting allowed (because we can). The tryAdvance() is the same,
> just with a different end point:
>
>      class IndexSpliterator implements Spliterator.OfInt {
>          int index;
>          final int end;
>
>          IndexSpliterator() {
>              this(0, counts.length);
>          }
>
>          IndexSpliterator(int start, int end) {
>              index = start;
>              this.end = end;
>          }
>
>          @Override
>          public long estimateSize() {
>              // This is an approximation.
>              // A Bloom filter that is at capacity will have approximately
>              // half of the bits set.
>              return (end - index) / 2;
>          }
>
>          @Override
>          public int characteristics() {
>              return Spliterator.SORTED | Spliterator.ORDERED |
>                     Spliterator.DISTINCT | Spliterator.NONNULL;
>          }
>
>          @Override
>          public OfInt trySplit() {
>              final int middle = (index + end) >>> 1;
>              if (middle > index) {
>                  final int start = index;
>                  index = middle;
>                  return new IndexSpliterator(start, middle);
>              }
>              return null;
>          }
>
>          @Override
>          public boolean tryAdvance(IntConsumer action) {
>              while (index < end) {
>                  final int i = index++;
>                  if (counts[i] != 0) {
>                      action.accept(i);
>                      return true;
>                  }
>              }
>              return false;
>          }
>      }
>
> What do you think about changing the use of Iterator to Spliterator for
> efficiency purposes?
>
> This would add:
>
> BloomFilter {
>      Spliterator.OfInt spliterator();
> }
>
> Hasher {
>      Spliterator.OfInt spliterator(Shape shape);
> }
>
> CountingBloomFilter extends BloomFilter {
>      // ???
>      Spliterator<int[]> countSpliterator();
> }
>
> The final one is for completeness so you can use the streams API with
> the counts. I don't really like it. I'd prefer to pack the index and
> count into a long as stated previously to maximise efficiency. My
> suggestion would be to not have a spliterator for the counts and leave
> the access of <index, count> pairs to an explicit forEach consumer.
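>
> For illustration, a minimal sketch of the packed-long idea, in the same
> style as the sketches above (assuming the filter's int[] counts field;
> the class name is mine):
>
>      // Hypothetical: streams <index, count> pairs packed into a long,
>      // index in the high 32 bits and count in the low 32 bits.
>      class PackedCountSpliterator implements Spliterator.OfLong {
>          int index;
>
>          @Override
>          public long estimateSize() {
>              return counts.length - index;
>          }
>
>          @Override
>          public int characteristics() {
>              return Spliterator.ORDERED | Spliterator.DISTINCT |
>                     Spliterator.NONNULL;
>          }
>
>          @Override
>          public Spliterator.OfLong trySplit() {
>              return null;
>          }
>
>          @Override
>          public boolean tryAdvance(LongConsumer action) {
>              while (index < counts.length) {
>                  final int i = index++;
>                  if (counts[i] != 0) {
>                      action.accept(((long) i << 32) | (counts[i] & 0xFFFFFFFFL));
>                      return true;
>                  }
>              }
>              return false;
>          }
>      }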
>
> Most of the current classes use a collection that provides a spliterator
> so the change is easy. The DynamicHasher Iterator would have to be
> rewritten but the Spliterator is simpler and all the code is already there.
>
> I am not sold on adding a default stream method as is done in the
> java.util.Collections API:
>
>      default IntStream stream() {
>          return StreamSupport.stream(spliterator(), false);
>      }
>
> What do you think? I hope this is not going full circle back to an
> earlier API which was previously discussed. My aim is to make the API as
> efficient as possible and so starting with Spliterator rather than
> Iterator seems to be a better approach.
> >
> >
> > On Mon, Mar 2, 2020 at 6:02 PM Alex Herbert <alex.d.herb...@gmail.com>
> > wrote:
> >
> >> On 02/03/2020 16:12, Claude Warren wrote:
> >>> Does getCounts() return a snapshot of the values when the call was
> >>> made, or does it return values that may be updated during the
> >>> retrieval?  If there are 2 threads (one reading counts and one doing a
> >>> merge) it seems to me that the "iterate over the data without
> >>> constructing objects" approach means that the data may not be
> >>> internally consistent.  But then we don't really check that issue in
> >>> the other methods either.  So let's go with the fail-fast
> >>> BitCountConsumer approach.
> >> My assumption on this is that no filter is safe for concurrent use.
> >> Allowing a merge while reading the state is a concurrency issue. I don't
> >> think we want to go down the route of having fail-fast concurrency
> >> checking using a 'modified count' strategy, i.e. all operations save the
> >> current 'modified count' at the start and increment it at the end, if it
> >> changes before the end then there has been concurrent modification. It
> >> would be simpler to state that no filters are designed to be used
> >> concurrently.
> >>
> >>> The StaticHasher is used for two purposes:
> >>>
> >>> 1) getting a list of enabled bits.
> >>> 2) creating a Bloom filter from a list of enabled bits.  There was a
> >>> request early on for the ability to test membership in a Bloom filter
> >>> without having to construct a Bloom filter.  Basically, the developer
> >>> has a list of values and wants to check membership.
> >> I think the issue is that for 1 and 2 there are a few scenarios with
> >> different requirements since the list of enabled bits can be small or
> >> big.
> >>
> >> One feature of the StaticHasher is the requirement to remove duplicates
> >> and return a sorted order. This allows for an efficient storage form
> >> factor with the intention that it may be around for a while. But it is
> >> an overhead when creating a large one from a known set of non-duplicate
> >> bits, i.e. a Bloom filter.
> >>
> >> For (2) if you want to do a dynamic check to test membership then you
> >> use BloomFilter.contains(Hasher). I would pass an on-the-fly Hasher for
> >> this and not construct the entire set of indices into a StaticHasher
> >> anyway.
> >>
> >> This would be done using the DynamicHasher to build hashes on-the-fly,
> >> which may have duplicates. For the use case where you already know the
> >> bit indices but do not want to create a BloomFilter or a StaticHasher
> >> then perhaps we need another Hasher that is a wrapper around an int[]
> >> that allows duplicates.
> >>
> >> // Returns an unordered set of values that may contain duplicates.
> >>
> >>       static Hasher wrap(final int[] values, final Shape shape) {
> >>           Objects.requireNonNull(values);
> >>           Objects.requireNonNull(shape);
> >>           return new Hasher() {
> >>               @Override
> >>               public OfInt getBits(Shape sameShape) {
> >>                   if (!shape.equals(sameShape)) {
> >>                       throw new IllegalArgumentException();
> >>                   }
> >>                   return Arrays.stream(values).iterator();
> >>               }
> >>
> >>               @Override
> >>               public HashFunctionIdentity getHashFunctionIdentity() {
> >>                   return shape.getHashFunctionIdentity();
> >>               }
> >>
> >>               @Override
> >>               public boolean isEmpty() {
> >>                   return values.length == 0;
> >>               }
> >>           };
> >>       }
> >>
> >>
> >>> It might make more sense to reimplement the StaticHasher using a long[]
> >>> internally (like the BloomFilter.getBits() method returns).  We have the
> >>> shape so we know the maximum length and we could trim it after loading
> >>> to remove trailing zero-value longs.  This would remove the necessity of
> >>> converting the iterator into a set of some sort.  We can do the limit
> >>> checks quickly as we turn bits on.
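> >>>
> >>> A minimal sketch of that idea, assuming bounds checking against the
> >>> shape (the method name is illustrative):
> >>>
> >>>      static long[] toBitArray(PrimitiveIterator.OfInt iter, Shape shape) {
> >>>          // One long per 64 bits of the shape.
> >>>          final long[] bits = new long[(shape.getNumberOfBits() + 63) / 64];
> >>>          while (iter.hasNext()) {
> >>>              final int index = iter.nextInt();
> >>>              if (index < 0 || index >= shape.getNumberOfBits()) {
> >>>                  throw new IllegalArgumentException("Index out of range: " + index);
> >>>              }
> >>>              // A long shift uses only the low 6 bits of the shift
> >>>              // amount, so this sets bit index % 64 in word index / 64.
> >>>              bits[index >>> 6] |= 1L << index;
> >>>          }
> >>>          return bits;
> >>>      }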
> >> For small numbers of indices the long[] format is likely to be less
> >> space efficient.
> >>> Makes me think we might need to implement StandardBloomFilter to use
> >>> long[] as well.
> >> That is just the BitSetBloomFilter. So if you use this format in the
> >> StaticHasher then you have no advantage over actually creating a
> >> BitSetBloomFilter.
> >>
> >> The StaticHasher is returned by the BloomFilter interface. You also have
> >> a method to return the bits using the long[] format. So there is
> >> redundancy here. To drop the redundancy I think you separate the uses
> >> for the canonical representation of the entire filter as a long[] and
> >> the uses for the set bits. If the latter is satisfied by the StaticHasher
> >> then it should be replaced by an iterator instead.
> >>
> >> Looking at the code the getHasher() method in BloomFilter is only used
> >> in HasherBloomFilter. This is to add more enabled bits to the current
> >> set it contains. It seems that this use case is better served by using a
> >> Set as the backing storage.
> >>
> >> This leaves the StaticHasher with no usage in the current library. It is
> >> not as fast as wrapping a provided array. It offers no expansion of
> >> storage for use in the HasherBloomFilter. It does however provide the
> >> smallest form for a compact BloomFilter representation. So it could be
> >> renamed to ImmutableHasher.
> >>
> >> I would think it best to consolidate all of this:
> >>
> >> - StaticHasher to stay as an immutable instance of sorted bit indices.
> >> The intention is an efficient compact form for a few bits of a
> >> BloomFilter that can be merged efficiently. It could be used with
> >> pre-existing indices but would have to eliminate duplicates and sort
> >> them.
> >> - Remove the redundant StaticHasher constructor accepting a StaticHasher
> >> instance since they are immutable.
> >> - Possibly add a method to StaticHasher to merge another StaticHasher to
> >> create a new instance.
> >> - Possibly add a method to StaticHasher to get the maximum index in the
> >> set (it already has size()). This helps convert it to a long[] or any
> >> other class that will call getBits() and use the iterator output.
> >>
> >> - BloomFilter to remove the getHasher() method. The long[] getBits()
> >> method to be documented as returning the canonical format of the filter.
> >> The getHasher() method is just duplicating this information in another
> >> format.
> >> - BloomFilter to have an iterator() method to get a
> >> PrimitiveIterator.OfInt. This replaces the getHasher() method which just
> >> provided access to the Shape and the PrimitiveIterator.OfInt of the
> >> Hasher. BloomFilter already has getShape() so it just needs an iterator.
> >>
> >> - HasherBloomFilter to use the iterator() method when merging a generic
> >> BloomFilter. It can detect instances of HasherBloomFilter and switch to
> >> a merge operation of its internal representation.
> >> - HasherBloomFilter to use different backing storage (e.g. a
> >> Set<Integer>) or adapt StaticHasher to be more efficient for merge.
> >>
> >> - Add a new class akin to StaticHasher that allows duplicates. Document
> >> it as a fast way to create a few bits of a BloomFilter.
> >>
> >> You then have:
> >>
> >> Hasher
> >>
> >> - StaticHasher - immutable set of ordered non-duplicate bits. Can be
> >> merged with another instance.
> >>
> >> - DynamicHasher - generates bits on the fly using a hashing function
> >>
> >> - FixedHasher - an array of indices. Can wrap an input array or be
> >> created using constructors similar to StaticHasher but without the
> >> guarantees of non-duplicates.
> >>
> >>
> >> Lots to think about there.
> >>
> >>> Claude
> >>>
> >>> On Mon, Mar 2, 2020 at 1:12 PM Alex Herbert <alex.d.herb...@gmail.com>
> >>> wrote:
> >>>
> >>>> On 02/03/2020 11:32, Claude Warren wrote:
> >>>>> my thought on changing the BloomFilter.merge() to return a boolean is
> >>>>> along the lines you had: return true on successful merge (even if
> >>>>> there are no changes in the enabled bits). And yes, for most cases the
> >>>>> standard Bloom filter will return true, but the change is really to
> >>>>> support extensions to the Bloom filter, so I think it is reasonable.
> >>>>>
> >>>>> As for getCounts(), suppose we split it into two methods:
> >>>>>
> >>>>>        // Return the indexes that are enabled.  This is equivalent to
> >>>>>        // CountingBloomFilter.getHasher().getBits(CountingBloomFilter.getShape());
> >>>>>        PrimitiveIterator.OfInt getIndexes();
> >>>>>        // Get the count for the specific index.
> >>>>>        int getCount(int index);
> >>>>>
> >>>>> With these methods it becomes possible to construct an iterator of
> >>>>> int[] or Map.Entry<Integer,Integer> or whatever else the developer
> >>>>> wants.
> >>>> Having to call getCount() for every index is likely to be slower than
> >>>> all the garbage collection involved with iterating a disposable
> >>>> <index, count> pair. For an array backed storage the access will be
> >>>> order(1). But other storage may be a lot slower. This method ideally
> >>>> should traverse the storage once and make each <index, count> pair
> >>>> available. Ideally this would be without creating each <index, count>
> >>>> pair as a disposable object.
> >>>>
> >>>> Following on from previously, another disadvantage of the consumer
> >>>> approach is the lack of fast fail. You can remedy this by adding a
> >>>> boolean return from the consumer to indicate if you want to consume
> >>>> more items:
> >>>>
> >>>> interface BitCountConsumer {
> >>>>        boolean accept(int index, int count);
> >>>> }
> >>>>
> >>>> A CountingBloomFilter implementation then has something like:
> >>>>
> >>>>        void getCounts(BitCountConsumer consumer) {
> >>>>            while(hasNext()) {
> >>>>                next();
> >>>>                if (!consumer.accept(index(), count())) {
> >>>>                    break;
> >>>>                }
> >>>>            }
> >>>>        }
> >>>>
> >>>> Looking at the getHasher().getBits(...) idea I noticed that
> >>>> StaticHasher is constructed using:
> >>>>
> >>>> Iterator<Integer> iter
> >>>>
> >>>> not:
> >>>>
> >>>> PrimitiveIterator.OfInt
> >>>>
> >>>> I'd prefer the primitive specialisation in this constructor. It would
> >>>> break the HasherBloomFilter merge function but that can be fixed.
> >>>> However it may be redundant. The StaticHasher has some oddities with
> >>>> the available constructors:
> >>>>
> >>>> public StaticHasher(final StaticHasher hasher, final Shape shape)
> >>>>
> >>>> - why have a shape parameter? The constructor checks the shape is the
> >>>> same as that for the hasher. So why not just a plain copy constructor:
> >>>>
> >>>> public StaticHasher(final StaticHasher hasher)
> >>>>
> >>>> This constructor is only used in the unit tests. Given that the
> >>>> StaticHasher is immutable, a copy constructor seems odd. I think it
> >>>> should be dropped.
> >>>>
> >>>>
> >>>> Here is a constructor that is missing, IMO:
> >>>>
> >>>> public StaticHasher(final Set<Integer> indices, final Shape shape)
> >>>>
> >>>> - Could be used to generically create a static hasher. The indices are
> >>>> ensured to be unique by the Set but should be checked to be within the
> >>>> shape.
> >>>>
> >>>>
> >>>> With these constructors the public StaticHasher(final Iterator<Integer>
> >>>> iter, final Shape shape) is only used by the BitSetBloomFilter:
> >>>>
> >>>>        public StaticHasher getHasher() {
> >>>>            return new StaticHasher(bitSet.stream().iterator(), getShape());
> >>>>        }
> >>>>
> >>>> Becomes:
> >>>>
> >>>>        public StaticHasher getHasher() {
> >>>>            return new StaticHasher(
> >>>>                bitSet.stream().boxed().collect(Collectors.toCollection(TreeSet::new)),
> >>>>                getShape());
> >>>>        }
> >>>>
> >>>> This is a very roundabout way to construct a StaticHasher. Perhaps a
> >>>> constructor accepting a BitSet would be useful?
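> >>>>
> >>>> A minimal sketch of such a constructor, assuming the hasher stores its
> >>>> shape and a sorted int[] of indices (the field names are mine):
> >>>>
> >>>>        public StaticHasher(final BitSet indices, final Shape shape) {
> >>>>            this.shape = shape;
> >>>>            // BitSet.stream() emits the set bits in sorted order with
> >>>>            // no duplicates, so no de-duplication pass is required.
> >>>>            this.values = indices.stream().toArray();
> >>>>            if (values.length != 0 && values[values.length - 1] >= shape.getNumberOfBits()) {
> >>>>                throw new IllegalArgumentException("Bit index out of range for the shape");
> >>>>            }
> >>>>        }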
> >>>>
> >>>> So the API only requires:
> >>>>
> >>>> public StaticHasher(final Hasher hasher, final Shape shape)
> >>>> public StaticHasher(final Set<Integer> indices, final Shape shape)
> >>>> // Maybe
> >>>> public StaticHasher(final BitSet indices, final Shape shape)
> >>>>
> >>>>
> >>>> I wondered about a merge function for the StaticHasher:
> >>>>
> >>>> public StaticHasher merge(final StaticHasher other)
> >>>>
> >>>> The merge function can make some assumptions about the two arrays to
> >>>> merge as the class is final and the values are verified to be sorted
> >>>> and within the Shape. However there is not a use for it other than in
> >>>> the merge of two HasherBloomFilters, which is not a recommended
> >>>> operation anyway.
> >>>>
> >>>>
> >>>>> Claude
> >>>>>
> >>>>> On Mon, Mar 2, 2020 at 10:48 AM Alex Herbert <alex.d.herb...@gmail.com>
> >>>>> wrote:
> >>>>>
> >>>>>> On 02/03/2020 09:38, Claude Warren wrote:
> >>>>>>> It is not too late to update the BloomFilter interface to have the
> >>>>>>> merge return a boolean.  The incorrect Shape would still throw an
> >>>>>>> exception, so the return value would only come into play if the bits
> >>>>>>> could not be set.
> >>>>>>>
> >>>>>>> Thoughts?
> >>>>>> I don't see the harm in it. But what would the return value be for?
> >>>>>>
> >>>>>> For a standard collection it would be whether the collection was
> >>>>>> changed by the operation:
> >>>>>>
> >>>>>> Collection.add/remove return "true if this collection changed as a
> >>>>>> result of the call"
> >>>>>>
> >>>>>> So here is the equivalent:
> >>>>>>
> >>>>>> return "true if this filter was changed as a result of the call"
> >>>>>>
> >>>>>> This is computationally slow to track. It is also confusing to return
> >>>>>> false when the filter was successfully merged but no bits were
> >>>>>> changed, since the filter was actually incorporated. So it would go
> >>>>>> along the lines that we discussed for the counting Bloom filter:
> >>>>>>
> >>>>>> return "true if this filter was successfully merged as a result of
> >>>>>> the call"
> >>>>>>
> >>>>>> For most cases in the current library it would be true when an
> >>>>>> exception is not thrown. However the merge of the counting Bloom
> >>>>>> filter may have reason to return false, e.g. overflow.
> >>>>>>
> >>>>>>> On Mon, Mar 2, 2020 at 7:56 AM Claude Warren <cla...@xenei.com> wrote:
> >>>>>>>> for the remove(), add(), and subtract() methods I agree that void is
> >>>>>>>> not correct and it should be boolean and be the same as the value
> >>>>>>>> you would get from calling isValid().
> >>>>>>>>
> >>>>>>>> You are correct that getCounts() should return an iterator of some
> >>>>>>>> type on int[]; I don't know why I thought long[].  I am happy with a
> >>>>>>>> plain Iterator<int[]> as the return.
> >>>>>> For the getCounts() method I am still looking for a way around having
> >>>>>> to create an <index, count> pair for everything in the filter. An
> >>>>>> alternative to an iterator is to use the consumer idea. Given there is
> >>>>>> no primitive specialisation of BiConsumer<T, U> in the JDK 8 functions
> >>>>>> we define our own:
> >>>>>>
> >>>>>> interface BitCountConsumer {
> >>>>>>         void accept(int index, int count);
> >>>>>> }
> >>>>>>
> >>>>>> The CountingBloomFilter then has:
> >>>>>>
> >>>>>>         void forEachCount(BitCountConsumer consumer);
> >>>>>>         // Or
> >>>>>>         void getCounts(BitCountConsumer consumer);
> >>>>>>
> >>>>>>
> >>>>>> You can then directly pass the counts from the backing storage to the
> >>>>>> destination.
> >>>>>>
> >>>>>> Advantages:
> >>>>>> - No object creation
> >>>>>> Disadvantages:
> >>>>>> - The counts cannot be streamed
> >>>>>>
> >>>>>> An alternative is to provide an Iterator of long with the index and
> >>>>>> count packed into a long with methods to extract them:
> >>>>>>
> >>>>>>        PrimitiveIterator.OfLong getCounts();
> >>>>>>
> >>>>>>        static int getIndex(long indexCountPair) {
> >>>>>>             return (int) (indexCountPair >>> 32);
> >>>>>>        }
> >>>>>>
> >>>>>>        static int getCount(long indexCountPair) {
> >>>>>>             return (int) indexCountPair;
> >>>>>>        }
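> >>>>>>
> >>>>>> Consuming it would be along these lines (a sketch; process() is a
> >>>>>> stand-in for whatever the caller does with each pair):
> >>>>>>
> >>>>>>        final PrimitiveIterator.OfLong iter = filter.getCounts();
> >>>>>>        while (iter.hasNext()) {
> >>>>>>            final long pair = iter.nextLong();
> >>>>>>            process(getIndex(pair), getCount(pair));
> >>>>>>        }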
> >>>>>>
> >>>>>> This will almost certainly be a cause for bugs/issues from users.
> >>>>>>
> >>>>>> I believe that the counts will be used for 2 main use cases:
> >>>>>>
> >>>>>> 1. Storage
> >>>>>>
> >>>>>> 2. Adding to another counting Bloom filter
> >>>>>>
> >>>>>> Both cases are likely to be done serially and not in parallel. So
> >>>>>> providing a consumer based API to receive the counts would work.
> >>>>>>
> >>>>>> WDYT?
> >>>>>>
> >>>>>>>> Claude
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> On Mon, Mar 2, 2020 at 1:02 AM Alex Herbert <alex.d.herb...@gmail.com>
> >>>>>>>> wrote:
> >>>>>>>>
> >>>>>>>>>> On 1 Mar 2020, at 15:39, Claude Warren <cla...@xenei.com> wrote:
> >>>>>>>>>>
> >>>>>>>>>> I think the CountingBloomFilter interface needs to extend
> >>>>>>>>>> BloomFilter.
> >>>>>>>>> I said that but did not write it, sorry.
> >>>>>>>>>
> >>>>>>>>>> I think I am confused.
> >>>>>>>>>>
> >>>>>>>>>> I would expect CountingBloomFilter to have
> >>>>>>>>>>
> >>>>>>>>>> interface CountingBloomFilter extends BloomFilter {
> >>>>>>>>>>        // these 2 methods are the reverse of merge()
> >>>>>>>>>>        void remove( BloomFilter );
> >>>>>>>>>>        void remove( Hasher );
> >>>>>>>>> Fine. Same intention but different method names. But why void? It
> >>>>>>>>> forces you to check if the remove was valid with a second call. On
> >>>>>>>>> the plus side it matches the void merge(…) methods and in most
> >>>>>>>>> cases a user would not care to check anyway. If they are
> >>>>>>>>> controlling the filter then leave it to them to make sure they do
> >>>>>>>>> not remove something they did not add.
> >>>>>>>>>
> >>>>>>>>>>        // these 2 methods are the addition/subtraction of counts
> >>>>>>>>>>        void add( CountingBloomFilter )
> >>>>>>>>>>        void subtract( CountingBloomFilter );
> >>>>>>>>> Fine. But same comment as above with the void return.
> >>>>>>>>>
> >>>>>>>>>>        // 2 methods to retrieve data
> >>>>>>>>>>        Stream<long[]> getCounts();
> >>>>>>>>> I don’t like this use of long[]. In my previous post I argued that
> >>>>>>>>> if you were to ever want to store more than max integer items in a
> >>>>>>>>> filter then the Bloom filter would have more bit indices than max
> >>>>>>>>> integer. So we never have to support long counts. A filter that
> >>>>>>>>> exceeds max integer for a count is highly likely to be saturated
> >>>>>>>>> and no use as a filter anyway.
> >>>>>>>>>
> >>>>>>>>> For most backing implementations the object type of the stream
> >>>>>>>>> will be different so you will have to write a Spliterator<T>
> >>>>>>>>> implementation or wrap some iterator anyway. So why not return the
> >>>>>>>>> Spliterator:
> >>>>>>>>>
> >>>>>>>>>        Spliterator<int[]> getCounts();
> >>>>>>>>>
> >>>>>>>>> Since the backing implementation will likely not store int[] pairs
> >>>>>>>>> then this will have a lot of object creation and garbage collection
> >>>>>>>>> overhead to go through the counts. This does not seem to be a big
> >>>>>>>>> concern here if the purpose is the same as for the BloomFilter
> >>>>>>>>> long[] getBits(), i.e. to get a canonical representation for
> >>>>>>>>> storage.
> >>>>>>>>>
> >>>>>>>>> Note: The Spliterator may not have a known size (number of
> >>>>>>>>> non-zero bits) at creation, for example if the counts are stored in
> >>>>>>>>> a fixed size array. Thus efficient parallel traversal by binary
> >>>>>>>>> tree splitting is limited by how evenly the counts are distributed.
> >>>>>>>>> For a backing implementation using a collection then the size
> >>>>>>>>> should be known. In this case a Spliterator would be of more use
> >>>>>>>>> than a plain Iterator. You can convert one to the other using:
> >>>>>>>>>
> >>>>>>>>> java.util.Spliterators:
> >>>>>>>>> public static <T> Iterator<T> iterator(Spliterator<? extends T> spliterator)
> >>>>>>>>> public static <T> Spliterator<T> spliterator(Iterator<? extends T> iterator,
> >>>>>>>>>                                              long size,
> >>>>>>>>>                                              int characteristics)
> >>>>>>>>> So which to choose for the API?
> >>>>>>>>>
> >>>>>>>>> The Hasher currently uses an Iterator:
> >>>>>>>>>
> >>>>>>>>> PrimitiveIterator.OfInt getBits(Shape shape);
> >>>>>>>>>
> >>>>>>>>> In the case of a StaticHasher this could return a spliterator. But
> >>>>>>>>> the DynamicHasher needs a reworking of the internal Iterator class.
> >>>>>>>>> It could be a Spliterator to use the new IntConsumer API but in
> >>>>>>>>> most (all) cases splitting would not be possible for dynamic
> >>>>>>>>> hashing as the parts are produced in order. It is likely that they
> >>>>>>>>> will be consumed sequentially too.
> >>>>>>>>>
> >>>>>>>>> I would suggest that Spliterator is the more modern implementation,
> >>>>>>>>> despite not always being applicable to parallelisation in a stream.
> >>>>>>>>> Currently the iterator from the Hasher is used in
> >>>>>>>>> forEachRemaining() and while loops in approximately equal measure.
> >>>>>>>>> The while loops are for a fast exit and would be uglier if
> >>>>>>>>> rewritten for a Spliterator.tryAdvance(IntConsumer) syntax.
> >>>>>>>>>
> >>>>>>>>> There is a use of the IteratorChain in HasherBloomFilter that would
> >>>>>>>>> need a rethink for spliterators.
> >>>>>>>>>
> >>>>>>>>> The path of least resistance is to use Iterator<int[]> for the API
> >>>>>>>>> of CountingBloomFilter to be consistent with Hasher’s use of
> >>>>>>>>> Iterator.
> >>>>>>>>>
> >>>>>>>>> WDYT?
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>>>        boolean isValid()
> >>>>>>>>> Fine. Allows some level of feedback that the counts are corrupt.
> >>>>>>>>>
> >>>>>>>>>> }
> >>>>>>>>>>
> >>>>>>>>>> Claude
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>> On Sun, Mar 1, 2020 at 2:48 PM Alex Herbert <alex.d.herb...@gmail.com>
> >>>>>>>>>> wrote:
> >>>>>>>>>>
> >>>>>>>>>>>> On 1 Mar 2020, at 09:28, Claude Warren <cla...@xenei.com> wrote:
> >>>>>>>>>>>> The idea of a backing array is fine and the only problem I see
> >>>>>>>>>>>> with it is in very large filters (on the order of 10^8 bits and
> >>>>>>>>>>>> larger) but document the size calculation and let the developer
> >>>>>>>>>>>> worry about it.
> >>>>>>>>>>> Let us look at the use case where we max out the array. Using the
> >>>>>>>>>>> Bloom filter calculator:
> >>>>>>>>>>>
> >>>>>>>>>>> n = 149,363,281
> >>>>>>>>>>> p = 0.001000025 (1 in 1000)
> >>>>>>>>>>> m = 2147483647 (256MiB)
> >>>>>>>>>>> k = 10
> >>>>>>>>>>>
> >>>>>>>>>>> n = 74,681,641
> >>>>>>>>>>> p = 0.000001 (1 in 999950)
> >>>>>>>>>>> m = 2147483647 (256MiB)
> >>>>>>>>>>> k = 20
> >>>>>>>>>>>
> >>>>>>>>>>> n = 49,787,761
> >>>>>>>>>>> p = 0.000000001 (1 in 999924899)
> >>>>>>>>>>> m = 2147483647 (256MiB)
> >>>>>>>>>>> k = 30
> >>>>>>>>>>>
> >>>>>>>>>>> So you will be able to put somewhere in the order of 10^8 or
> >>>>>>>>>>> 10^7 items into the filter. I would say that anyone putting more
> >>>>>>>>>>> than that into the filter has an unusual use case. The
> >>>>>>>>>>> CountingBloomFilter can throw an exception if m is too large and
> >>>>>>>>>>> will throw an OutOfMemoryError if you cannot allocate an array
> >>>>>>>>>>> large enough.
> >>>>>>>>>>>
> >>>>>>>>>>> One clear point here is that you cannot use a short as a 16-bit
> >>>>>>>>>>> count would easily overflow. So you have to use an integer array
> >>>>>>>>>>> for the counts. A maximum length int[] is roughly 8GB.
> >>>>>>>>>>>
> >>>>>>>>>>> What would another implementation cost in terms of memory? The
> >>>>>>>>>>> TreeMap<Integer, Integer> was the most space efficient. In the
> >>>>>>>>>>> previous e-mail the saturation of a Bloom filter's bits was
> >>>>>>>>>>> approximately 50% when at the intended capacity. So we have to
> >>>>>>>>>>> estimate the size of a TreeMap containing Integer.MAX_VALUE/2
> >>>>>>>>>>> indices ~ 2^30. The memory test shows the TreeMap memory scales
> >>>>>>>>>>> linearly with entries:
> >>>>>>>>>>>
> >>>>>>>>>>>  32768 /  65536 (0.500) : TreeMap<Integer, Integer> = 1834061 bytes
> >>>>>>>>>>>  65536 / 131072 (0.500) : TreeMap<Integer, Integer> = 3669080 bytes
> >>>>>>>>>>> 131072 / 262144 (0.500) : TreeMap<Integer, Integer> = 7339090 bytes
> >>>>>>>>>>>
> >>>>>>>>>>> So what is the memory for a TreeMap with 2^30 indices? I make it
> >>>>>>>>>>> about:
> >>>>>>>>>>>
> >>>>>>>>>>> (2^30 / 131,072) * 7,339,090 bytes ~ 6e10 bytes = 55.99 GB
> >>>>>>>>>>>
> >>>>>>>>>>> I would say that this amount of RAM is unusual. It is definitely
> >>>>>>>>>>> not as efficient as using an array. So very large counting Bloom
> >>>>>>>>>>> filters are going to require some thought as to the hardware they
> >>>>>>>>>>> run on. This may not be the case in 10 years time.
> >>>>>>>>>>>
> >>>>>>>>>>> I would say that we try an int[] backing array for the storage
> >>>>>>>>>>> implementation and document its limitations. A different
> >>>>>>>>>>> implementation could be provided in future if required.
> >>>>>>>>>>>
> >>>>>>>>>>> This could be done by making CountingBloomFilter an interface
> >>>>>>>>>>> that extends BloomFilter with the methods:
> >>>>>>>>>>>
> >>>>>>>>>>> subtract(BloomFilter filter)
> >>>>>>>>>>> subtract(Hasher filter)
> >>>>>>>>>>>
> >>>>>>>>>>> These will negate the effect of the corresponding
> >>>>>>>>>>> merge(BloomFilter) operation.
> >>>>>>>>>>>
> >>>>>>>>>>> Do we also need access to the counts and add/subtract of another
> >>>>>>>>>>> CountingBloomFilter?:
> >>>>>>>>>>>
> >>>>>>>>>>> add(CountingBloomFilter filter);
> >>>>>>>>>>> subtract(CountingBloomFilter filter);
> >>>>>>>>>>>
> >>>>>>>>>>> Iterator<int[]> getCounts();
> >>>>>>>>>>> int getSize(); // Number of items added
> >>>>>>>>>>>
> >>>>>>>>>>> The CountingBloomFilter is then an interface that defines how to
> >>>>>>>>>>> reverse the merge of some bits into the filter.
> >>>>>>>>>>>
> >>>>>>>>>>> My concern is the inefficiency of the creation of objects in any
> >>>>>>>>>>> method that provides access to the counts (e.g. above using an
> >>>>>>>>>>> iterator as for Hasher.getBits()). I presume this method would be
> >>>>>>>>>>> to allow some type of storage/serialisation of the filter akin to
> >>>>>>>>>>> the long[] getBits() method of BloomFilter. So it may be better
> >>>>>>>>>>> done using a method:
> >>>>>>>>>>>
> >>>>>>>>>>> int getCount(int index);
> >>>>>>>>>>>
> >>>>>>>>>>> The caller can then use long[] getBits() to get the indices set
> >>>>>>>>>>> in the filter and then for each non-zero bit index call
> >>>>>>>>>>> getCount(index). Or just not include the method as the counts are
> >>>>>>>>>>> only of concern when storing the filter. This functionality is
> >>>>>>>>>>> cleaner pushed into an implementation.
> >>>>>>>>>>> In a previous post we discussed whether to throw an exception on
> >>>>>>>>>>> overflow/underflow or raise an invalid flag. Using the invalid
> >>>>>>>>>>> flag idea the interface would be:
> >>>>>>>>>>>
> >>>>>>>>>>> interface CountingBloomFilter {
> >>>>>>>>>>>        int add(CountingBloomFilter filter);
> >>>>>>>>>>>        int subtract(BloomFilter filter);
> >>>>>>>>>>>        int subtract(Hasher filter);
> >>>>>>>>>>>        int subtract(CountingBloomFilter filter);
> >>>>>>>>>>>        int getStatus();
> >>>>>>>>>>>
> >>>>>>>>>>>        // Maybe
> >>>>>>>>>>>        int getSize();
> >>>>>>>>>>>        int getCount(int index);
> >>>>>>>>>>> }
> >>>>>>>>>>>
> >>>>>>>>>>> The status would be negative if any operation
> >>>>>>>>>>> overflowed/underflowed, or zero if OK. The current status is
> >>>>>>>>>>> returned by the add/subtract methods.
> >>>>>>>>>>>
> >>>>>>>>>>> However I note that overflow may not be a concern. The number of
> >>>>>>>>>>> items to add to a filter to create overflow would require a
> >>>>>>>>>>> filter with a number of bits that is unrealistic to store in
> >>>>>>>>>>> memory:
> >>>>>>>>>>>
> >>>>>>>>>>> n = 2147483647
> >>>>>>>>>>> p = 0.001000025 (1 in 1000)
> >>>>>>>>>>> m = 30875634182 (3.59GiB)
> >>>>>>>>>>> k = 10
> >>>>>>>>>>>
> >>>>>>>>>>> If you want to add 2 billion items (and overflow an integer
> >>>>>>>>>>> count) then your filter would be so big it would break the rest
> >>>>>>>>>>> of the API that uses a 32-bit int for the bit index.
> >>>>>>>>>>>
> >>>>>>>>>>> Thus only underflow is a realistic concern. This could be
> >>>>>>>>>>> documented as handled in an implementation-specific manner (i.e.
> >>>>>>>>>>> throw or ignore). The API is then simplified to:
> >>>>>>>>>>>
> >>>>>>>>>>> interface CountingBloomFilter {
> >>>>>>>>>>>        boolean add(CountingBloomFilter filter);
> >>>>>>>>>>>        boolean subtract(BloomFilter filter);
> >>>>>>>>>>>        boolean subtract(Hasher filter);
> >>>>>>>>>>>        boolean subtract(CountingBloomFilter filter);
> >>>>>>>>>>>        int getStatus();
> >>>>>>>>>>>
> >>>>>>>>>>>        // Maybe
> >>>>>>>>>>>        int getSize();
> >>>>>>>>>>>        int getCount(int index);
> >>>>>>>>>>> }
> >>>>>>>>>>>
> >>>>>>>>>>> The boolean is used to state that add/subtract did not
> >>>>>>>>>>> over/underflow. Implementations can throw if they require it.
> >>>>>>>>>>>
> >>>>>>>>>>> The question then becomes what getSize() represents if an
> >>>>>>>>>>> add/subtract did not execute cleanly. Under this scheme it would
> >>>>>>>>>>> be the number of (add - subtract) operations. The status flag
> >>>>>>>>>>> would be used to indicate if the size is valid, or any of the
> >>>>>>>>>>> counts from getCount(). The simpler API is to not allow access to
> >>>>>>>>>>> counts/size or adding/subtracting counts:
> >>>>>>>>>>>
> >>>>>>>>>>> interface CountingBloomFilter {
> >>>>>>>>>>>        boolean subtract(BloomFilter filter);
> >>>>>>>>>>>        boolean subtract(Hasher filter);
> >>>>>>>>>>>        int getStatus();
> >>>>>>>>>>>        // Or something like ...
> >>>>>>>>>>>        boolean isValid();
> >>>>>>>>>>> }
> >>>>>>>>>>>
> >>>>>>>>>>> This filter is then only concerned with reversing the merge of
> >>>>>>>>>>> Bloom filters, with a valid status flag to indicate that the
> >>>>>>>>>>> current state is consistent (i.e. all filters have been cleanly
> >>>>>>>>>>> merged/subtracted).
> >>>>>>>>>>>
> >>>>>>>>>>> WDYT?
> >>>>>>>>>>>
> >>>>>>>>>>>> As for the merge question: merge is a standard Bloom filter
> >>>>>>>>>>>> operation. It is well defined in the literature. Merging a Bloom
> >>>>>>>>>>>> filter into a counting Bloom filter means incrementing the bit
> >>>>>>>>>>>> counts. I think that merge/remove should continue to operate as
> >>>>>>>>>>>> though the parameter were a standard Bloom filter.
> >>>>>>>>>>>>
> >>>>>>>>>>> OK. So the count is to represent the number of filters that had a
> >>>>>>>>>>> bit set at that index. This makes it clearer.
> >>>>>>>>>>>
> >>>>>>>>>>>> We had spoken of implementing an adding/deleting method pair
> >>>>>>>>>>>> that would operate on CountingBloomFilters and would
> >>>>>>>>>>>> add/subtract the counts (e.g. add(CountingBloomFilter) and
> >>>>>>>>>>>> subtract(CountingBloomFilter)).
> >>>>>>>>>>>>
> >>>>>>>>>>>> I disagree with your proposal for the merge(Hasher)
> >>>>>>>>>>>> implementation, and I am not certain that an add(Hasher) makes
> >>>>>>>>>>>> sense. First consider that the Hasher returns the bits that are
> >>>>>>>>>>>> to be enabled in the Bloom filter, so collisions are expected.
> >>>>>>>>>>>> In the normal course of events a Hasher is used to create a
> >>>>>>>>>>>> normal Bloom filter where all the duplicates are removed. That
> >>>>>>>>>>>> filter is then merged into a CountingBloomFilter. So in some
> >>>>>>>>>>>> sense the Hasher and the normal Bloom filter are the same. So I
> >>>>>>>>>>>> would expect the merge of the Hasher and the merge of the normal
> >>>>>>>>>>>> Bloom filter created from that hasher into a CountingBloomFilter
> >>>>>>>>>>>> to yield the same result. If you wanted to add an
> >>>>>>>>>>>> add(Hasher)/delete(Hasher) pair of functions to a
> >>>>>>>>>>>> CountingBloomFilter you could implement them with duplicate
> >>>>>>>>>>>> counting, but I am not certain of the validity of such a count
> >>>>>>>>>>>> and I fear that it muddies the waters with respect to what the
> >>>>>>>>>>>> CountingBloomFilter is counting.
> >>>>>>>>>>> Agreed.
> >>>>>>>>>>>
> >>>>>>>>>>>> Claude
> >>>>>>>>>>>>
> >>>>>>>>>>>> On Sat, Feb 29, 2020 at 2:13 PM Alex Herbert <alex.d.herb...@gmail.com>
> >>>>>>>>>>>> wrote:
> >>>>>>>>>>>>
> >>>>>>>>>>>>>> On 29 Feb 2020, at 07:46, Claude Warren <cla...@xenei.com> wrote:
> >>>>>>>>>>>>>> Alex, would you take a look at pull request 131 [1]?  It adds
> >>>>>>>>>>>>>> a new hasher implementation and makes the HashFunctionValidator
> >>>>>>>>>>>>>> available for public use.
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> https://github.com/apache/commons-collections/pull/131
> >>>>>>>>>>>>> OK. I’ll take a look.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> I’ve been thinking about the counting Bloom filter and the
> >>>>>>>>>>>>> backing storage. In summary:
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> 1. The backing storage should be a fixed array.
> >>>>>>>>>>>>> 2. Merging a Hasher should count duplicate indices, not flatten
> >>>>>>>>>>>>> them all to a single count.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> For background I’ve used the standard formulas to estimate the
> >>>>>>>>>>>>> number of indices that will be non-zero in a Bloom filter. The
> >>>>>>>>>>>>> wikipedia page gives this formula for the expected number of
> >>>>>>>>>>>>> bits set to 0 (E(q)) if you have inserted i elements into a
> >>>>>>>>>>>>> filter of size m using k hash functions:
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> E(q) = (1 - 1/m)^(ki)
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> So a rough guess of the proportion of indices (bits) used by a
> >>>>>>>>>>>>> filter is 1 - E(q).
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> Here is a table of Bloom filters with different collision
> >>>>>>>>>>>>> probabilities and the proportion of bits that will be set when
> >>>>>>>>>>>>> 1%, 10%, 100% of the capacity of the filter has been met:
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> n       p       m       k       I       E(q)    bits
> >>>>>>>>>>>>> 1000    1E-04   19171   13      10      0.9932  0.0068
> >>>>>>>>>>>>> 1000    1E-04   19171   13      100     0.9344  0.0656
> >>>>>>>>>>>>> 1000    1E-04   19171   13      1000    0.5076  0.4924
> >>>>>>>>>>>>> 1000    1E-05   23963   17      10      0.9929  0.0071
> >>>>>>>>>>>>> 1000    1E-05   23963   17      100     0.9315  0.0685
> >>>>>>>>>>>>> 1000    1E-05   23963   17      1000    0.4919  0.5081
> >>>>>>>>>>>>> 1000    1E-06   28756   20      10      0.9931  0.0069
> >>>>>>>>>>>>> 1000    1E-06   28756   20      100     0.9328  0.0672
> >>>>>>>>>>>>> 1000    1E-06   28756   20      1000    0.4988  0.5012
> >>>>>>>>>>>>> 10000   1E-06   287552  20      100     0.9931  0.0069
> >>>>>>>>>>>>> 10000   1E-06   287552  20      1000    0.9328  0.0672
> >>>>>>>>>>>>> 10000   1E-06   287552  20      10000   0.4988  0.5012
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> The point is that if you create a Bloom filter and fill it to
> >>>>>>>>>>>>> 10% of the intended capacity the number of indices used will be
> >>>>>>>>>>>>> about 6-7% of the filter bits.
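> >>>>>>>>>>>>>
> >>>>>>>>>>>>> (A quick sanity check of one table row, as a sketch:
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>     double m = 19171, k = 13, i = 100;
> >>>>>>>>>>>>>     double eq = Math.pow(1 - 1 / m, k * i);
> >>>>>>>>>>>>>     System.out.println(eq);      // ~0.9344
> >>>>>>>>>>>>>     System.out.println(1 - eq);  // ~0.0656 of bits set
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> which matches the n=1000, p=1E-04, I=100 row above.)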
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> So how to store the counts? Currently the counting Bloom filter
> >>>>>>>>>>>>> uses a TreeMap<Integer, Integer>. I tried:
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> TreeMap<Integer, Integer>
> >>>>>>>>>>>>> HashMap<Integer, Integer>
> >>>>>>>>>>>>> TreeSet<MutableCount>
> >>>>>>>>>>>>> HashSet<MutableCount>
> >>>>>>>>>>>>> int[]
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> The MutableCount is a custom object that stores the bit index,
> >>>>>>>>>>>>> uses it for a hash code, and then has a mutable integer count
> >>>>>>>>>>>>> field. It allows the count to be incremented/decremented if the
> >>>>>>>>>>>>> object is in the set:
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>       static final class MutableCount implements Comparable<MutableCount> {
> >>>>>>>>>>>>>           final int key;
> >>>>>>>>>>>>>           int count;
> >>>>>>>>>>>>>           // etc
> >>>>>>>>>>>>>       }
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> This is adapted from the Bag<T> collection which stores an item
> >>>>>>>>>>>>> count with a MutableInteger. Here the mutable count is part of
> >>>>>>>>>>>>> the same object T inserted into the Set. So you can find the
> >>>>>>>>>>>>> object, change the count and not have to put it back into the
> >>>>>>>>>>>>> set. This is more efficient than changing the Integer stored in
> >>>>>>>>>>>>> a Map.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> I’ve estimated memory usage using an idea based on this article
> >>>>>>>>>>>>> from JavaWorld: Java Tip 130: Do you know your data size? [1].
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> Basically you:
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> - create an object and throw it away. All classes are then
> >>>>>>>>>>>>>   initialised.
> >>>>>>>>>>>>> - Then you free memory (run garbage collection) and get the
> >>>>>>>>>>>>>   current memory usage
> >>>>>>>>>>>>> - Then create a lot of your object (held in an array)
> >>>>>>>>>>>>> - Then measure memory usage again
> >>>>>>>>>>>>> - memory = (after - before) / count
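> >>>>>>>>>>>>>
> >>>>>>>>>>>>> In code the measurement step is roughly (a sketch):
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>     static long usedMemory() {
> >>>>>>>>>>>>>         final Runtime rt = Runtime.getRuntime();
> >>>>>>>>>>>>>         rt.gc();  // request a garbage collection first
> >>>>>>>>>>>>>         return rt.totalMemory() - rt.freeMemory();
> >>>>>>>>>>>>>     }
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>     // memory per object = (after - before) / count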
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> Here is some example output for n bits set in size m:
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> 13107 / 262144 (0.050) : TreeMap<Integer, Integer> =  733947 bytes
> >>>>>>>>>>>>> 26214 / 262144 (0.100) : TreeMap<Integer, Integer> = 1467866 bytes
> >>>>>>>>>>>>> 13107 / 262144 (0.050) : TreeSet<MutableCount>     =  838928 bytes
> >>>>>>>>>>>>> 26214 / 262144 (0.100) : TreeSet<MutableCount>     = 1677776 bytes
> >>>>>>>>>>>>> 13107 / 262144 (0.050) : HashMap<Integer, Integer> = 1677712 bytes
> >>>>>>>>>>>>> 26214 / 262144 (0.100) : HashMap<Integer, Integer> = 2306739 bytes
> >>>>>>>>>>>>> 13107 / 262144 (0.050) : HashSet<MutableCount>     = 1782664 bytes
> >>>>>>>>>>>>> 26214 / 262144 (0.100) : HashSet<MutableCount>     = 2516656 bytes
> >>>>>>>>>>>>>     0 / 262144 (0.000) : int[]                     = 1048608 bytes
> >>>>>>>>>>>>>     0 / 262144 (0.000) : short[]                   =  524320 bytes
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> The estimate is accurate to 0.0001% for the arrays so the
> >>>>>>>>>>>>> method is working. The HashMap was created with the capacity
> >>>>>>>>>>>>> set to the expected capacity of the filter (m).
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> I’ve chosen these sizes because at 5% full a HashSet is less
> >>>>>>>>>>>>> memory efficient than using a fixed size array, and at 10% the
> >>>>>>>>>>>>> TreeSet is also less efficient.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> Note that the java.util.Tree/HashSet versions just wrap a Map
> >>>>>>>>>>>>> and insert a dummy object for all keys in the Map. So here a
> >>>>>>>>>>>>> Set is not as efficient as a Map because in the Map test I
> >>>>>>>>>>>>> always inserted the same Integer object representing 1. This
> >>>>>>>>>>>>> would be the same as using a Set with an Integer key but here
> >>>>>>>>>>>>> the Set had to contain the MutableCount which has an extra int
> >>>>>>>>>>>>> field and is larger than an Integer.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> These data lead me to think that a counting Bloom filter should
> >>>>>>>>>>>>> just use a fixed size backing array:
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> - If created using the same Shape as a standard Bloom filter it
> >>>>>>>>>>>>> uses a fixed size. This has a high memory cost when the filter
> >>>>>>>>>>>>> is empty but when it exceeds 10% of the intended capacity it is
> >>>>>>>>>>>>> more efficient than a dynamic backing storage.
> >>>>>>>>>>>>> - All operations will operate in order(n) time for an operation
> >>>>>>>>>>>>> with another filter with n indices. Each individual index count
> >>>>>>>>>>>>> in the filter will have order(1) time for access/update.
> >>>>>>>>>>>>> Performance will be limited by the memory cache of the entire
> >>>>>>>>>>>>> array.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> The issue is that a counting Bloom filter with a very low
> >>>>>>>>>>>>> number of inserted items will be memory inefficient. But under
> >>>>>>>>>>>>> what circumstance will such a filter be used in a short-term
> >>>>>>>>>>>>> lifecycle? If it is simply to merge into another filter then
> >>>>>>>>>>>>> this can be done using a merge with a Hasher. If counts are to
> >>>>>>>>>>>>> be merged then perhaps we provide a method to merge counts
> >>>>>>>>>>>>> using the same data structure returned by the
> >>>>>>>>>>>>> CountingBloomFilter getCounts() method, e.g. using a stream of
> >>>>>>>>>>>>> <index,count> pairs:
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> Stream<int[]> getCounts();
> >>>>>>>>>>>>> void add(Stream<int[]> counts);
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> The issue here is the Shape and HashFunctionIdentity of the
> >>>>>>>>>>>>> origin of the merge cannot be validated. So just leave it out
> >>>>>>>>>>>>> and use the merge with a Hasher.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> Thus the next issue with the counting Bloom filter
> >>>>>>>>>>>>> implementation. Currently when it merges with a Hasher it puts
> >>>>>>>>>>>>> all the indices into a Set and so will only increment the count
> >>>>>>>>>>>>> by 1 for each index identified by the Hasher. This appears to
> >>>>>>>>>>>>> miss the entire point of the counting Bloom filter. If I hash
> >>>>>>>>>>>>> an object to generate k indices I would hope that I do get k
> >>>>>>>>>>>>> indices. But the hash may not be perfect and I may get [1, k]
> >>>>>>>>>>>>> indices with some duplications. This is part of the signature
> >>>>>>>>>>>>> of that object with the given hash. So surely a counting Bloom
> >>>>>>>>>>>>> filter should accommodate this. If my Hasher generates the same
> >>>>>>>>>>>>> index 20 times this should result in the count of that index
> >>>>>>>>>>>>> incrementing by 20.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> The result is that if an object is added directly to a counting
> >>>>>>>>>>>>> Bloom filter using a Hasher it will have a different result
> >>>>>>>>>>>>> than if added to a standard Bloom filter and then that filter
> >>>>>>>>>>>>> is added to the counting Bloom filter.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> Opinions on this?
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> Alex
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> [1] http://www.javaworld.com/javaworld/javatips/jw-javatip130.html
> >>>>>>>>>>>>>> On Tue, Feb 25, 2020 at 12:35 AM Alex Herbert <alex.d.herb...@gmail.com>
> >>>>>>>>>>>>>> wrote:
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> I have created a PR that contains most of the changes
> >>>>>>>>>>>>>>> discussed in this thread (see [1]).
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> Please review the changes and comment here or on GitHub.
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> I have left the CountingBloomFilter alone. I will reimplement
> >>>>>>>>>>>>>>> the add/subtract functionality as discussed in another PR.
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> Alex
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> [1] https://github.com/apache/commons-collections/pull/133
>


-- 
I like: Like Like - The likeliest place on the web
<http://like-like.xenei.com>
LinkedIn: http://www.linkedin.com/in/claudewarren
