So what we have then is:

public interface BloomFilter {

    int andCardinality(BloomFilter other);

    int cardinality();

    boolean contains(BloomFilter other);

    boolean contains(Hasher hasher);

    long[] getBits();

    // Change
    PrimitiveIterator.OfInt iterator();

    Shape getShape();

    // Change
    boolean merge(BloomFilter other);

    // Change
    boolean merge(Hasher hasher);

    int orCardinality(BloomFilter other);

    int xorCardinality(BloomFilter other);

}

public interface BitCountConsumer {
    boolean accept(int index, int count);
}

public interface CountingBloomFilter extends BloomFilter {

    // Change
    boolean remove(BloomFilter other);

    // Change
    boolean remove(Hasher hasher);

    // Change
    boolean add(CountingBloomFilter other);

    // Change
    boolean subtract(CountingBloomFilter other);

    // Change
    void getCounts(BitCountConsumer consumer);

}
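
As a usage sketch (assuming a CountingBloomFilter named "filter"), the
boolean returned from BitCountConsumer.accept() is what gives the
fail-fast behaviour:

    List<int[]> pairs = new ArrayList<>();
    filter.getCounts((index, count) -> {
        pairs.add(new int[] {index, count});
        return true; // return false to stop the traversal early
    });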

public final class StaticHasher implements Hasher {

    public StaticHasher(final Hasher hasher, final Shape shape) { ... }

    public StaticHasher(final Iterator<Integer> iter, final Shape shape) { ... }

    public StaticHasher(final Collection<Integer> collection, final Shape shape) { ... }

    public Shape getShape() { ... }

    public int size() { ... }

    // Change
    public int max() { ... }

}

public final class FixedHasher implements Hasher {

    public FixedHasher(final Hasher hasher, final Shape shape) { ... }

    public FixedHasher(final Iterator<Integer> iter, final Shape shape) { ... }

    public FixedHasher(final Collection<Integer> collection, final Shape shape) { ... }

    public Shape getShape() { ... }

    public int size() { ... }

    public int max() { ... }

}

I have a pull request in to add a CachingHasher.

While we are at it, I think we should rename Hasher.getBits(Shape) to
Hasher.iterator(Shape). That matches the new BloomFilter.iterator() and
avoids confusion between Hasher.getBits(Shape), which returns an
iterator, and BloomFilter.getBits(), which returns a long[].
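
That rename would give, as a sketch of the proposed signature (not the
current code):

public interface Hasher {

    // was: PrimitiveIterator.OfInt getBits(Shape shape);
    PrimitiveIterator.OfInt iterator(Shape shape);

}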

The interface changes look good to me.  I agree with your implementation
ideas.

I do have one use case where I need the BloomFilter.iterator() results in
reverse order (I am estimating the log2 of the filter), but I can do that
with the methods here.  I don't think any of these changes significantly
impacts the uses that I currently have.
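
For reference, the reverse traversal can be done with a sketch like this
(hypothetical helper code, materialising the ascending indices first):

    int[] indices = StreamSupport.intStream(
        Spliterators.spliteratorUnknownSize(filter.iterator(), Spliterator.ORDERED),
        false).toArray();
    for (int i = indices.length - 1; i >= 0; i--) {
        // consume indices[i], largest index first
    }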

Shall we move forward?

Claude


On Mon, Mar 2, 2020 at 6:02 PM Alex Herbert <alex.d.herb...@gmail.com>
wrote:

>
> On 02/03/2020 16:12, Claude Warren wrote:
> > Does getCounts() return a snapshot of the values when the call was made
> > or does it return values that may be updated during the retrieval?  If
> > there are 2 threads (one reading counts and one doing a merge) it seems
> > to me that the "iterate over the data without constructing objects"
> > approach means that the data may not be internally consistent.  But then
> > we don't really check that issue in the other methods either.  So let's
> > go with the fail-fast BitCountConsumer approach.
>
> My assumption on this is that no filter is safe for concurrent use.
> Allowing a merge while reading the state is a concurrency issue. I don't
> think we want to go down the route of having fail-fast concurrency
> checking using a 'modified count' strategy, i.e. all operations save the
> current 'modified count' at the start and increment it at the end, if it
> changes before the end then there has been concurrent modification. It
> would be simpler to state that no filters are designed to be used
> concurrently.
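>
> For illustration only, the fail-fast pattern we would be avoiding is
> roughly (hypothetical sketch):
>
>      private int modCount;
>
>      boolean merge(BloomFilter other) {
>          final int expected = modCount;
>          // ... perform the merge ...
>          if (modCount != expected) {
>              throw new ConcurrentModificationException();
>          }
>          modCount = expected + 1;
>          return true;
>      }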
>
> >
> > On the StaticHasher it is used for two purposes:
> >
> > 1) getting a list of enabled bits.
> > 2) creating a bloom filter from a list of enabled bits.  There was a
> > request early on for the ability to test membership in a Bloom filter
> > without having to construct a bloom filter.  Basically, the developer
> > has a list of values and wants to check membership.
>
> I think the issue is that for 1 and 2 there are a few scenarios with
> different requirements since the list of enabled bits can be small or big.
>
> One feature of the StaticHasher is the requirement to remove duplicates
> and return a sorted order. This allows for an efficient storage form
> factor with the intention that it may be around for a while. But it is
> an overhead when creating a large one from a known set of non-duplicate
> bits, i.e. a Bloom filter.
>
> For (2) if you want to do a dynamic check to test membership then you
> use BloomFilter.contains(Hasher). I would pass an on-the-fly Hasher for
> this and not construct the entire set of indices into a StaticHasher
> anyway.
>
> This would be done using the DynamicHasher to build hashes on-the-fly,
> which may have duplicates. For the use case where you already know the
> bit indices but do not want to create a BloomFilter or a StaticHasher
> then perhaps we need another Hasher that is a wrapper around an int[]
> that allows duplicates.
>
> // Returns an unordered set of values that may contain duplicates.
>
>      static Hasher wrap(final int[] values, final Shape shape) {
>          Objects.requireNonNull(values);
>          Objects.requireNonNull(shape);
>          return new Hasher() {
>              @Override
>              public OfInt getBits(Shape sameShape) {
>                  if (!shape.equals(sameShape)) {
>                      throw new IllegalArgumentException();
>                  }
>                  return Arrays.stream(values).iterator();
>              }
>
>              @Override
>              public HashFunctionIdentity getHashFunctionIdentity() {
>                  return shape.getHashFunctionIdentity();
>              }
>
>              @Override
>              public boolean isEmpty() {
>                  return values.length == 0;
>              }
>          };
>      }
>
>
> > It might make more sense to reimplement the StaticHasher using a long[]
> > internally (like the BloomFilter.getBits() method returns).  We have the
> > shape so we know the maximum length and we could trim it after loading to
> > remove trailing zero value longs.  This would remove the necessity of
> > converting the iterator into a set of some sort.  We can do the limit
> > checks quickly as we turn bits on.
> For small numbers of indices the long[] format is likely to be less
> space efficient.
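>
> For reference, the fill itself would be simple enough, e.g. as a sketch
> (assuming Shape.getNumberOfBits()):
>
>      final long[] bits = new long[(shape.getNumberOfBits() + 63) / 64];
>      while (iter.hasNext()) {
>          final int index = iter.next();
>          bits[index >>> 6] |= 1L << index;  // shift amount is index % 64
>      }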
> >
> > Makes me think we might need to implement StandardBloomFilter to use
> long[]
> > as well.
>
> That is just the BitSetBloomFilter. So if you use this format in the
> StaticHasher then you have no advantage over actually creating a
> BitSetBloomFilter.
>
> The StaticHasher is returned by the BloomFilter interface. You also have
> a method to return the bits using the long[] format. So there is
> redundancy here. To drop the redundancy I think you separate the uses
> for the canonical representation of the entire filter as a long[] and
> the uses for the set bits. If the latter is satisfied by the StaticHasher
> then it should be replaced by an iterator instead.
>
> Looking at the code the getHasher() method in BloomFilter is only used
> in HasherBloomFilter. This is to add more enabled bits to the current
> set it contains. It seems that this use case is better served by using a
> Set as the backing storage.
>
> This leaves the StaticHasher with no usage in the current library. It is
> not as fast as wrapping a provided array. It offers no expansion of
> storage for use in the HasherBloomFilter. It does however provide the
> smallest form for a compact BloomFilter representation. So it could be
> renamed to ImmutableHasher.
>
> I would think it best to consolidate all of this:
>
> - StaticHasher to stay as an immutable instance of sorted bit indices.
> The intention is an efficient compact form for a few bits of a
> BloomFilter that can be merged efficiently. It could be used with
> pre-existing indices but would have to eliminate duplicates and sort them.
> - Remove the redundant StaticHasher constructor accepting a StaticHasher
> instance since they are immutable.
> - Possibly add a method to StaticHasher to merge another StaticHasher to
> create a new instance.
> - Possibly add a method to StaticHasher to get the maximum index in the
> set (it already has size()). This helps convert it to a long[] or any
> other class that will call getBits() and use the iterator output.
>
> - BloomFilter to remove the getHasher() method. The long[] getBits()
> method to be documented as returning the canonical format of the filter.
> The getHasher() method is just duplicating this information in another
> format.
> - BloomFilter to have an iterator() method to get a
> PrimitiveIterator.OfInt. This replaces the getHasher() method which just
> provided access to the Shape and the PrimitiveIterator.OfInt of the
> Hasher. BloomFilter already has getShape() so it just needs an iterator.
>
> - HasherBloomFilter to use the iterator() method when merging a generic
> BloomFilter. It can detect instances of HasherBloomFilter and switch to
> a merge operation of its internal representation.
> - HasherBloomFilter to use different backing storage (e.g. a
> Set<Integer>) or adapt StaticHasher to be more efficient for merge.
>
> - Add a new class akin to StaticHasher that allows duplicates. Document
> it as a fast way to create a few bits of a BloomFilter.
>
> You then have:
>
> Hasher
>
> - StaticHasher - immutable set of ordered non-duplicate bits. Can be
> merged with another instance.
>
> - DynamicHasher - generates bits on the fly using a hashing function
>
> - FixedHasher - an array of indices. Can wrap an input array or be
> created using constructors similar to StaticHasher but without the
> guarantees of non-duplicates.
>
>
> Lots to think about there.
>
> >
> > Claude
> >
> > On Mon, Mar 2, 2020 at 1:12 PM Alex Herbert <alex.d.herb...@gmail.com>
> > wrote:
> >
> >> On 02/03/2020 11:32, Claude Warren wrote:
> >>> my thought on changing the BloomFilter.merge() to return a boolean is
> >> along
> >>> the lines you had: return true on successful merge (even if there are
> no
> >>> changes in the enabled bits). And yes, for most cases the standard
> bloom
> >>> filter will return true, but the change is really to support extensions
> >> to
> >>> Bloom filter so I think it is reasonable.
> >>>
> >>> As for the getCounts().  Suppose we split it into two methods:
> >>>
> >>>       // return the indexes that are enabled.  This is equivalent to
> >>>       // CountingBloomFilter.getHasher().getBits( CountingBloomFilter.getShape() );
> >>>       PrimitiveIterator.OfInt getIndexes()
> >>>       // get the count for the specific index.
> >>>       int getCount( int index );
> >>>
> >>> With these methods It becomes possible to construct an iterator of
> int[]
> >> or
> >>> Map.Entry<Integer,Integer> or whatever else the developer wants.
> >> Having to call getCount() for every index is likely to be slower than
> >> all the garbage collection involved with iterating a disposable <index,
> >> count> pair. For an array backed storage the access will be order(1).
> >> But other storage may be a lot slower. This method ideally should
> >> traverse the storage once and make each <index, count> pair available.
> >> Ideally this would be without creating each <index, count> pair as a
> >> disposable object.
> >>
> >> Following on from previously, another disadvantage of the consumer
> >> approach is the lack of fast fail. You can remedy this by adding a
> >> boolean return from the consumer to indicate if you want to consume more
> >> items:
> >>
> >> interface BitCountConsumer {
> >>       boolean accept(int index, int count);
> >> }
> >>
> >> A CountingBloomFilter implementation then has something like:
> >>
> >>       void getCounts(BitCountConsumer consumer) {
> >>           while(hasNext()) {
> >>               next();
> >>               if (!consumer.accept(index(), count())) {
> >>                   break;
> >>               }
> >>           }
> >>       }
> >>
> >> Looking at the getHasher().getBits(...) idea I noticed that StaticHasher
> >> is constructed using:
> >>
> >> Iterator<Integer> iter
> >>
> >> not:
> >>
> >> PrimitiveIterator.OfInt
> >>
> >> I'd prefer the primitive specialisation in this constructor. It would
> >> break the HasherBloomFilter merge function but that can be fixed.
> >> However it may be redundant. The StaticHasher has some oddities with the
> >> available constructors:
> >>
> >> public StaticHasher(final StaticHasher hasher, final Shape shape)
> >>
> >> - why have a shape parameter? The constructor checks the shape is the
> >> same as that for the hasher. So why not just a plain copy constructor:
> >>
> >> public StaticHasher(final StaticHasher hasher)
> >>
> >> This constructor is only used in the unit tests. Given the StaticHasher
> >> is immutable then a copy constructor seems odd. I think it should be
> >> dropped.
> >>
> >>
> >> Here is a constructor that is missing IMO:
> >>
> >> public StaticHasher(final Set<Integer> indices, final Shape shape)
> >>
> >> - Could be used to generically create a static hasher. The indices are
> >> ensured to be unique by the Set but should be checked to be within the
> >> shape.
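> >>
> >> As a sketch (field names hypothetical; assuming Shape.getNumberOfBits())
> >> it could validate on construction:
> >>
> >>       public StaticHasher(final Set<Integer> indices, final Shape shape) {
> >>           this.shape = shape;
> >>           this.values = indices.stream().mapToInt(Integer::intValue)
> >>                                .sorted().toArray();
> >>           if (values.length != 0 && (values[0] < 0
> >>                   || values[values.length - 1] >= shape.getNumberOfBits())) {
> >>               throw new IllegalArgumentException("Index not within the shape");
> >>           }
> >>       }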
> >>
> >>
> >> With these constructors the public StaticHasher(final Iterator<Integer>
> >> iter, final Shape shape) is only used by the BitSetBloomFilter:
> >>
> >>       public StaticHasher getHasher() {
> >>           return new StaticHasher(bitSet.stream().iterator(), getShape());
> >>       }
> >>
> >> Becomes:
> >>
> >>       public StaticHasher getHasher() {
> >>           return new StaticHasher(
> >>               bitSet.stream().boxed().collect(Collectors.toCollection(TreeSet::new)),
> >>               getShape());
> >>       }
> >>
> >> This is a very roundabout way to construct a StaticHasher. Perhaps a
> >> constructor accepting a BitSet would be useful?
> >>
> >> So the API only requires:
> >>
> >> public StaticHasher(final Hasher hasher, final Shape shape)
> >> public StaticHasher(final Set<Integer> indices, final Shape shape)
> >> // Maybe
> >> public StaticHasher(final BitSet indices, final Shape shape)
> >>
> >>
> >> I wondered about a merge function for the StaticHasher:
> >>
> >> public StaticHasher merge(final StaticHasher other)
> >>
> >> The merge function can make some assumptions about the two arrays to
> >> merge as the class is final and the values are verified to be sorted and
> >> within the Shape. However there is not a use for it other than in the
> >> merge of two HasherBloomFilters, which is not a recommended operation
> >> anyway.
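> >>
> >> If it were needed, the merge itself is the classic two-pointer union
> >> of sorted, duplicate-free arrays (a sketch):
> >>
> >>       static int[] union(final int[] a, final int[] b) {
> >>           final int[] out = new int[a.length + b.length];
> >>           int i = 0, j = 0, k = 0;
> >>           while (i < a.length && j < b.length) {
> >>               if (a[i] < b[j]) {
> >>                   out[k++] = a[i++];
> >>               } else if (a[i] > b[j]) {
> >>                   out[k++] = b[j++];
> >>               } else {
> >>                   out[k++] = a[i++];  // equal: keep one copy
> >>                   j++;
> >>               }
> >>           }
> >>           while (i < a.length) { out[k++] = a[i++]; }
> >>           while (j < b.length) { out[k++] = b[j++]; }
> >>           return k == out.length ? out : Arrays.copyOf(out, k);
> >>       }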
> >>
> >>
> >>> Claude
> >>>
> >>> On Mon, Mar 2, 2020 at 10:48 AM Alex Herbert <alex.d.herb...@gmail.com> wrote:
> >>>
> >>>> On 02/03/2020 09:38, Claude Warren wrote:
> >>>>> It is not too late to update the BloomFIlter interface to have the
> >> merge
> >>>>> return a boolean.  The incorrect Shape would still throw an
> exception,
> >> so
> >>>>> the return value would only come into play if the bits could not be
> >> set.
> >>>>> thoughts?
> >>>> I don't see the harm in it. But what would the return value be for?
> >>>>
> >>>> For a standard collection it would be if the collection was changed by
> >>>> the operation:
> >>>>
> >>>> Collection.add/remove return "true if this collection changed as a
> >>>> result of the call"
> >>>>
> >>>> So here is the equivalent:
> >>>>
> >>>> return "true if this filter was changed as a result of the call"
> >>>>
> >>>> This is computationally slow to track. It also is confusing if the
> >>>> filter was successfully merged but no bits were changed to then return
> >>>> false because the filter was actually incorporated. So it would go
> along
> >>>> the lines that we discussed for the counting Bloom filter:
> >>>>
> >>>> return "true if this filter was successfully merged as a result of the
> >>>> call"
> >>>>
> >>>> For most cases in the current library it would be true when an
> exception
> >>>> is not thrown. However the merge of the counting Bloom filter may have
> >>>> reason to return false, e.g. overflow.
> >>>>
> >>>>> On Mon, Mar 2, 2020 at 7:56 AM Claude Warren <cla...@xenei.com>
> wrote:
> >>>>>
> >>>>>> for the remove(), add(), and subtract() methods I agree that void is
> >> not
> >>>>>> correct and it should be boolean and be the same as the value you
> >> would
> >>>> get
> >>>>>> from calling isValid().
> >>>>>>
> >>>>>> You are correct the getCounts() should return an iterator of some
> >>>>>> type on int[], I don't know why I thought long[].  I am happy with
> >>>>>> a plain Iterator<int[]> as the return.
> >>>> For the getCounts() method I am still looking for a way around having
> to
> >>>> create an <index, count> pair for everything in the filter. An
> >>>> alternative to an iterator is to use the consumer idea. Given there is
> >>>> no primitive specialisation of BiConsumer<T, U> in JDK 8 functions we
> >>>> define our own:
> >>>>
> >>>> interface BitCountConsumer {
> >>>>        void accept(int index, int count);
> >>>> }
> >>>>
> >>>> The CountingBloomFilter then has:
> >>>>
> >>>>        void forEachCount(BitCountConsumer consumer);
> >>>>        // Or
> >>>>        void getCounts(BitCountConsumer consumer);
> >>>>
> >>>>
> >>>> You can then directly pass the counts from the backing storage to the
> >>>> destination.
> >>>>
> >>>> Advantages:
> >>>> - No object creation
> >>>> Disadvantages
> >>>> - The counts cannot be streamed
> >>>>
> >>>> An alternative is to provide an Iterator of long with the index and
> >>>> count packed into a long with methods to extract them:
> >>>>
> >>>>       PrimitiveIterator.OfLong getCounts();
> >>>>
> >>>>       static int getIndex(long indexCountPair) {
> >>>>            return (int) (indexCountPair >>> 32);
> >>>>       }
> >>>>
> >>>>       static int getCount(long indexCountPair) {
> >>>>            return (int) indexCountPair;
> >>>>       }
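> >>>>
> >>>> For completeness the packing side would be, as a sketch:
> >>>>
> >>>>       static long pack(int index, int count) {
> >>>>            return ((long) index << 32) | (count & 0xFFFFFFFFL);
> >>>>       }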
> >>>>
> >>>> This will almost certainly be a cause for bugs/issues from users.
> >>>>
> >>>> I believe that the counts will be used for 2 main use cases:
> >>>>
> >>>> 1. Storage
> >>>>
> >>>> 2. Adding to another counting Bloom filter
> >>>>
> >>>> Both cases are likely to be done serially and not in parallel. So
> >>>> providing a consumer based API to receive the counts would work.
> >>>>
> >>>> WDYT?
> >>>>
> >>>>>> Claude
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>> On Mon, Mar 2, 2020 at 1:02 AM Alex Herbert <alex.d.herb...@gmail.com> wrote:
> >>>>>>
> >>>>>>>> On 1 Mar 2020, at 15:39, Claude Warren <cla...@xenei.com> wrote:
> >>>>>>>>
> >>>>>>>> I think the CountingBloomFilter interface needs to extend
> >> BloomFilter.
> >>>>>>> I said that but did not write it, sorry.
> >>>>>>>
> >>>>>>>> I think I am confused.
> >>>>>>>>
> >>>>>>>> I would expect CountingBloomFilter to have
> >>>>>>>>
> >>>>>>>> interface CountingBloomFilter extends BloomFilter {
> >>>>>>>>       // these 2 methods are the reverse of merge()
> >>>>>>>>       void remove( BloomFilter );
> >>>>>>>>       void remove( Hasher );
> >>>>>>> Fine. Same intention but different method names. But why void? It
> >>>> forces
> >>>>>>> you to check if the remove was valid with a second call. On the
> plus
> >>>> side
> >>>>>>> it matches the void merge(…) methods and in most cases a user would
> >> not
> >>>>>>> care to check anyway. If they are controlling the filter then leave
> >> it
> >>>> to
> >>>>>>> them to make sure they do not remove something they did not add.
> >>>>>>>
> >>>>>>>>       // these 2 methods are the addition/subtraction of counts
> >>>>>>>>       void add( CountingBloomFilter )
> >>>>>>>>       void subtract( CountingBloomFilter );
> >>>>>>> Fine. But same comment as above with the void return.
> >>>>>>>
> >>>>>>>>       // 2 methods to retrieve data
> >>>>>>>>       Stream<long[]> getCounts();
> >>>>>>> I don’t like this use of long[]. In my previous post I argued that
> if
> >>>> you
> >>>>>>> were to ever want to store more than max integer items in a filter
> >>>> then the
> >>>>>>> Bloom filter would have more bit indices than max integer. So we
> >> never
> >>>> have
> >>>>>>> to support long counts. A filter that exceeds max integer for a
> count
> >>>> is
> >>>>>>> highly likely to be saturated and no use as a filter anyway.
> >>>>>>>
> >>>>>>> For most backing implementations the object type of the stream will
> >> be
> >>>>>>> different so you will have to write a Spliterator<T> implementation
> >> or
> >>>> wrap
> >>>>>>> some iterator anyway. So why not return the Spliterator:
> >>>>>>>
> >>>>>>>       Spliterator<int[]> getCounts();
> >>>>>>>
> >>>>>>> Since the backing implementation will likely not store int[] pairs
> >> then
> >>>>>>> this will have a lot of object creation and garbage collection
> >>>> overhead to
> >>>>>>> go through the counts. This does not seem to be a big concern here
> if
> >>>> the
> >>>>>>> purpose is the same as for the BloomFilter for long[] getBits(),
> i.e.
> >>>> to
> >>>>>>> get a canonical representation for storage.
> >>>>>>>
> >>>>>>> Note: The Spliterator may not have a known size (number of non zero
> >>>> bits)
> >>>>>>> at creation, for example if the counts are stored in a fixed size
> >>>> array.
> >>>>>>> Thus efficient parallel traversal by binary tree splitting is
> limited
> >>>> by
> >>>>>>> how evenly the counts are distributed. For a backing implementation
> >>>> using a
> >>>>>>> collection then the size should be known. In this case a
> Spliterator
> >>>> would
> >>>>>>> be of more use than a plain Iterator. You can convert one to the
> >> other
> >>>>>>> using:
> >>>>>>>
> >>>>>>> java.util.Spliterators:
> >>>>>>> public static <T> Iterator<T> iterator(Spliterator<? extends T> spliterator)
> >>>>>>> public static <T> Spliterator<T> spliterator(Iterator<? extends T> iterator,
> >>>>>>>                                              long size,
> >>>>>>>                                              int characteristics)
> >>>>>>> So which to choose for the API?
> >>>>>>>
> >>>>>>> The Hasher currently uses an Iterator:
> >>>>>>>
> >>>>>>> PrimitiveIterator.OfInt getBits(Shape shape);
> >>>>>>>
> >>>>>>> In the case of a StaticHasher this could return a spliterator. But
> >> the
> >>>>>>> DynamicHasher needs a reworking of the internal Iterator class. It
> >>>> could be
> >>>>>>> a Spliterator to use the new IntConsumer API but in most (all)
> cases
> >>>>>>> splitting would not be possible for dynamic hashing as the parts
> are
> >>>>>>> produced in order. It is likely that they will be consumed
> >>>> sequentially too.
> >>>>>>> I would suggest that Spliterator is the more modern implementation,
> >>>>>>> despite not always being applicable to parallelisation in a stream.
> >>>>>>> Currently the iterator from the Hasher is used in
> forEachRemaining()
> >>>> and
> >>>>>>> while loop is approximately equal measure. The while loops are for
> a
> >>>> fast
> >>>>>>> exit and would be uglier if rewritten for a
> >>>>>>> Spliterator.tryAdvance(IntConsumer) syntax.
> >>>>>>>
> >>>>>>> There is a use of the IteratorChain in HasherBloomFilter that would
> >>>> need
> >>>>>>> a rethink for spliterators.
> >>>>>>>
> >>>>>>> The path of least resistance is to use Iterator<int[]> for the API
> of
> >>>>>>> CountingBloomFilter to be consistent with Hasher’s use of Iterator.
> >>>>>>>
> >>>>>>> WDYT?
> >>>>>>>
> >>>>>>>
> >>>>>>>>       boolean isValid()
> >>>>>>> Fine. Allows some level of feedback that the counts are corrupt.
> >>>>>>>
> >>>>>>>> }
> >>>>>>>>
> >>>>>>>> Claude
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> On Sun, Mar 1, 2020 at 2:48 PM Alex Herbert <alex.d.herb...@gmail.com> wrote:
> >>>>>>>>
> >>>>>>>>>> On 1 Mar 2020, at 09:28, Claude Warren <cla...@xenei.com>
> wrote:
> >>>>>>>>>>
> >>>>>>>>>> The idea of a backing array is fine and the only problem I see
> >> with
> >>>>>>> it is
> >>>>>>>>>> in very large filters (on the order of 10^8 bits and larger) but
> >>>>>>> document
> >>>>>>>>>> the size calculation and let the developer worry about it.
> >>>>>>>>> Let us look at the use case where we max out the array. Using the
> >>>> Bloom
> >>>>>>>>> filter calculator:
> >>>>>>>>>
> >>>>>>>>> n = 149,363,281
> >>>>>>>>> p = 0.001000025 (1 in 1000)
> >>>>>>>>> m = 2147483647 (256MiB)
> >>>>>>>>> k = 10
> >>>>>>>>>
> >>>>>>>>> n = 74,681,641
> >>>>>>>>> p = 0.000001 (1 in 999950)
> >>>>>>>>> m = 2147483647 (256MiB)
> >>>>>>>>> k = 20
> >>>>>>>>>
> >>>>>>>>> n = 49,787,761
> >>>>>>>>> p = 0.000000001 (1 in 999924899)
> >>>>>>>>> m = 2147483647 (256MiB)
> >>>>>>>>> k = 30
> >>>>>>>>>
> >>>>>>>>> So you will be able to put somewhere in the order of 10^8 or 10^7
> >>>> items
> >>>>>>>>> into the filter. I would say that anyone putting more than that
> >> into
> >>>>>>> the
> >>>>>>>>> filter has an unusual use case. The CountingBloomFilter can throw
> >> an
> >>>>>>>>> exception if m is too large and will throw an OutOfMemoryError if
> >> you
> >>>>>>>>> cannot allocate an array large enough.
> >>>>>>>>>
> >>>>>>>>> One clear point here is that you cannot use a short as a 16-bit
> >> count
> >>>>>>>>> would easily overflow. So you have to use an integer array for
> the
> >>>>>>> counts.
> >>>>>>>>> A maximum length int[] is roughly 8GB.
> >>>>>>>>>
> >>>>>>>>> What would another implementation cost in terms of memory? The
> >>>>>>>>> TreeMap<Integer, Integer> was the most space efficient. In the
> >>>>>>>>> previous e-mail the saturation of a Bloom filter's bits was
> >>>>>>>>> approximately 50% when at the intended capacity. So we have to
> >>>>>>>>> estimate the size of a TreeMap containing Integer.MAX_VALUE/2
> >>>>>>>>> indices ~ 2^30. The memory test shows the TreeMap memory scales
> >>>>>>>>> linearly with entries:
> >>>>>>>>>
> >>>>>>>>>  32768 /  65536 (0.500) : TreeMap<Integer, Integer> = 1834061 bytes
> >>>>>>>>>  65536 / 131072 (0.500) : TreeMap<Integer, Integer> = 3669080 bytes
> >>>>>>>>> 131072 / 262144 (0.500) : TreeMap<Integer, Integer> = 7339090 bytes
> >>>>>>>>> So what is the memory for a TreeMap with 2^30 indices? I make it
> >>>>>>>>> about:
> >>>>>>>>> (2^30 / 131,072) * 7,339,090 bytes ~ 6e10 bytes = 55.99 GB
> >>>>>>>>>
> >>>>>>>>> I would say that this amount of RAM is unusual. It is definitely
> >> not
> >>>> as
> >>>>>>>>> efficient as using an array. So very large counting Bloom filters
> >> are
> >>>>>>> going
> >>>>>>>>> to require some thought as to the hardware they run on. This may
> >> not
> >>>>>>> be the
> >>>>>>>>> case in 10 years time.
> >>>>>>>>>
> >>>>>>>>> I would say that we try an int[] backing array for the storage
> >>>>>>>>> implementation and document it’s limitations. A different
> >>>>>>> implementation
> >>>>>>>>> could be provided in future if required.
> >>>>>>>>>
> >>>>>>>>> This could be done by making CountingBloomFilter an interface
> that
> >>>>>>> extends
> >>>>>>>>> BloomFilter with the methods:
> >>>>>>>>>
> >>>>>>>>> subtract(BloomFilter filter)
> >>>>>>>>> subtract(Hasher filter)
> >>>>>>>>>
> >>>>>>>>> These will negate the effect of the corresponding
> >> merge(BloomFilter)
> >>>>>>>>> operation.
> >>>>>>>>>
> >>>>>>>>> Do we also need access to the counts and add/subtract of another
> >>>>>>>>> CountingBloomFilter?:
> >>>>>>>>>
> >>>>>>>>> add(CountingBloomFilter filter);
> >>>>>>>>> subtract(CountingBloomFilter filter);
> >>>>>>>>>
> >>>>>>>>> Iterator<int[]> getCounts();
> >>>>>>>>> int getSize(); // Number of items added
> >>>>>>>>>
> >>>>>>>>> The CountingBloomFilter is then an interface that defines how to
> >>>>>>> reverse
> >>>>>>>>> the merge of some bits into the filter.
> >>>>>>>>>
> >>>>>>>>> My concern is the inefficiency of the creation of objects in any
> >>>> method
> >>>>>>>>> that provides access to the counts (e.g. above using an iterator
> as
> >>>> for
> >>>>>>>>> Hasher.getBits()). I presume this method would be to allow some
> >> type
> >>>> of
> >>>>>>>>> storage/serialisation of the filter akin to the long[] getBits()
> >>>>>>> method of
> >>>>>>>>> BloomFilter. So it may be better done using a method:
> >>>>>>>>>
> >>>>>>>>> int getCount(int index);
> >>>>>>>>>
> >>>>>>>>> The caller can then use long[] getBits() to get the indices set
> in
> >>>> the
> >>>>>>>>> filter and then for each non-zero bit index call getCount(index).
> >> Or
> >>>>>>> just
> >>>>>>>>> not include the method as the counts are only of concern when
> >> storing
> >>>>>>> the
> >>>>>>>>> filter. This functionality is cleaner pushed into an
> >> implementation.
> >>>>>>>>> In a previous post we discussed whether to throw an exception on
> >>>>>>>>> overflow/underflow or raise in an invalid flag. Using the invalid
> >>>> flag
> >>>>>>> idea
> >>>>>>>>> the interface would be:
> >>>>>>>>>
> >>>>>>>>> interface CountingBloomFilter {
> >>>>>>>>>       int add(CountingBloomFilter filter);
> >>>>>>>>>       int subtract(BloomFilter filter);
> >>>>>>>>>       int subtract(Hasher filter);
> >>>>>>>>>       int subtract(CountingBloomFilter filter);
> >>>>>>>>>       int getStatus();
> >>>>>>>>>
> >>>>>>>>>       // Maybe
> >>>>>>>>>       int getSize();
> >>>>>>>>>       int getCount(int index);
> >>>>>>>>> }
> >>>>>>>>>
> >>>>>>>>> The status would be negative if any operation
> >> overflowed/underflowed,
> >>>>>>> or
> >>>>>>>>> zero if OK. The current status is returned by the add/subtract
> >>>> methods.
> >>>>>>>>> However I note that overflow may not be a concern. The number of
> >>>> items
> >>>>>>> to
> >>>>>>>>> add to a filter to create overflow would be using a filter with a
> >>>>>>> number of
> >>>>>>>>> bits that is unrealistic to store in memory:
> >>>>>>>>>
> >>>>>>>>> n = 2147483647
> >>>>>>>>> p = 0.001000025 (1 in 1000)
> >>>>>>>>> m = 30875634182 (3.59GiB)
> >>>>>>>>> k = 10
> >>>>>>>>>
> >>>>>>>>> If you want to add 2 billion items (and overflow an integer
> count)
> >>>> then
> >>>>>>>>> your filter would be so big it would break the rest of the API
> that
> >>>>>>> uses a
> >>>>>>>>> 32-bit int for the bit index.
> >>>>>>>>>
> >>>>>>>>> Thus only underflow is a realistic concern. This could be
> >> documented
> >>>> as
> >>>>>>>>> handled in an implementation specific manner (i.e. throw or
> >> ignore).
> >>>>>>> The
> >>>>>>>>> API is then simplified to:
> >>>>>>>>>
> >>>>>>>>> interface CountingBloomFilter {
> >>>>>>>>>       boolean add(CountingBloomFilter filter);
> >>>>>>>>>       boolean subtract(BloomFilter filter);
> >>>>>>>>>       boolean subtract(Hasher filter);
> >>>>>>>>>       boolean subtract(CountingBloomFilter filter);
> >>>>>>>>>       int getStatus();
> >>>>>>>>>
> >>>>>>>>>       // Maybe
> >>>>>>>>>       int getSize();
> >>>>>>>>>       int getCount(int index);
> >>>>>>>>> }
> >>>>>>>>>
> >>>>>>>>> The boolean is used to state that add/subtract did not
> >>>> over/underflow.
> >>>>>>>>> Implementations can throw if they require it.
> >>>>>>>>>
> >>>>>>>>> The question then becomes what does getSize() represent if an
> >>>>>>> add/subtract
> >>>>>>>>> did not execute cleanly. Under this scheme it would be the number
> >> of
> >>>>>>> (add -
> >>>>>>>>> subtract) operations. The status flag would be used to indicate
> if
> >>>> the
> >>>>>>> size
> >>>>>>>>> is valid, or any of the counts from getCount(). The simpler API
> is
> >> to
> >>>>>>> not
> >>>>>>>>> allow access to counts/size or adding/subtracting counts:
> >>>>>>>>>
> >>>>>>>>> interface CountingBloomFilter {
> >>>>>>>>>       boolean subtract(BloomFilter filter);
> >>>>>>>>>       boolean subtract(Hasher filter);
> >>>>>>>>>       int getStatus();
> >>>>>>>>>       // Or something like ...
> >>>>>>>>>       boolean isValid();
> >>>>>>>>> }
> >>>>>>>>>
> >>>>>>>>> This filter is then only concerned with reversing the merge of
> >> Bloom
> >>>>>>>>> filters with a valid status flag to indicate that the current
> state
> >>>> is
> >>>>>>>>> consistent (i.e. all filters have been cleanly
> merged/subtracted).
> >>>>>>>>>
> >>>>>>>>> WDYT?
> >>>>>>>>>
> >>>>>>>>>> As for the merge question.  merge is a standard bloom filter
> >>>>>>> operation.
> >>>>>>>>> It
> >>>>>>>>>> is well defined in the literature.  Merging a bloom filter into
> a
> >>>>>>>>> counting
> >>>>>>>>>> bloom filter means incrementing the bit counts.  I  think that
> >>>>>>>>> merge/remove
> >>>>>>>>>> should continue to operate as though the parameter were a
> standard
> >>>>>>> bloom
> >>>>>>>>>> filter.
> >>>>>>>>>>
> >>>>>>>>> OK. So the count is to represent the number of filters that had a
> >> bit
> >>>>>>> set
> >>>>>>>>> at that index. This makes it more clear.
> >>>>>>>>>
> >>>>>>>>>> We had spoken of implementing and adding/deleting method pair
> that
> >>>>>>> would
> >>>>>>>>>> operate on CountingBloomFilters and would add/subtract the
> counts.
> >>>>>>> (e.g.
> >>>>>>>>>> add(CountingBloomFilter) and subtract(CountingBloomFilter))
> >>>>>>>>>>
> >>>>>>>>>> I disagree with your proposal for the merge(Hasher)
> >> implementation,
> >>>>>>> and I
> >>>>>>>>>> am not certain that an add(Hasher) makes sense.  First consider
> >> that
> >>>>>>> the
> >>>>>>>>>> Hasher returns the bits that are to be enabled in the Bloom
> filter
> >>>> so
> >>>>>>>>>> collisions are expected.  In the normal course of events a
> Hasher
> >> is
> >>>>>>> used
> >>>>>>>>>> to create a normal Bloom filter where all the duplicates are
> >>>> removed.
> >>>>>>>>> That
> >>>>>>>>>> filter is then merged into a CountingBloomFilter.  So in some
> >> sense
> >>>>>>> the
> >>>>>>>>>> Hasher and the normal Bloom filter are the same.  So I would
> >> expect
> >>>>>>> the
> >>>>>>>>>> merge of the Hasher and the merge of the normal Bloom filter
> >> created
> >>>>>>> from
> >>>>>>>>>> that hasher into a CountingBloomFilter to yield the same result.
> >> If
> >>>>>>> you
> >>>>>>>>>> wanted to add an add(Hasher)/delete(Hasher) pair of functions
> to a
> >>>>>>>>>> CountingBloomFilter you could implement with duplicate counting,
> >> but
> >>>>>>> I am
> >>>>>>>>>> not certain of the validity of such a count and I fear that it
> >>>> muddies
> >>>>>>>>> the
> >>>>>>>>>> waters with respect to what the CountingBloomFilter is counting.
> >>>>>>>>> Agreed.
> >>>>>>>>>
> >>>>>>>>>> Claude
> >>>>>>>>>>
> >>>>>>>>>> On Sat, Feb 29, 2020 at 2:13 PM Alex Herbert <alex.d.herb...@gmail.com> wrote:
> >>>>>>>>>>
> >>>>>>>>>>>> On 29 Feb 2020, at 07:46, Claude Warren <cla...@xenei.com> wrote:
> >>>>>>>>>>>> Alex would you take a look at pull request 131 [1]. It adds a
> >>>>>>>>>>>> new hasher implementation and makes the HashFunctionValidator
> >>>>>>>>>>>> available for public use.
> >>>>>>>>>>>> [1] https://github.com/apache/commons-collections/pull/131
> >>>>>>>>>>> OK. I’ll take a look.
> >>>>>>>>>>>
> >>>>>>>>>>> I’ve been thinking about the counting Bloom filter and the
> >> backing
> >>>>>>>>>>> storage. In summary:
> >>>>>>>>>>>
> >>>>>>>>>>> 1. The backing storage should be a fixed array.
> >>>>>>>>>>> 2. Merging a Hasher should count duplicate indices, not flatten
> >>>> them
> >>>>>>> all
> >>>>>>>>>>> to a single count.
> >>>>>>>>>>>
> >>>>>>>>>>> For background I’ve used the standard formulas to estimate the
> >>>>>>> number of
> >>>>>>>>>>> indices that will be non-zero in a Bloom filter. The wikipedia
> >> page
> >>>>>>>>> gives
> >>>>>>>>>>> this formula for the expected number of bits set to 0 (E(q)) if
> >> you
> >>>>>>> have
> >>>>>>>>>>> inserted i elements into a filter of size m using k hash
> >> functions:
> >>>>>>>>>>> E(q) = (1 - 1/m)^(ki)
> >>>>>>>>>>>
> >>>>>>>>>>> So a rough guess of the number of indices (bits) used by a
> filter
> >>>> is
> >>>>>>>>>>> 1-E(q).
> >>>>>>>>>>>
> >>>>>>>>>>> Here is a table of Bloom filters with different collision
> >>>>>>> probabilities
> >>>>>>>>>>> and the proportion of bits that will be set when 1%, 10%, 100%
> of
> >>>> the
> >>>>>>>>>>> capacity of the filter has been met:
> >>>>>>>>>>>
> >>>>>>>>>>> n       p       m       k       I       E(q)    bits
> >>>>>>>>>>> 1000    1E-04   19171   13      10      0.9932  0.0068
> >>>>>>>>>>> 1000    1E-04   19171   13      100     0.9344  0.0656
> >>>>>>>>>>> 1000    1E-04   19171   13      1000    0.5076  0.4924
> >>>>>>>>>>> 1000    1E-05   23963   17      10      0.9929  0.0071
> >>>>>>>>>>> 1000    1E-05   23963   17      100     0.9315  0.0685
> >>>>>>>>>>> 1000    1E-05   23963   17      1000    0.4919  0.5081
> >>>>>>>>>>> 1000    1E-06   28756   20      10      0.9931  0.0069
> >>>>>>>>>>> 1000    1E-06   28756   20      100     0.9328  0.0672
> >>>>>>>>>>> 1000    1E-06   28756   20      1000    0.4988  0.5012
> >>>>>>>>>>> 10000   1E-06   287552  20      100     0.9931  0.0069
> >>>>>>>>>>> 10000   1E-06   287552  20      1000    0.9328  0.0672
> >>>>>>>>>>> 10000   1E-06   287552  20      10000   0.4988  0.5012
> >>>>>>>>>>>
> >>>>>>>>>>> The point is that if you create a Bloom filter and fill it to
> 10%
> >>>> of
> >>>>>>> the
> >>>>>>>>>>> intended capacity the number of indices used will be about 6-7%
> >> of
> >>>>>>> the
> >>>>>>>>>>> filter bits.
> >>>>>>>>>>>
> >>>>>>>>>>> So how to store the counts? Currently the counting bloom filter
> >>>> uses
> >>>>>>> a
> >>>>>>>>>>> TreeMap<Integer, Integer>. I tried:
> >>>>>>>>>>>
> >>>>>>>>>>> TreeMap<Integer, Integer>
> >>>>>>>>>>> HashMap<Integer, Integer>
> >>>>>>>>>>> TreeSet<MutableCount>
> >>>>>>>>>>> HashSet<MutableCount>
> >>>>>>>>>>> int[]
> >>>>>>>>>>>
> >>>>>>>>>>> The MutableCount is a custom object that stores the bit index
> and
> >>>>>>> uses
> >>>>>>>>> it
> >>>>>>>>>>> for a hash code and then has a mutable integer count field. It
> >>>> allows
> >>>>>>>>> the
> >>>>>>>>>>> count to be incremented/decremented if the object is in the
> set:
> >>>>>>>>>>>
> >>>>>>>>>>>      static final class MutableCount implements
> >>>>>>> Comparable<MutableCount> {
> >>>>>>>>>>>          final int key;
> >>>>>>>>>>>          int count;
> >>>>>>>>>>>          // etc
> >>>>>>>>>>>      }
> >>>>>>>>>>>
> >>>>>>>>>>> This is adapted from the Bag<T> collection which stores an item
> >>>> count
> >>>>>>>>> with
> >>>>>>>>>>> a MutableInteger. Here the mutable count is part of the same
> >>>> object T
> >>>>>>>>>>> inserted into the Set. So you can find the object, change the
> >> count
> >>>>>>> and
> >>>>>>>>> not
> >>>>>>>>>>> have to put it back into the set. This is more efficient than
> >>>>>>> changing
> >>>>>>>>> the
> >>>>>>>>>>> Integer stored in a Map.
> >>>>>>>>>>>
> >>>>>>>>>>> I’ve estimated memory usage using an idea based on this article
> >>>> from
> >>>>>>>>>>> JavaWorld: Java Tip 130: Do you know your data size? [1].
> >>>>>>>>>>>
> >>>>>>>>>>> Basically you:
> >>>>>>>>>>>
> >>>>>>>>>>> - create an object and throw it away. All classes are then
> >>>>>>> initialised.
> >>>>>>>>>>> - Then you free memory (run garbage collection) and get the
> >> current
> >>>>>>>>> memory
> >>>>>>>>>>> usage
> >>>>>>>>>>> - Then create a lot of your object (held in an array)
> >>>>>>>>>>> - Then measure memory usage again
> >>>>>>>>>>> - memory = (after - before) / count
> >>>>>>>>>>>
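> >>>>>>>>>>> In code the idea is roughly (a sketch; createObject() is a
> >>>>>>>>>>> placeholder factory for the object under test):
> >>>>>>>>>>>
> >>>>>>>>>>>      final Runtime rt = Runtime.getRuntime();
> >>>>>>>>>>>      createObject();                          // load classes
> >>>>>>>>>>>      System.gc();
> >>>>>>>>>>>      final long before = rt.totalMemory() - rt.freeMemory();
> >>>>>>>>>>>      final Object[] objects = new Object[count];
> >>>>>>>>>>>      for (int i = 0; i < count; i++) {
> >>>>>>>>>>>          objects[i] = createObject();
> >>>>>>>>>>>      }
> >>>>>>>>>>>      System.gc();
> >>>>>>>>>>>      final long after = rt.totalMemory() - rt.freeMemory();
> >>>>>>>>>>>      final long memoryPerObject = (after - before) / count;
> >>>>>>>>>>>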
> >>>>>>>>>>> Here is some example output for n bits set in size m:
> >>>>>>>>>>>
> >>>>>>>>>>> 13107 / 262144 (0.050) : TreeMap<Integer, Integer> =  733947 bytes
> >>>>>>>>>>> 26214 / 262144 (0.100) : TreeMap<Integer, Integer> = 1467866 bytes
> >>>>>>>>>>> 13107 / 262144 (0.050) : TreeSet<MutableCount>     =  838928 bytes
> >>>>>>>>>>> 26214 / 262144 (0.100) : TreeSet<MutableCount>     = 1677776 bytes
> >>>>>>>>>>> 13107 / 262144 (0.050) : HashMap<Integer, Integer> = 1677712 bytes
> >>>>>>>>>>> 26214 / 262144 (0.100) : HashMap<Integer, Integer> = 2306739 bytes
> >>>>>>>>>>> 13107 / 262144 (0.050) : HashSet<MutableCount>     = 1782664 bytes
> >>>>>>>>>>> 26214 / 262144 (0.100) : HashSet<MutableCount>     = 2516656 bytes
> >>>>>>>>>>>     0 / 262144 (0.000) : int[]                     = 1048608 bytes
> >>>>>>>>>>>     0 / 262144 (0.000) : short[]                   =  524320 bytes
> >>>>>>>>>>> The estimate is accurate to 0.0001% for the arrays so the
> >>>>>>>>>>> method is working. The HashMap was created with the capacity
> >>>>>>>>>>> set to the expected capacity of the filter (m).
> >>>>>>>>>>>
> >>>>>>>>>>> I’ve chosen these sizes because at 5% full a HashSet is less
> >> memory
> >>>>>>>>>>> efficient than using a fixed size array, and at 10% the TreeSet
> >> is
> >>>>>>> also
> >>>>>>>>>>> less efficient.
> >>>>>>>>>>>
> >>>>>>>>>>> Note that the java.util.Tree/HashSet versions just wrap a Map
> and
> >>>>>>>>> insert a
> >>>>>>>>>>> dummy object for all keys in the Map. So here a Set is not as
> >>>>>>> efficient
> >>>>>>>>> as
> >>>>>>>>>>> a Map because in the Map test I always inserted the same
> Integer
> >>>>>>> object
> >>>>>>>>>>> representing 1. This would be the same as using a Set with an
> >>>> Integer
> >>>>>>>>> key
> >>>>>>>>>>> but here the Set had to contain the MutableCount which has an
> >> extra
> >>>>>>> int
> >>>>>>>>>>> field and is larger than an Integer.
> >>>>>>>>>>>
> >>>>>>>>>>> These data lead me to think that a counting Bloom filter should
> >>>> just
> >>>>>>>>> use a
> >>>>>>>>>>> fixed size backing array:
> >>>>>>>>>>>
> >>>>>>>>>>> - If created using the same Shape as a standard Bloom filter it
> >>>> uses
> >>>>>>> a
> >>>>>>>>>>> fixed size. This has high memory cost when the filter is empty
> >> but
> >>>>>>> when
> >>>>>>>>> it
> >>>>>>>>>>> exceeds 10% of the intended capacity it is more efficient than
> a
> >>>>>>> dynamic
> >>>>>>>>>>> backing storage.
> >>>>>>>>>>> - All operations will operate in order(n) time for an operation
> >>>> with
> >>>>>>>>>>> another filter with n indices. Each individual index count in
> the
> >>>>>>> filter
> >>>>>>>>>>> will have order(1) time for access/update. Performance will be
> >>>>>>> limited
> >>>>>>>>> by
> >>>>>>>>>>> the memory cache of the entire array.
> >>>>>>>>>>>
> >>>>>>>>>>> The issue is that a counting Bloom filter with a very low
> number
> >> of
> >>>>>>>>>>> inserted items will be memory inefficient. But under what
> >>>>>>> circumstance
> >>>>>>>>> will
> >>>>>>>>>>> such a filter be used in a short-term lifecycle? If it is
> simply
> >> to
> >>>>>>>>> merge
> >>>>>>>>>>> into another filter then this can be done using a merge with a
> >>>>>>> Hasher.
> >>>>>>>>> If
> >>>>>>>>>>> counts are to be merged then perhaps we provide a method to
> merge
> >>>>>>> counts
> >>>>>>>>>>> using the same data structure returned by the
> CountingBloomFilter
> >>>>>>>>>>> getCounts() method, e.g. using a stream of <index,count> pairs:
> >>>>>>>>>>>
> >>>>>>>>>>> Stream<int[]> getCounts();
> >>>>>>>>>>> void add(Stream<int[]> counts);
> >>>>>>>>>>>
> >>>>>>>>>>> The issue here is the Shape and HashFunctionIdentity of the
> >> origin
> >>>> of
> >>>>>>>>> the
> >>>>>>>>>>> merge cannot be validated. So just leave it out and use the
> merge
> >>>>>>> with a
> >>>>>>>>>>> Hasher.
> >>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>> Thus the next issue with the counting Bloom filter
> >> implementation.
> >>>>>>>>>>> Currently when it merges with a Hasher it puts all the indices
> >>>> into a
> >>>>>>>>> Set
> >>>>>>>>>>> and so will only increment the count by 1 for each index
> >> identified
> >>>>>>> by
> >>>>>>>>> the
> >>>>>>>>>>> Hasher. This appears to miss the entire point of the counting
> >> Bloom
> >>>>>>>>> filter.
> >>>>>>>>>>> If I hash an object to generate k indices I would hope that I
> do
> >>>>>>> get k
> >>>>>>>>>>> indices. But the hash may not be perfect and I may get [1, k]
> >>>> indices
> >>>>>>>>> with
> >>>>>>>>>>> some duplications. This is part of the signature of that object
> >>>> with
> >>>>>>> the
> >>>>>>>>>>> given hash. So surely a counting Bloom filter should
> accommodate
> >>>>>>> this.
> >>>>>>>>> If
> >>>>>>>>>>> my Hasher generates the same index 20 times this should result
> in
> >>>> the
> >>>>>>>>> count
> >>>>>>>>>>> of that index incrementing by 20.
> >>>>>>>>>>>
> >>>>>>>>>>> The result is that if an object is added directly to a counting
> >> Bloom
> >>>>>>>>> filter
> >>>>>>>>>>> using a Hasher it will have a different result than if added
> to a
> >>>>>>>>> standard
> >>>>>>>>>>> Bloom filter and then that filter added to the counting Bloom
> >>>> filter.
> >>>>>>>>>>> Opinions on this?
> >>>>>>>>>>>
> >>>>>>>>>>> Alex
> >>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>> [1] http://www.javaworld.com/javaworld/javatips/jw-javatip130.html
> >>>>>>>>>>>> On Tue, Feb 25, 2020 at 12:35 AM Alex Herbert <alex.d.herb...@gmail.com> wrote:
> >>>>>>>>>>>>
> >>>>>>>>>>>>> I have created a PR that contains most of the changes
> discussed
> >>>> in
> >>>>>>>>> this
> >>>>>>>>>>>>> thread (see [1]).
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> Please review the changes and comment here or on GitHub.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> I have left the CountingBloomFilter alone. I will reimplement
> >> the
> >>>>>>>>>>>>> add/subtract functionality as discussed into another PR.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> Alex
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> [1] https://github.com/apache/commons-collections/pull/133
> >>>>>>>>>>>> --
> >>>>>>>>>>>> I like: Like Like - The likeliest place on the web
> >>>>>>>>>>>> <http://like-like.xenei.com>
> >>>>>>>>>>>> LinkedIn: http://www.linkedin.com/in/claudewarren
> >>>>>>>>>> --
> >>>>>>>>>> I like: Like Like - The likeliest place on the web
> >>>>>>>>>> <http://like-like.xenei.com>
> >>>>>>>>>> LinkedIn: http://www.linkedin.com/in/claudewarren
> >>>>>>>> --
> >>>>>>>> I like: Like Like - The likeliest place on the web
> >>>>>>>> <http://like-like.xenei.com>
> >>>>>>>> LinkedIn: http://www.linkedin.com/in/claudewarren
> >>>>>>>
> >>>>>> --
> >>>>>> I like: Like Like - The likeliest place on the web
> >>>>>> <http://like-like.xenei.com>
> >>>>>> LinkedIn: http://www.linkedin.com/in/claudewarren
> >>>>>>
> >>>>
> >>
>
>
>

-- 
I like: Like Like - The likeliest place on the web
<http://like-like.xenei.com>
LinkedIn: http://www.linkedin.com/in/claudewarren
