> On 3 Mar 2020, at 22:31, Claude Warren <cla...@xenei.com> wrote:
>
> take a look at
> https://github.com/apache/commons-collections/pull/131/files#diff-8b2bf046dc35c88908eef196937173e1
>
> This is a different Hasher with a smaller data footprint than DynamicHasher. Like the DynamicHasher it builds the values on the fly. Can Spliterators be implemented for them easily?
Yes. You use the same functions, just all in one go:

    public boolean tryAdvance(IntConsumer action) {
        if (hasNext()) {
            action.accept(nextInt());
            return true;
        }
        return false;
    }

You can convert between the two using java.util.Spliterators:

    public static PrimitiveIterator.OfInt iterator(Spliterator.OfInt spliterator)
    public static Spliterator.OfInt spliteratorUnknownSize(PrimitiveIterator.OfInt iterator, int characteristics)

The conversion of an iterator to a spliterator performed by this class behaves exactly like the code above.

> I am torn between the interesting uses of a Spliterator and the well-known usage pattern for an iterator. Perhaps we should implement both. It seems to me (without really looking) that we should be able to implement an Iterator that uses the Spliterator.
>
> Thoughts?

I think you could create a class that implements both at the same time, but most (all?) of the JDK collections have dedicated implementations of each. From what I have seen in the Bloom filter classes I think the Spliterator can be more efficient, and will not be less efficient, than an Iterator for traversing the filter/hasher elements.

> On Tue, Mar 3, 2020 at 11:54 AM Alex Herbert <alex.d.herb...@gmail.com> wrote:
>> On 02/03/2020 22:34, Claude Warren wrote:
>>> So what we have then is:
>>>
>>> public interface BloomFilter {
>>>     int andCardinality(BloomFilter other);
>>>     int cardinality();
>>>     boolean contains(BloomFilter other);
>>>     boolean contains(Hasher hasher);
>>>     long[] getBits();
>>>     // Change
>>>     PrimitiveIterator.OfInt iterator();
>>>     Shape getShape();
>>>     // Change
>>>     boolean merge(BloomFilter other);
>>>     // Change
>>>     boolean merge(Hasher hasher);
>>>     int orCardinality(BloomFilter other);
>>>     int xorCardinality(BloomFilter other);
>>> }
>>>
>>> public interface BitCountConsumer {
>>>     boolean accept(int index, int count);
>>> }
>>>
>>> public interface CountingBloomFilter extends BloomFilter {
>>>     // Change
>>>     boolean remove(BloomFilter other);
>>>     // Change
>>>     boolean remove(Hasher hasher);
>>>     // Change
>>>     boolean add(CountingBloomFilter other);
>>>     // Change
>>>     boolean subtract(CountingBloomFilter other);
>>>     // Change
>>>     void getCounts(BitCountConsumer consumer);
>>> }
>>>
>>> public final class StaticHasher implements Hasher {
>>>     public StaticHasher(final Hasher hasher, final Shape shape) { ... }
>>>     public StaticHasher(final Iterator<Integer> iter, final Shape shape) { ... }
>>>     public StaticHasher(final Collection<Integer> collection, final Shape shape) { ... }
>>>     public Shape getShape() { ... }
>>>     public int size() { ... }
>>>     // Change
>>>     public int max() { ... }
>>> }
>>>
>>> public final class FixedHasher implements Hasher {
>>>     public FixedHasher(final Hasher hasher, final Shape shape) { ... }
>>>     public FixedHasher(final Iterator<Integer> iter, final Shape shape) { ... }
>>>     public FixedHasher(final Collection<Integer> collection, final Shape shape) { ... }
>>>     public Shape getShape() { ... }
>>>     public int size() { ... }
>>>     public int max() { ... }
>>> }
>>>
>>> I have a pull request in to add a CachingHasher.
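(Back to the conversion methods above: a minimal, self-contained example of the round trip. The index values are arbitrary.)

    import java.util.Arrays;
    import java.util.PrimitiveIterator;
    import java.util.Spliterator;
    import java.util.Spliterators;

    public class ConversionExample {
        public static void main(String[] args) {
            // A primitive iterator over some example bit indices.
            PrimitiveIterator.OfInt iterator = Arrays.stream(new int[] {2, 3, 5, 7}).iterator();
            // Wrap it as a spliterator; its tryAdvance() delegates to
            // hasNext()/nextInt() exactly as in the snippet above.
            Spliterator.OfInt spliterator = Spliterators.spliteratorUnknownSize(iterator,
                Spliterator.ORDERED | Spliterator.DISTINCT | Spliterator.NONNULL);
            // And back again: an iterator view of the spliterator.
            PrimitiveIterator.OfInt roundTrip = Spliterators.iterator(spliterator);
            roundTrip.forEachRemaining((int i) -> System.out.println(i));
        }
    }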
>>> While we are at it, I think we should change Hasher.getBits(Shape) to Hasher.iterator(Shape) to match BloomFilter.iterator() and to limit confusion between Hasher.getBits(Shape) (which returns an iterator) and BloomFilter.getBits() (which returns long[]).
>>>
>>> The interface changes look good to me. I agree with your implementation ideas.
>>>
>>> I do have one use case where I need the BloomFilter.iterator() results in reverse order (I am estimating Log2 of the filter), but I can do that with the methods here. I don't think any of these changes significantly impacts the uses that I currently have.
>>>
>>> Shall we move forward?
>>>
>>> Claude

>> I finished a first draft of the CountingBloomFilter. I still have the tests to add for the new API.
>>
>> When working with this I had to write an Iterator (to create a Hasher for the soon to be changed API) which has very few lines of code, most of which are repeats due to the separation of hasNext() and nextInt() and the requirement to be able to call hasNext() repeatedly without calling nextInt(). This mandates storing the next value rather than a current position to search for a non-zero index:
>>
>>     class IndexIterator implements PrimitiveIterator.OfInt {
>>         int next;
>>
>>         IndexIterator() {
>>             advance(0);
>>         }
>>
>>         private void advance(int from) {
>>             next = from;
>>             while (next < counts.length && counts[next] == 0) {
>>                 next++;
>>             }
>>         }
>>
>>         @Override
>>         public boolean hasNext() {
>>             return next < counts.length;
>>         }
>>
>>         @Override
>>         public int nextInt() {
>>             if (hasNext()) {
>>                 final int result = next;
>>                 advance(next + 1);
>>                 return result;
>>             }
>>             throw new NoSuchElementException();
>>         }
>>     }
>>
>> This is noted in the Spliterator javadoc:
>>
>>  * <p>Spliterators, like {@code Iterator}s, are for traversing the elements of
>>  * a source. The {@code Spliterator} API was designed to support efficient
>>  * parallel traversal in addition to sequential traversal, by supporting
>>  * decomposition as well as single-element iteration. In addition, the
>>  * protocol for accessing elements via a Spliterator is designed to impose
>>  * smaller per-element overhead than {@code Iterator}, and to avoid the inherent
>>  * race involved in having separate methods for {@code hasNext()} and
>>  * {@code next()}.
>>
>> The same code written as a Spliterator has a much simpler tryAdvance method with only a single comparison of the current index to the upper limit per access of each element.
>>
>>     class IndexSpliterator implements Spliterator.OfInt {
>>         int index;
>>
>>         @Override
>>         public long estimateSize() {
>>             return Long.MAX_VALUE;
>>         }
>>
>>         @Override
>>         public int characteristics() {
>>             return Spliterator.SORTED | Spliterator.ORDERED |
>>                    Spliterator.DISTINCT | Spliterator.NONNULL;
>>         }
>>
>>         @Override
>>         public OfInt trySplit() {
>>             return null;
>>         }
>>
>>         @Override
>>         public boolean tryAdvance(IntConsumer action) {
>>             while (index < counts.length) {
>>                 if (counts[index++] != 0) {
>>                     // Emit the index of the non-zero count.
>>                     action.accept(index - 1);
>>                     return true;
>>                 }
>>             }
>>             return false;
>>         }
>>     }
>>
>> Or with splitting allowed (because we can).
>> The tryAdvance() is the same, just with a different end point:
>>
>>     class IndexSpliterator implements Spliterator.OfInt {
>>         int index;
>>         final int end;
>>
>>         IndexSpliterator() {
>>             this(0, counts.length);
>>         }
>>
>>         IndexSpliterator(int start, int end) {
>>             index = start;
>>             this.end = end;
>>         }
>>
>>         @Override
>>         public long estimateSize() {
>>             // This is an approximation.
>>             // A Bloom filter that is at capacity will have approximately
>>             // half of the bits set.
>>             return (end - index) / 2;
>>         }
>>
>>         @Override
>>         public int characteristics() {
>>             return Spliterator.SORTED | Spliterator.ORDERED |
>>                    Spliterator.DISTINCT | Spliterator.NONNULL;
>>         }
>>
>>         @Override
>>         public OfInt trySplit() {
>>             final int middle = (index + end) >>> 1;
>>             if (middle > index) {
>>                 final int start = index;
>>                 index = middle;
>>                 // Hand the lower half [start, middle) to the new spliterator.
>>                 return new IndexSpliterator(start, middle);
>>             }
>>             return null;
>>         }
>>
>>         @Override
>>         public boolean tryAdvance(IntConsumer action) {
>>             while (index < end) {
>>                 if (counts[index++] != 0) {
>>                     // Emit the index of the non-zero count.
>>                     action.accept(index - 1);
>>                     return true;
>>                 }
>>             }
>>             return false;
>>         }
>>     }
>>
>> What do you think to changing the use of Iterator to Spliterator for efficiency purposes?
>>
>> This would add:
>>
>>     BloomFilter {
>>         Spliterator.OfInt spliterator();
>>     }
>>
>>     Hasher {
>>         Spliterator.OfInt spliterator(Shape shape);
>>     }
>>
>>     CountingBloomFilter extends BloomFilter {
>>         // ???
>>         Spliterator<int[]> countSpliterator();
>>     }
>>
>> The final one is for completeness so you can use the streams API with the counts. I don't really like it. I'd prefer to pack the index and count into a long as stated previously to maximise efficiency. My suggestion would be to not have a spliterator for the counts and leave the access of <index, count> pairs to an explicit forEach consumer.
>>
>> Most of the current classes use a collection that provides a spliterator so the change is easy. The DynamicHasher Iterator would have to be rewritten but the Spliterator is simpler and all the code is already there.
>>
>> I am not sold on adding a default stream method as is done in the java.util.Collection API:
>>
>>     default IntStream stream() {
>>         return StreamSupport.intStream(spliterator(), false);
>>     }
>>
>> What do you think? I hope this is not going full circle back to an earlier API which was previously discussed. My aim is to make the API as efficient as possible and so starting with Spliterator rather than Iterator seems to be a better approach.
>>>
>>> On Mon, Mar 2, 2020 at 6:02 PM Alex Herbert <alex.d.herb...@gmail.com> wrote:
>>>
>>>> On 02/03/2020 16:12, Claude Warren wrote:
>>>>> Does getCounts() return a snapshot of the values when the call was made, or does it return values that may be updated during the retrieval? If there are 2 threads (one reading counts and one doing a merge) it seems to me that the "iterate over the data without constructing objects" approach means that the data may not be internally consistent. But then we don't really check that issue in the other methods either. So let's go with the fail-fast BitCountConsumer approach.
>>>> My assumption on this is that no filter is safe for concurrent use. Allowing a merge while reading the state is a concurrency issue. I don't think we want to go down the route of having fail-fast concurrency checking using a 'modified count' strategy, i.e.
>>>> all operations save the current 'modified count' at the start and increment it at the end; if it changes before the end then there has been concurrent modification. It would be simpler to state that no filters are designed to be used concurrently.
>>>>
>>>>> On the StaticHasher it is used for two purposes:
>>>>>
>>>>> 1) getting a list of enabled bits.
>>>>> 2) creating a bloom filter from a list of enabled bits. There was a request early on for the ability to test membership in a Bloom filter without having to construct a bloom filter. Basically, the developer has a list of values and wants to check membership.
>>>> I think the issue is that for 1 and 2 there are a few scenarios with different requirements since the list of enabled bits can be small or big.
>>>>
>>>> One feature of the StaticHasher is the requirement to remove duplicates and return a sorted order. This allows for an efficient storage form factor with the intention that it may be around for a while. But it is an overhead when creating a large one from a known set of non-duplicate bits, i.e. a Bloom filter.
>>>>
>>>> For (2) if you want to do a dynamic check to test membership then you use BloomFilter.contains(Hasher). I would pass an on-the-fly Hasher for this and not construct the entire set of indices into a StaticHasher anyway.
>>>>
>>>> This would be done using the DynamicHasher to build hashes on-the-fly, which may have duplicates. For the use case where you already know the bit indices but do not want to create a BloomFilter or a StaticHasher then perhaps we need another Hasher that is a wrapper around an int[] that allows duplicates.
>>>>
>>>>     // Returns an unordered sequence of values that may contain duplicates.
>>>>     static Hasher wrap(final int[] values, final Shape shape) {
>>>>         Objects.requireNonNull(values);
>>>>         Objects.requireNonNull(shape);
>>>>         return new Hasher() {
>>>>             @Override
>>>>             public OfInt getBits(Shape sameShape) {
>>>>                 if (!shape.equals(sameShape)) {
>>>>                     throw new IllegalArgumentException();
>>>>                 }
>>>>                 return Arrays.stream(values).iterator();
>>>>             }
>>>>
>>>>             @Override
>>>>             public HashFunctionIdentity getHashFunctionIdentity() {
>>>>                 return shape.getHashFunctionIdentity();
>>>>             }
>>>>
>>>>             @Override
>>>>             public boolean isEmpty() {
>>>>                 return values.length == 0;
>>>>             }
>>>>         };
>>>>     }
>>>>
>>>>> It might make more sense to reimplement the StaticHasher using a long[] internally (like the BloomFilter.getBits() method returns). We have the shape so we know the maximum length and we could trim it after loading to remove trailing zero value longs. This would remove the necessity of converting the iterator into a set of some sort. We can do the limit checks quickly as we turn bits on.
>>>> For small numbers of indices the long[] format is likely to be less space efficient.
>>>>> Makes me think we might need to implement StandardBloomFilter to use long[] as well.
>>>> That is just the BitSetBloomFilter. So if you use this format in the StaticHasher then you have no advantage over actually creating a BitSetBloomFilter.
>>>>
>>>> The StaticHasher is returned by the BloomFilter interface. You also have a method to return the bits using the long[] format. So there is redundancy here.
>>>> To drop the redundancy I think you separate the uses for the canonical representation of the entire filter as a long[] and the uses for the set bits. If the latter is satisfied by the StaticHasher then it should be replaced by an iterator instead.
>>>>
>>>> Looking at the code the getHasher() method in BloomFilter is only used in HasherBloomFilter. This is to add more enabled bits to the current set it contains. It seems that this use case is better served by using a Set as the backing storage.
>>>>
>>>> This leaves the StaticHasher with no usage in the current library. It is not as fast as wrapping a provided array. It offers no expansion of storage for use in the HasherBloomFilter. It does however provide the smallest form for a compact BloomFilter representation. So it could be renamed to ImmutableHasher.
>>>>
>>>> I would think it best to consolidate all of this:
>>>>
>>>> - StaticHasher to stay as an immutable instance of sorted bit indices. The intention is an efficient compact form for a few bits of a BloomFilter that can be merged efficiently. It could be used with pre-existing indices but would have to eliminate duplicates and sort them.
>>>> - Remove the redundant StaticHasher constructor accepting a StaticHasher instance since they are immutable.
>>>> - Possibly add a method to StaticHasher to merge another StaticHasher to create a new instance.
>>>> - Possibly add a method to StaticHasher to get the maximum index in the set (it already has size()). This helps convert it to a long[] or any other class that will call getBits() and use the iterator output.
>>>>
>>>> - BloomFilter to remove the getHasher() method. The long[] getBits() method to be documented as returning the canonical format of the filter. The getHasher() method is just duplicating this information in another format.
>>>> - BloomFilter to have an iterator() method to get a PrimitiveIterator.OfInt. This replaces the getHasher() method which just provided access to the Shape and the PrimitiveIterator.OfInt of the Hasher. BloomFilter already has getShape() so it just needs an iterator.
>>>>
>>>> - HasherBloomFilter to use the iterator() method when merging a generic BloomFilter. It can detect instances of HasherBloomFilter and switch to a merge operation of its internal representation.
>>>> - HasherBloomFilter to use different backing storage (e.g. a Set<Integer>) or adapt StaticHasher to be more efficient for merge.
>>>>
>>>> - Add a new class akin to StaticHasher that allows duplicates. Document it as a fast way to create a few bits of a BloomFilter.
>>>>
>>>> You then have:
>>>>
>>>> Hasher
>>>>
>>>> - StaticHasher - immutable set of ordered non-duplicate bits. Can be merged with another instance.
>>>>
>>>> - DynamicHasher - generates bits on the fly using a hashing function
>>>>
>>>> - FixedHasher - an array of indices. Can wrap an input array or be created using constructors similar to StaticHasher but without the guarantees of non-duplicates.
>>>>
>>>> Lots to think about there.
>>>>
>>>>> Claude
>>>>>
>>>>> On Mon, Mar 2, 2020 at 1:12 PM Alex Herbert <alex.d.herb...@gmail.com> wrote:
>>>>>
>>>>>> On 02/03/2020 11:32, Claude Warren wrote:
>>>>>>> my thought on changing the BloomFilter.merge() to return a boolean is along the lines you had: return true on successful merge (even if there are no changes in the enabled bits).
>>>>>>> And yes, for most cases the standard Bloom filter will return true, but the change is really to support extensions to the Bloom filter so I think it is reasonable.
>>>>>>>
>>>>>>> As for the getCounts(). Suppose we split it into two methods:
>>>>>>>
>>>>>>>     // return the indexes that are enabled. This is equivalent to
>>>>>>>     // CountingBloomFilter.getHasher().getBits( CountingBloomFilter.getShape() );
>>>>>>>     PrimitiveIterator.OfInt getIndexes()
>>>>>>>     // get the count for the specific index.
>>>>>>>     int getCount( int index );
>>>>>>>
>>>>>>> With these methods it becomes possible to construct an iterator of int[] or Map.Entry<Integer,Integer> or whatever else the developer wants.
>>>>>> Having to call getCount() for every index is likely to be slower than all the garbage collection involved with iterating a disposable <index, count> pair. For an array backed storage the access will be order(1). But other storage may be a lot slower. This method ideally should traverse the storage once and make each <index, count> pair available. Ideally this would be without creating each <index, count> pair as a disposable object.
>>>>>>
>>>>>> Following on from previously, another disadvantage of the consumer approach is the lack of fast fail. You can remedy this by adding a boolean return from the consumer to indicate if you want to consume more items:
>>>>>>
>>>>>>     interface BitCountConsumer {
>>>>>>         boolean accept(int index, int count);
>>>>>>     }
>>>>>>
>>>>>> A CountingBloomFilter implementation then has something like:
>>>>>>
>>>>>>     void getCounts(BitCountConsumer consumer) {
>>>>>>         while (hasNext()) {
>>>>>>             next();
>>>>>>             if (!consumer.accept(index(), count())) {
>>>>>>                 break;
>>>>>>             }
>>>>>>         }
>>>>>>     }
>>>>>>
>>>>>> Looking at the getHasher().getBits(...) idea I noticed that StaticHasher is constructed using:
>>>>>>
>>>>>>     Iterator<Integer> iter
>>>>>>
>>>>>> not:
>>>>>>
>>>>>>     PrimitiveIterator.OfInt
>>>>>>
>>>>>> I'd prefer the primitive specialisation in this constructor. It would break the HasherBloomFilter merge function but that can be fixed. However it may be redundant. The StaticHasher has some oddities with the available constructors:
>>>>>>
>>>>>>     public StaticHasher(final StaticHasher hasher, final Shape shape)
>>>>>>
>>>>>> - why have a shape parameter? The constructor checks the shape is the same as that for the hasher. So why not just a plain copy constructor:
>>>>>>
>>>>>>     public StaticHasher(final StaticHasher hasher)
>>>>>>
>>>>>> This constructor is only used in the unit tests. Given the StaticHasher is immutable then a copy constructor seems odd. I think it should be dropped.
>>>>>>
>>>>>> Here is a constructor that is missing IMO:
>>>>>>
>>>>>>     public StaticHasher(final Set<Integer> indices, final Shape shape)
>>>>>>
>>>>>> - Could be used to generically create a static hasher. The indices are ensured to be unique by the Set but should be checked to be within the shape.
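(As a rough sketch, that constructor's body could be as simple as the following; the field names and the Shape.getNumberOfBits() accessor are assumptions for illustration, not the actual implementation.)

    public StaticHasher(final Set<Integer> indices, final Shape shape) {
        this.shape = shape;                    // assumed field
        this.values = new int[indices.size()]; // assumed field
        int i = 0;
        for (final int index : indices) {
            if (index < 0 || index >= shape.getNumberOfBits()) {
                throw new IllegalArgumentException("Index not in range for the shape: " + index);
            }
            values[i++] = index;
        }
        // The Set guarantees uniqueness but not order; sort to preserve
        // the class invariant of sorted, non-duplicate indices.
        Arrays.sort(values);
    }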
>>>>>> With these constructors the public StaticHasher(final Iterator<Integer> iter, final Shape shape) is only used by the BitSetBloomFilter:
>>>>>>
>>>>>>     public StaticHasher getHasher() {
>>>>>>         return new StaticHasher(bitSet.stream().iterator(), getShape());
>>>>>>     }
>>>>>>
>>>>>> Becomes:
>>>>>>
>>>>>>     public StaticHasher getHasher() {
>>>>>>         return new StaticHasher(bitSet.stream().boxed().collect(Collectors.toCollection(TreeSet::new)), getShape());
>>>>>>     }
>>>>>>
>>>>>> This is a very roundabout way to construct a StaticHasher. Perhaps a constructor accepting a BitSet would be useful?
>>>>>>
>>>>>> So the API only requires:
>>>>>>
>>>>>>     public StaticHasher(final Hasher hasher, final Shape shape)
>>>>>>     public StaticHasher(final Set<Integer> indices, final Shape shape)
>>>>>>     // Maybe
>>>>>>     public StaticHasher(final BitSet indices, final Shape shape)
>>>>>>
>>>>>> I wondered about a merge function for the StaticHasher:
>>>>>>
>>>>>>     public StaticHasher merge(final StaticHasher other)
>>>>>>
>>>>>> The merge function can make some assumptions about the two arrays to merge as the class is final and the values are verified to be sorted and within the Shape. However there is not a use for it other than in the merge of two HasherBloomFilters, which is not a recommended operation anyway.
>>>>>>
>>>>>>> Claude
>>>>>>>
>>>>>>> On Mon, Mar 2, 2020 at 10:48 AM Alex Herbert <alex.d.herb...@gmail.com> wrote:
>>>>>>>
>>>>>>>> On 02/03/2020 09:38, Claude Warren wrote:
>>>>>>>>> It is not too late to update the BloomFilter interface to have the merge return a boolean. The incorrect Shape would still throw an exception, so the return value would only come into play if the bits could not be set. Thoughts?
>>>>>>>> I don't see the harm in it. But what would the return value be for?
>>>>>>>>
>>>>>>>> For a standard collection it would be if the collection was changed by the operation:
>>>>>>>>
>>>>>>>> Collection.add/remove return "true if this collection changed as a result of the call"
>>>>>>>>
>>>>>>>> So here is the equivalent:
>>>>>>>>
>>>>>>>> return "true if this filter was changed as a result of the call"
>>>>>>>>
>>>>>>>> This is computationally slow to track. It is also confusing to return false when the filter was successfully merged but no bits were changed, since the filter was actually incorporated. So it would go along the lines that we discussed for the counting Bloom filter:
>>>>>>>>
>>>>>>>> return "true if this filter was successfully merged as a result of the call"
>>>>>>>>
>>>>>>>> For most cases in the current library it would be true when an exception is not thrown. However the merge of the counting Bloom filter may have reason to return false, e.g. overflow.
>>>>>>>>
>>>>>>>>> On Mon, Mar 2, 2020 at 7:56 AM Claude Warren <cla...@xenei.com> wrote:
>>>>>>>>>> for the remove(), add(), and subtract() methods I agree that void is not correct and it should be boolean and be the same as the value you would get from calling isValid().
>>>>>>>>>>
>>>>>>>>>> You are correct the getCounts() should return an iterator of some type on int[], I don't know why I thought long[]. I am happy with a plain Iterator<int[]> as the return.
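(To make the object-churn trade-off concrete: consuming a plain Iterator<int[]> of <index, count> pairs would look like the following; every pair is a short-lived two-element array, which is the overhead discussed in the next paragraph. The getCounts() accessor is the one proposed above.)

    Iterator<int[]> counts = filter.getCounts(); // pairs of {index, count}
    while (counts.hasNext()) {
        final int[] pair = counts.next(); // a disposable two-element array
        final int index = pair[0];
        final int count = pair[1];
        // ... serialise the pair, or add it to another filter ...
    }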
>>>>>>>> For the getCounts() method I am still looking for a way around having to create an <index, count> pair for everything in the filter. An alternative to an iterator is to use the consumer idea. Given there is no primitive specialisation of BiConsumer<T, U> in JDK 8 functions we define our own:
>>>>>>>>
>>>>>>>>     interface BitCountConsumer {
>>>>>>>>         void accept(int index, int count);
>>>>>>>>     }
>>>>>>>>
>>>>>>>> The CountingBloomFilter then has:
>>>>>>>>
>>>>>>>>     void forEachCount(BitCountConsumer consumer);
>>>>>>>>     // Or
>>>>>>>>     void getCounts(BitCountConsumer consumer);
>>>>>>>>
>>>>>>>> You can then directly pass the counts from the backing storage to the destination.
>>>>>>>>
>>>>>>>> Advantages:
>>>>>>>> - No object creation
>>>>>>>> Disadvantages:
>>>>>>>> - The counts cannot be streamed
>>>>>>>>
>>>>>>>> An alternative is to provide an Iterator of long with the index and count packed into a long with methods to extract them:
>>>>>>>>
>>>>>>>>     PrimitiveIterator.OfLong getCounts();
>>>>>>>>
>>>>>>>>     static int getIndex(long indexCountPair) {
>>>>>>>>         return (int) (indexCountPair >>> 32);
>>>>>>>>     }
>>>>>>>>
>>>>>>>>     static int getCount(long indexCountPair) {
>>>>>>>>         return (int) indexCountPair;
>>>>>>>>     }
>>>>>>>>
>>>>>>>> This will almost certainly be a cause for bugs/issues from users.
>>>>>>>>
>>>>>>>> I believe that the counts will be used for 2 main use cases:
>>>>>>>>
>>>>>>>> 1. Storage
>>>>>>>>
>>>>>>>> 2. Adding to another counting Bloom filter
>>>>>>>>
>>>>>>>> Both cases are likely to be done serially and not in parallel. So providing a consumer based API to receive the counts would work.
>>>>>>>>
>>>>>>>> WDYT?
>>>>>>>>
>>>>>>>>>> Claude
>>>>>>>>>>
>>>>>>>>>> On Mon, Mar 2, 2020 at 1:02 AM Alex Herbert <alex.d.herb...@gmail.com> wrote:
>>>>>>>>>>
>>>>>>>>>>>> On 1 Mar 2020, at 15:39, Claude Warren <cla...@xenei.com> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>> I think the CountingBloomFilter interface needs to extend BloomFilter.
>>>>>>>>>>> I said that but did not write it, sorry.
>>>>>>>>>>>
>>>>>>>>>>>> I think I am confused.
>>>>>>>>>>>>
>>>>>>>>>>>> I would expect CountingBloomFilter to have
>>>>>>>>>>>>
>>>>>>>>>>>> interface CountingBloomFilter extends BloomFilter {
>>>>>>>>>>>>     // these 2 methods are the reverse of merge()
>>>>>>>>>>>>     void remove( BloomFilter );
>>>>>>>>>>>>     void remove( Hasher );
>>>>>>>>>>> Fine. Same intention but different method names. But why void? It forces you to check if the remove was valid with a second call. On the plus side it matches the void merge(…) methods and in most cases a user would not care to check anyway. If they are controlling the filter then leave it to them to make sure they do not remove something they did not add.
>>>>>>>>>>>
>>>>>>>>>>>>     // these 2 methods are the addition/subtraction of counts
>>>>>>>>>>>>     void add( CountingBloomFilter )
>>>>>>>>>>>>     void subtract( CountingBloomFilter );
>>>>>>>>>>> Fine. But same comment as above with the void return.
>>>>>>>>>>>
>>>>>>>>>>>>     // 2 methods to retrieve data
>>>>>>>>>>>>     Stream<long[]> getCounts();
>>>>>>>>>>> I don’t like this use of long[].
>>>>>>>>>>> In my previous post I argued that if you were to ever want to store more than max integer items in a filter then the Bloom filter would have more bit indices than max integer. So we never have to support long counts. A filter that exceeds max integer for a count is highly likely to be saturated and no use as a filter anyway.
>>>>>>>>>>>
>>>>>>>>>>> For most backing implementations the object type of the stream will be different so you will have to write a Spliterator<T> implementation or wrap some iterator anyway. So why not return the Spliterator:
>>>>>>>>>>>
>>>>>>>>>>>     Spliterator<int[]> getCounts();
>>>>>>>>>>>
>>>>>>>>>>> Since the backing implementation will likely not store int[] pairs then this will have a lot of object creation and garbage collection overhead to go through the counts. This does not seem to be a big concern here if the purpose is the same as for the BloomFilter long[] getBits(), i.e. to get a canonical representation for storage.
>>>>>>>>>>>
>>>>>>>>>>> Note: The Spliterator may not have a known size (number of non-zero bits) at creation, for example if the counts are stored in a fixed size array. Thus efficient parallel traversal by binary tree splitting is limited by how evenly the counts are distributed. For a backing implementation using a collection then the size should be known. In this case a Spliterator would be of more use than a plain Iterator. You can convert one to the other using:
>>>>>>>>>>>
>>>>>>>>>>>     java.util.Spliterators:
>>>>>>>>>>>     public static <T> Iterator<T> iterator(Spliterator<? extends T> spliterator)
>>>>>>>>>>>     public static <T> Spliterator<T> spliterator(Iterator<? extends T> iterator,
>>>>>>>>>>>                                                  long size,
>>>>>>>>>>>                                                  int characteristics)
>>>>>>>>>>>
>>>>>>>>>>> So which to choose for the API?
>>>>>>>>>>>
>>>>>>>>>>> The Hasher currently uses an Iterator:
>>>>>>>>>>>
>>>>>>>>>>>     PrimitiveIterator.OfInt getBits(Shape shape);
>>>>>>>>>>>
>>>>>>>>>>> In the case of a StaticHasher this could return a spliterator. But the DynamicHasher needs a reworking of the internal Iterator class. It could be a Spliterator to use the new IntConsumer API but in most (all) cases splitting would not be possible for dynamic hashing as the parts are produced in order. It is likely that they will be consumed sequentially too.
>>>>>>>>>>>
>>>>>>>>>>> I would suggest that Spliterator is the more modern implementation, despite not always being applicable to parallelisation in a stream. Currently the iterator from the Hasher is used in forEachRemaining() and while loops in approximately equal measure. The while loops are for a fast exit and would be uglier if rewritten for a Spliterator.tryAdvance(IntConsumer) syntax.
>>>>>>>>>>>
>>>>>>>>>>> There is a use of the IteratorChain in HasherBloomFilter that would need a rethink for spliterators.
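(To illustrate that fast-exit point: the same hypothetical contains-style check written in both styles, with a BitSet standing in for the filter storage; the spliterator form has to carry the early-exit flag out of the consumer.)

    import java.util.BitSet;
    import java.util.PrimitiveIterator;
    import java.util.Spliterator;

    class FastExitDemo {
        // Iterator style: a plain early return.
        static boolean contains(PrimitiveIterator.OfInt bits, BitSet bitSet) {
            while (bits.hasNext()) {
                if (!bitSet.get(bits.nextInt())) {
                    return false; // fast exit on the first missing bit
                }
            }
            return true;
        }

        // Spliterator style: tryAdvance() only reports exhaustion, so the
        // early-exit flag must be captured by the IntConsumer.
        static boolean contains(Spliterator.OfInt bits, BitSet bitSet) {
            final boolean[] missing = {false};
            while (!missing[0] && bits.tryAdvance((int i) -> missing[0] = !bitSet.get(i))) {
                // loop until a bit is missing or the elements are exhausted
            }
            return !missing[0];
        }
    }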
>>>>>>>>>>> The path of least resistance is to use Iterator<int[]> for the API of CountingBloomFilter to be consistent with Hasher’s use of Iterator.
>>>>>>>>>>>
>>>>>>>>>>> WDYT?
>>>>>>>>>>>
>>>>>>>>>>>>     boolean isValid()
>>>>>>>>>>> Fine. Allows some level of feedback that the counts are corrupt.
>>>>>>>>>>>
>>>>>>>>>>>> }
>>>>>>>>>>>>
>>>>>>>>>>>> Claude
>>>>>>>>>>>>
>>>>>>>>>>>> On Sun, Mar 1, 2020 at 2:48 PM Alex Herbert <alex.d.herb...@gmail.com> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>>> On 1 Mar 2020, at 09:28, Claude Warren <cla...@xenei.com> wrote:
>>>>>>>>>>>>>> The idea of a backing array is fine and the only problem I see with it is in very large filters (on the order of 10^8 bits and larger) but document the size calculation and let the developer worry about it.
>>>>>>>>>>>>> Let us look at the use case where we max out the array. Using the Bloom filter calculator:
>>>>>>>>>>>>>
>>>>>>>>>>>>> n = 149,363,281
>>>>>>>>>>>>> p = 0.001000025 (1 in 1000)
>>>>>>>>>>>>> m = 2147483647 (256MiB)
>>>>>>>>>>>>> k = 10
>>>>>>>>>>>>>
>>>>>>>>>>>>> n = 74,681,641
>>>>>>>>>>>>> p = 0.000001 (1 in 999950)
>>>>>>>>>>>>> m = 2147483647 (256MiB)
>>>>>>>>>>>>> k = 20
>>>>>>>>>>>>>
>>>>>>>>>>>>> n = 49,787,761
>>>>>>>>>>>>> p = 0.000000001 (1 in 999924899)
>>>>>>>>>>>>> m = 2147483647 (256MiB)
>>>>>>>>>>>>> k = 30
>>>>>>>>>>>>>
>>>>>>>>>>>>> So you will be able to put somewhere in the order of 10^8 or 10^7 items into the filter. I would say that anyone putting more than that into the filter has an unusual use case. The CountingBloomFilter can throw an exception if m is too large and will throw an OutOfMemoryError if you cannot allocate an array large enough.
>>>>>>>>>>>>>
>>>>>>>>>>>>> One clear point here is that you cannot use a short as a 16-bit count would easily overflow. So you have to use an integer array for the counts. A maximum length int[] is roughly 8GB.
>>>>>>>>>>>>>
>>>>>>>>>>>>> What would another implementation cost in terms of memory? The TreeMap<Integer, Integer> was the most space efficient. In the previous e-mail the saturation of a Bloom filter's bits was approximately 50% when at the intended capacity. So we have to estimate the size of a TreeMap containing Integer.MAX_VALUE/2 indices ~ 2^30. The memory test shows the TreeMap memory scales linearly with entries:
>>>>>>>>>>>>>
>>>>>>>>>>>>>  32768 /  65536 (0.500) : TreeMap<Integer, Integer> = 1834061 bytes
>>>>>>>>>>>>>  65536 / 131072 (0.500) : TreeMap<Integer, Integer> = 3669080 bytes
>>>>>>>>>>>>> 131072 / 262144 (0.500) : TreeMap<Integer, Integer> = 7339090 bytes
>>>>>>>>>>>>>
>>>>>>>>>>>>> So what is the memory for a TreeMap with 2^30 indices? I make it about:
>>>>>>>>>>>>>
>>>>>>>>>>>>> (2^30 / 131,072) * 7,339,090 bytes ~ 6e10 bytes = 55.99 GB
>>>>>>>>>>>>>
>>>>>>>>>>>>> I would say that this amount of RAM is unusual. It is definitely not as efficient as using an array.
>>>>>>>>>>>>> So very large counting Bloom filters are going to require some thought as to the hardware they run on. This may not be the case in 10 years' time.
>>>>>>>>>>>>>
>>>>>>>>>>>>> I would say that we try an int[] backing array for the storage implementation and document its limitations. A different implementation could be provided in future if required.
>>>>>>>>>>>>>
>>>>>>>>>>>>> This could be done by making CountingBloomFilter an interface that extends BloomFilter with the methods:
>>>>>>>>>>>>>
>>>>>>>>>>>>>     subtract(BloomFilter filter)
>>>>>>>>>>>>>     subtract(Hasher filter)
>>>>>>>>>>>>>
>>>>>>>>>>>>> These will negate the effect of the corresponding merge(BloomFilter) operation.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Do we also need access to the counts and add/subtract of another CountingBloomFilter?:
>>>>>>>>>>>>>
>>>>>>>>>>>>>     add(CountingBloomFilter filter);
>>>>>>>>>>>>>     subtract(CountingBloomFilter filter);
>>>>>>>>>>>>>
>>>>>>>>>>>>>     Iterator<int[]> getCounts();
>>>>>>>>>>>>>     int getSize(); // Number of items added
>>>>>>>>>>>>>
>>>>>>>>>>>>> The CountingBloomFilter is then an interface that defines how to reverse the merge of some bits into the filter.
>>>>>>>>>>>>>
>>>>>>>>>>>>> My concern is the inefficiency of the creation of objects in any method that provides access to the counts (e.g. above using an iterator as for Hasher.getBits()). I presume this method would be to allow some type of storage/serialisation of the filter akin to the long[] getBits() method of BloomFilter. So it may be better done using a method:
>>>>>>>>>>>>>
>>>>>>>>>>>>>     int getCount(int index);
>>>>>>>>>>>>>
>>>>>>>>>>>>> The caller can then use long[] getBits() to get the indices set in the filter and then for each non-zero bit index call getCount(index). Or just not include the method as the counts are only of concern when storing the filter. This functionality is cleaner pushed into an implementation.
>>>>>>>>>>>>>
>>>>>>>>>>>>> In a previous post we discussed whether to throw an exception on overflow/underflow or raise an invalid flag. Using the invalid flag idea the interface would be:
>>>>>>>>>>>>>
>>>>>>>>>>>>>     interface CountingBloomFilter {
>>>>>>>>>>>>>         int add(CountingBloomFilter filter);
>>>>>>>>>>>>>         int subtract(BloomFilter filter);
>>>>>>>>>>>>>         int subtract(Hasher filter);
>>>>>>>>>>>>>         int subtract(CountingBloomFilter filter);
>>>>>>>>>>>>>         int getStatus();
>>>>>>>>>>>>>
>>>>>>>>>>>>>         // Maybe
>>>>>>>>>>>>>         int getSize();
>>>>>>>>>>>>>         int getCount(int index);
>>>>>>>>>>>>>     }
>>>>>>>>>>>>>
>>>>>>>>>>>>> The status would be negative if any operation overflowed/underflowed, or zero if OK. The current status is returned by the add/subtract methods.
>>>>>>>>>>>>>
>>>>>>>>>>>>> However I note that overflow may not be a concern.
>>>>>>>>>>>>> The number of items you would have to add to a filter to create an overflow implies a filter with a number of bits that is unrealistic to store in memory:
>>>>>>>>>>>>>
>>>>>>>>>>>>> n = 2147483647
>>>>>>>>>>>>> p = 0.001000025 (1 in 1000)
>>>>>>>>>>>>> m = 30875634182 (3.59GiB)
>>>>>>>>>>>>> k = 10
>>>>>>>>>>>>>
>>>>>>>>>>>>> If you want to add 2 billion items (and overflow an integer count) then your filter would be so big it would break the rest of the API that uses a 32-bit int for the bit index.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Thus only underflow is a realistic concern. This could be documented as handled in an implementation specific manner (i.e. throw or ignore). The API is then simplified to:
>>>>>>>>>>>>>
>>>>>>>>>>>>>     interface CountingBloomFilter {
>>>>>>>>>>>>>         boolean add(CountingBloomFilter filter);
>>>>>>>>>>>>>         boolean subtract(BloomFilter filter);
>>>>>>>>>>>>>         boolean subtract(Hasher filter);
>>>>>>>>>>>>>         boolean subtract(CountingBloomFilter filter);
>>>>>>>>>>>>>         int getStatus();
>>>>>>>>>>>>>
>>>>>>>>>>>>>         // Maybe
>>>>>>>>>>>>>         int getSize();
>>>>>>>>>>>>>         int getCount(int index);
>>>>>>>>>>>>>     }
>>>>>>>>>>>>>
>>>>>>>>>>>>> The boolean is used to state that add/subtract did not over/underflow. Implementations can throw if they require it.
>>>>>>>>>>>>>
>>>>>>>>>>>>> The question then becomes what does getSize() represent if an add/subtract did not execute cleanly. Under this scheme it would be the number of (add - subtract) operations. The status flag would be used to indicate if the size is valid, or any of the counts from getCount(). The simpler API is to not allow access to counts/size or adding/subtracting counts:
>>>>>>>>>>>>>
>>>>>>>>>>>>>     interface CountingBloomFilter {
>>>>>>>>>>>>>         boolean subtract(BloomFilter filter);
>>>>>>>>>>>>>         boolean subtract(Hasher filter);
>>>>>>>>>>>>>         int getStatus();
>>>>>>>>>>>>>         // Or something like ...
>>>>>>>>>>>>>         boolean isValid();
>>>>>>>>>>>>>     }
>>>>>>>>>>>>>
>>>>>>>>>>>>> This filter is then only concerned with reversing the merge of Bloom filters with a valid status flag to indicate that the current state is consistent (i.e. all filters have been cleanly merged/subtracted).
>>>>>>>>>>>>>
>>>>>>>>>>>>> WDYT?
>>>>>>>>>>>>>
>>>>>>>>>>>>>> As for the merge question. merge is a standard bloom filter operation. It is well defined in the literature. Merging a bloom filter into a counting bloom filter means incrementing the bit counts. I think that merge/remove should continue to operate as though the parameter were a standard bloom filter.
>>>>>>>>>>>>>
>>>>>>>>>>>>> OK. So the count is to represent the number of filters that had a bit set at that index. This makes it more clear.
>>>>>>>>>>>>>
>>>>>>>>>>>>>> We had spoken of implementing an add/delete method pair that would operate on CountingBloomFilters and would add/subtract the counts (e.g.
>>>>>>>>>>>>>> add(CountingBloomFilter) and subtract(CountingBloomFilter)).
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I disagree with your proposal for the merge(Hasher) implementation, and I am not certain that an add(Hasher) makes sense. First consider that the Hasher returns the bits that are to be enabled in the Bloom filter so collisions are expected. In the normal course of events a Hasher is used to create a normal Bloom filter where all the duplicates are removed. That filter is then merged into a CountingBloomFilter. So in some sense the Hasher and the normal Bloom filter are the same. So I would expect the merge of the Hasher and the merge of the normal Bloom filter created from that hasher into a CountingBloomFilter to yield the same result. If you wanted to add an add(Hasher)/delete(Hasher) pair of functions to a CountingBloomFilter you could implement them with duplicate counting, but I am not certain of the validity of such a count and I fear that it muddies the waters with respect to what the CountingBloomFilter is counting.
>>>>>>>>>>>>> Agreed.
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Claude
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Sat, Feb 29, 2020 at 2:13 PM Alex Herbert <alex.d.herb...@gmail.com> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On 29 Feb 2020, at 07:46, Claude Warren <cla...@xenei.com> wrote:
>>>>>>>>>>>>>>>> Alex would you take a look at pull request 131 [1]. It adds a new hasher implementation and makes the HashFunctionValidator available for public use.
>>>>>>>>>>>>>>>> [1] https://github.com/apache/commons-collections/pull/131
>>>>>>>>>>>>>>> OK. I’ll take a look.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I’ve been thinking about the counting Bloom filter and the backing storage. In summary:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> 1. The backing storage should be a fixed array.
>>>>>>>>>>>>>>> 2. Merging a Hasher should count duplicate indices, not flatten them all to a single count.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> For background I’ve used the standard formulas to estimate the number of indices that will be non-zero in a Bloom filter.
>>>>>>>>>>>>>>> The Wikipedia page gives this formula for the expected proportion of bits set to 0 (E(q)) if you have inserted i elements into a filter of size m using k hash functions:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> E(q) = (1 - 1/m)^(ki)
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> So a rough guess of the proportion of indices (bits) used by a filter is 1 - E(q).
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Here is a table of Bloom filters with different collision probabilities and the proportion of bits that will be set when 1%, 10%, 100% of the capacity of the filter has been met:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> n     p     m      k  i     E(q)   bits
>>>>>>>>>>>>>>> 1000  1E-04 19171  13 10    0.9932 0.0068
>>>>>>>>>>>>>>> 1000  1E-04 19171  13 100   0.9344 0.0656
>>>>>>>>>>>>>>> 1000  1E-04 19171  13 1000  0.5076 0.4924
>>>>>>>>>>>>>>> 1000  1E-05 23963  17 10    0.9929 0.0071
>>>>>>>>>>>>>>> 1000  1E-05 23963  17 100   0.9315 0.0685
>>>>>>>>>>>>>>> 1000  1E-05 23963  17 1000  0.4919 0.5081
>>>>>>>>>>>>>>> 1000  1E-06 28756  20 10    0.9931 0.0069
>>>>>>>>>>>>>>> 1000  1E-06 28756  20 100   0.9328 0.0672
>>>>>>>>>>>>>>> 1000  1E-06 28756  20 1000  0.4988 0.5012
>>>>>>>>>>>>>>> 10000 1E-06 287552 20 100   0.9931 0.0069
>>>>>>>>>>>>>>> 10000 1E-06 287552 20 1000  0.9328 0.0672
>>>>>>>>>>>>>>> 10000 1E-06 287552 20 10000 0.4988 0.5012
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> The point is that if you create a Bloom filter and fill it to 10% of the intended capacity the number of indices used will be about 6-7% of the filter bits.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> So how to store the counts? Currently the counting bloom filter uses a TreeMap<Integer, Integer>. I tried:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> TreeMap<Integer, Integer>
>>>>>>>>>>>>>>> HashMap<Integer, Integer>
>>>>>>>>>>>>>>> TreeSet<MutableCount>
>>>>>>>>>>>>>>> HashSet<MutableCount>
>>>>>>>>>>>>>>> int[]
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> The MutableCount is a custom object that stores the bit index, uses it for a hash code, and then has a mutable integer count field. It allows the count to be incremented/decremented if the object is in the set:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>     static final class MutableCount implements Comparable<MutableCount> {
>>>>>>>>>>>>>>>         final int key;
>>>>>>>>>>>>>>>         int count;
>>>>>>>>>>>>>>>         // etc
>>>>>>>>>>>>>>>     }
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> This is adapted from the Bag<T> collection which stores an item count with a MutableInteger. Here the mutable count is part of the same object T inserted into the Set. So you can find the object, change the count and not have to put it back into the set. This is more efficient than changing the Integer stored in a Map.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I’ve estimated memory usage using an idea based on this article from JavaWorld: Java Tip 130: Do you know your data size? [1].
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Basically you:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> - create an object and throw it away. All classes are then initialised.
>>>>>>>>>>>>>>> - Then you free memory (run garbage collection) and get the current memory usage
>>>>>>>>>>>>>>> - Then create a lot of your object (held in an array)
>>>>>>>>>>>>>>> - Then measure memory usage again
>>>>>>>>>>>>>>> - memory = (after - before) / count
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Here is some example output for n bits set in size m:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> 13107 / 262144 (0.050) : TreeMap<Integer, Integer> =  733947 bytes
>>>>>>>>>>>>>>> 26214 / 262144 (0.100) : TreeMap<Integer, Integer> = 1467866 bytes
>>>>>>>>>>>>>>> 13107 / 262144 (0.050) : TreeSet<MutableCount>     =  838928 bytes
>>>>>>>>>>>>>>> 26214 / 262144 (0.100) : TreeSet<MutableCount>     = 1677776 bytes
>>>>>>>>>>>>>>> 13107 / 262144 (0.050) : HashMap<Integer, Integer> = 1677712 bytes
>>>>>>>>>>>>>>> 26214 / 262144 (0.100) : HashMap<Integer, Integer> = 2306739 bytes
>>>>>>>>>>>>>>> 13107 / 262144 (0.050) : HashSet<MutableCount>     = 1782664 bytes
>>>>>>>>>>>>>>> 26214 / 262144 (0.100) : HashSet<MutableCount>     = 2516656 bytes
>>>>>>>>>>>>>>>     0 / 262144 (0.000) : int[]                     = 1048608 bytes
>>>>>>>>>>>>>>>     0 / 262144 (0.000) : short[]                   =  524320 bytes
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> The estimate is accurate to 0.0001% for the arrays so the method is working. The HashMap was created with the capacity set to the expected capacity of the filter (m).
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I’ve chosen these sizes because at 5% full a HashSet is less memory efficient than using a fixed size array, and at 10% the TreeSet is also less efficient.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Note that the java.util.Tree/HashSet versions just wrap a Map and insert a dummy object for all keys in the Map. So here a Set is not as efficient as a Map because in the Map test I always inserted the same Integer object representing 1. This would be the same as using a Set with an Integer key but here the Set had to contain the MutableCount which has an extra int field and is larger than an Integer.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> These data lead me to think that a counting Bloom filter should just use a fixed size backing array:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> - If created using the same Shape as a standard Bloom filter it uses a fixed size. This has a high memory cost when the filter is empty but when it exceeds 10% of the intended capacity it is more efficient than dynamic backing storage.
>>>>>>>>>>>>>>> - All operations with another filter with n indices will run in order(n) time. Each individual index count in the filter will have order(1) time for access/update. Performance will be limited by the memory cache of the entire array.
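(For the fixed array option, the core update operations reduce to something like this sketch, combined with the invalid flag idea from earlier in the thread; the class and method names are illustrative only.)

    // Sketch of a fixed-size int[] backing store: one slot per bit index.
    class ArrayCounts {
        private final int[] counts;
        private boolean valid = true; // cleared on overflow/underflow

        ArrayCounts(int numberOfBits) {
            counts = new int[numberOfBits];
        }

        void increment(int index) {
            if (++counts[index] < 0) { // wrapped past Integer.MAX_VALUE
                valid = false;
            }
        }

        void decrement(int index) {
            if (--counts[index] < 0) { // removed more than was added
                valid = false;
            }
        }

        boolean isValid() {
            return valid;
        }
    }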
>>>>>>>>>>>>>>> The issue is that a counting Bloom filter with a very low number of inserted items will be memory inefficient. But under what circumstance will such a filter be used in a short-term lifecycle? If it is simply to merge into another filter then this can be done using a merge with a Hasher. If counts are to be merged then perhaps we provide a method to merge counts using the same data structure returned by the CountingBloomFilter getCounts() method, e.g. using a stream of <index, count> pairs:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>     Stream<int[]> getCounts();
>>>>>>>>>>>>>>>     void add(Stream<int[]> counts);
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> The issue here is the Shape and HashFunctionIdentity of the origin of the merge cannot be validated. So just leave it out and use the merge with a Hasher.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Thus the next issue with the counting Bloom filter implementation. Currently when it merges with a Hasher it puts all the indices into a Set and so will only increment the count by 1 for each index identified by the Hasher. This appears to miss the entire point of the counting Bloom filter. If I hash an object to generate k indices I would hope that I do get k indices. But the hash may not be perfect and I may get [1, k] indices with some duplications. This is part of the signature of that object with the given hash. So surely a counting Bloom filter should accommodate this. If my Hasher generates the same index 20 times this should result in the count of that index incrementing by 20.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> The result is that if an object is added directly to a counting Bloom filter using a Hasher it will have a different result than if it is added to a standard Bloom filter and then that filter is added to the counting Bloom filter.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Opinions on this?
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Alex
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> [1] http://www.javaworld.com/javaworld/javatips/jw-javatip130.html
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On Tue, Feb 25, 2020 at 12:35 AM Alex Herbert <alex.d.herb...@gmail.com> wrote:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> I have created a PR that contains most of the changes discussed in this thread (see [1]).
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Please review the changes and comment here or on GitHub.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> I have left the CountingBloomFilter alone. I will reimplement the add/subtract functionality as discussed into another PR.
>>>>>>>>>>>>>>>>> Alex
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> [1] https://github.com/apache/commons-collections/pull/133
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> --
>>>>>>>>>>>>>>>> I like: Like Like - The likeliest place on the web
>>>>>>>>>>>>>>>> <http://like-like.xenei.com>
>>>>>>>>>>>>>>>> LinkedIn: http://www.linkedin.com/in/claudewarren
>
> --
> I like: Like Like - The likeliest place on the web
> <http://like-like.xenei.com>
> LinkedIn: http://www.linkedin.com/in/claudewarren