> On 3 Mar 2020, at 22:31, Claude Warren <cla...@xenei.com> wrote:
>
> take a look at
> https://github.com/apache/commons-collections/pull/131/files#diff-8b2bf046dc35c88908eef196937173e1
>
> This is a different Hasher with a smaller data footprint than DynamicHasher. Like the DynamicHasher it builds the values on the fly. Can Spliterators be implemented for them easily?
Yes. You use the same functions, just all in one go:

    public boolean tryAdvance(IntConsumer action) {
        if (hasNext()) {
            action.accept(nextInt());
            return true;
        }
        return false;
    }

You can convert between the two using java.util.Spliterators:

    public static PrimitiveIterator.OfInt iterator(Spliterator.OfInt spliterator)
    public static Spliterator.OfInt spliteratorUnknownSize(PrimitiveIterator.OfInt iterator, int characteristics)

The conversion of an iterator to a spliterator performed by this class behaves exactly like the code above.

> I am torn between the interesting uses of a Spliterator and the well-known usage pattern for an iterator. Perhaps we should implement both. It seems to me (without really looking) that we should be able to implement an Iterator that uses the Spliterator.
>
> Thoughts?

I think you could create a class that implements both at the same time, but most (all?) of the JDK collections have dedicated implementations of each. From what I have seen in the Bloom filter classes I think the Spliterator can be more efficient, and will not be less efficient, than an Iterator for traversing the filter/hasher elements.

> On Tue, Mar 3, 2020 at 11:54 AM Alex Herbert <alex.d.herb...@gmail.com> wrote:
>> On 02/03/2020 22:34, Claude Warren wrote:
>>> So what we have then is:
>>>
>>> public interface BloomFilter {
>>>     int andCardinality(BloomFilter other);
>>>     int cardinality();
>>>     boolean contains(BloomFilter other);
>>>     boolean contains(Hasher hasher);
>>>     long[] getBits();
>>>     // Change
>>>     PrimitiveIterator.OfInt iterator();
>>>     Shape getShape();
>>>     // Change
>>>     boolean merge(BloomFilter other);
>>>     // Change
>>>     boolean merge(Hasher hasher);
>>>     int orCardinality(BloomFilter other);
>>>     int xorCardinality(BloomFilter other);
>>> }
>>>
>>> public interface BitCountConsumer {
>>>     boolean accept(int index, int count);
>>> }
>>>
>>> public interface CountingBloomFilter extends BloomFilter {
>>>     // Change
>>>     boolean remove(BloomFilter other);
>>>     // Change
>>>     boolean remove(Hasher hasher);
>>>     // Change
>>>     boolean add(CountingBloomFilter other);
>>>     // Change
>>>     boolean subtract(CountingBloomFilter other);
>>>     // Change
>>>     void getCounts(BitCountConsumer consumer);
>>> }
>>>
>>> public final class StaticHasher implements Hasher {
>>>     public StaticHasher(final Hasher hasher, final Shape shape) { ... }
>>>     public StaticHasher(final Iterator<Integer> iter, final Shape shape) { ... }
>>>     public StaticHasher(final Collection<Integer> collection, final Shape shape) { ... }
>>>     public Shape getShape() { ... }
>>>     public int size() { ... }
>>>     // Change
>>>     public int max() { ... }
>>> }
>>>
>>> public final class FixedHasher implements Hasher {
>>>     public FixedHasher(final Hasher hasher, final Shape shape) { ... }
>>>     public FixedHasher(final Iterator<Integer> iter, final Shape shape) { ... }
>>>     public FixedHasher(final Collection<Integer> collection, final Shape shape) { ... }
>>>     public Shape getShape() { ... }
>>>     public int size() { ... }
>>>     public int max() { ... }
>>> }
>>>
>>> I have a pull request in to add a CachingHasher.
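(Back to the conversion methods above: a minimal, self-contained example of the round trip. The index values are arbitrary.)

    import java.util.Arrays;
    import java.util.PrimitiveIterator;
    import java.util.Spliterator;
    import java.util.Spliterators;

    public class ConversionExample {
        public static void main(String[] args) {
            // A primitive iterator over some example bit indices.
            PrimitiveIterator.OfInt iterator = Arrays.stream(new int[] {2, 3, 5, 7}).iterator();
            // Wrap it as a spliterator; its tryAdvance() delegates to
            // hasNext()/nextInt() exactly as in the snippet above.
            Spliterator.OfInt spliterator = Spliterators.spliteratorUnknownSize(iterator,
                Spliterator.ORDERED | Spliterator.DISTINCT | Spliterator.NONNULL);
            // And back again: an iterator view of the spliterator.
            PrimitiveIterator.OfInt roundTrip = Spliterators.iterator(spliterator);
            roundTrip.forEachRemaining((int i) -> System.out.println(i));
        }
    }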
>>> While we are at it, I think we should change Hasher.getBits(Shape) to Hasher.iterator(Shape) to match BloomFilter.iterator() and to limit confusion between Hasher.getBits(Shape) (which returns an iterator) and BloomFilter.getBits() (which returns long[]).
>>>
>>> The interface changes look good to me. I agree with your implementation ideas.
>>>
>>> I do have one use case where I need the BloomFilter.iterator() results in reverse order (I am estimating Log2 of the filter), but I can do that with the methods here. I don't think any of these changes significantly impacts the uses that I currently have.
>>>
>>> Shall we move forward?
>>>
>>> Claude

>> I finished a first draft of the CountingBloomFilter. I still have the tests to add for the new API.
>>
>> When working with this I had to write an Iterator (to create a Hasher for the soon to be changed API) which has very few lines of code, most of which are repeats due to the separation of hasNext() and nextInt() and the requirement to be able to call hasNext() repeatedly without calling nextInt(). This mandates storing the next value rather than a current position to search for a non-zero index:
>>
>>     class IndexIterator implements PrimitiveIterator.OfInt {
>>         int next;
>>
>>         IndexIterator() {
>>             advance(0);
>>         }
>>
>>         private void advance(int from) {
>>             next = from;
>>             while (next < counts.length && counts[next] == 0) {
>>                 next++;
>>             }
>>         }
>>
>>         @Override
>>         public boolean hasNext() {
>>             return next < counts.length;
>>         }
>>
>>         @Override
>>         public int nextInt() {
>>             if (hasNext()) {
>>                 final int result = next;
>>                 advance(next + 1);
>>                 return result;
>>             }
>>             throw new NoSuchElementException();
>>         }
>>     }
>>
>> This is noted in the Spliterator javadoc:
>>
>>  * <p>Spliterators, like {@code Iterator}s, are for traversing the elements of
>>  * a source. The {@code Spliterator} API was designed to support efficient
>>  * parallel traversal in addition to sequential traversal, by supporting
>>  * decomposition as well as single-element iteration. In addition, the
>>  * protocol for accessing elements via a Spliterator is designed to impose
>>  * smaller per-element overhead than {@code Iterator}, and to avoid the inherent
>>  * race involved in having separate methods for {@code hasNext()} and
>>  * {@code next()}.
>>
>> The same code written as a Spliterator has a much simpler tryAdvance method with only a single comparison of the current index to the upper limit per access of each element.
>>
>>     class IndexSpliterator implements Spliterator.OfInt {
>>         int index;
>>
>>         @Override
>>         public long estimateSize() {
>>             return Long.MAX_VALUE;
>>         }
>>
>>         @Override
>>         public int characteristics() {
>>             return Spliterator.SORTED | Spliterator.ORDERED |
>>                    Spliterator.DISTINCT | Spliterator.NONNULL;
>>         }
>>
>>         @Override
>>         public OfInt trySplit() {
>>             return null;
>>         }
>>
>>         @Override
>>         public boolean tryAdvance(IntConsumer action) {
>>             while (index < counts.length) {
>>                 if (counts[index++] != 0) {
>>                     // Emit the index of the non-zero count.
>>                     action.accept(index - 1);
>>                     return true;
>>                 }
>>             }
>>             return false;
>>         }
>>     }
>>
>> Or with splitting allowed (because we can).
>> The tryAdvance() is the same, just with a different end point:
>>
>>     class IndexSpliterator implements Spliterator.OfInt {
>>         int index;
>>         final int end;
>>
>>         IndexSpliterator() {
>>             this(0, counts.length);
>>         }
>>
>>         IndexSpliterator(int start, int end) {
>>             index = start;
>>             this.end = end;
>>         }
>>
>>         @Override
>>         public long estimateSize() {
>>             // This is an approximation.
>>             // A Bloom filter that is at capacity will have approximately
>>             // half of the bits set.
>>             return (end - index) / 2;
>>         }
>>
>>         @Override
>>         public int characteristics() {
>>             return Spliterator.SORTED | Spliterator.ORDERED |
>>                    Spliterator.DISTINCT | Spliterator.NONNULL;
>>         }
>>
>>         @Override
>>         public OfInt trySplit() {
>>             final int middle = (index + end) >>> 1;
>>             if (middle > index) {
>>                 final int start = index;
>>                 index = middle;
>>                 // Hand the lower half [start, middle) to the new spliterator.
>>                 return new IndexSpliterator(start, middle);
>>             }
>>             return null;
>>         }
>>
>>         @Override
>>         public boolean tryAdvance(IntConsumer action) {
>>             while (index < end) {
>>                 if (counts[index++] != 0) {
>>                     // Emit the index of the non-zero count.
>>                     action.accept(index - 1);
>>                     return true;
>>                 }
>>             }
>>             return false;
>>         }
>>     }
>>
>> What do you think to changing the use of Iterator to Spliterator for efficiency purposes?
>>
>> This would add:
>>
>>     BloomFilter {
>>         Spliterator.OfInt spliterator();
>>     }
>>
>>     Hasher {
>>         Spliterator.OfInt spliterator(Shape shape);
>>     }
>>
>>     CountingBloomFilter extends BloomFilter {
>>         // ???
>>         Spliterator<int[]> countSpliterator();
>>     }
>>
>> The final one is for completeness so you can use the streams API with the counts. I don't really like it. I'd prefer to pack the index and count into a long as stated previously to maximise efficiency. My suggestion would be to not have a spliterator for the counts and leave the access of <index, count> pairs to an explicit forEach consumer.
>>
>> Most of the current classes use a collection that provides a spliterator so the change is easy. The DynamicHasher Iterator would have to be rewritten but the Spliterator is simpler and all the code is already there.
>>
>> I am not sold on adding a default stream method as is done in the java.util.Collection API:
>>
>>     default IntStream stream() {
>>         return StreamSupport.intStream(spliterator(), false);
>>     }
>>
>> What do you think? I hope this is not going full circle back to an earlier API which was previously discussed. My aim is to make the API as efficient as possible and so starting with Spliterator rather than Iterator seems to be a better approach.
>>>
>>> On Mon, Mar 2, 2020 at 6:02 PM Alex Herbert <alex.d.herb...@gmail.com> wrote:
>>>
>>>> On 02/03/2020 16:12, Claude Warren wrote:
>>>>> Does getCounts() return a snapshot of the values when the call was made, or does it return values that may be updated during the retrieval? If there are 2 threads (one reading counts and one doing a merge) it seems to me that the "iterate over the data without constructing objects" approach means that the data may not be internally consistent. But then we don't really check that issue in the other methods either. So let's go with the fail-fast BitCountConsumer approach.
>>>> My assumption on this is that no filter is safe for concurrent use. Allowing a merge while reading the state is a concurrency issue. I don't think we want to go down the route of having fail-fast concurrency checking using a 'modified count' strategy, i.e.
>>>> all operations save the current 'modified count' at the start and increment it at the end; if it changes before the end then there has been concurrent modification. It would be simpler to state that no filters are designed to be used concurrently.
>>>>
>>>>> On the StaticHasher it is used for two purposes:
>>>>>
>>>>> 1) getting a list of enabled bits.
>>>>> 2) creating a bloom filter from a list of enabled bits. There was a request early on for the ability to test membership in a Bloom filter without having to construct a bloom filter. Basically, the developer has a list of values and wants to check membership.
>>>> I think the issue is that for 1 and 2 there are a few scenarios with different requirements since the list of enabled bits can be small or big.
>>>>
>>>> One feature of the StaticHasher is the requirement to remove duplicates and return a sorted order. This allows for an efficient storage form factor with the intention that it may be around for a while. But it is an overhead when creating a large one from a known set of non-duplicate bits, i.e. a Bloom filter.
>>>>
>>>> For (2) if you want to do a dynamic check to test membership then you use BloomFilter.contains(Hasher). I would pass an on-the-fly Hasher for this and not construct the entire set of indices into a StaticHasher anyway.
>>>>
>>>> This would be done using the DynamicHasher to build hashes on-the-fly, which may have duplicates. For the use case where you already know the bit indices but do not want to create a BloomFilter or a StaticHasher then perhaps we need another Hasher that is a wrapper around an int[] that allows duplicates.
>>>>
>>>>     // Returns an unordered sequence of values that may contain duplicates.
>>>>     static Hasher wrap(final int[] values, final Shape shape) {
>>>>         Objects.requireNonNull(values);
>>>>         Objects.requireNonNull(shape);
>>>>         return new Hasher() {
>>>>             @Override
>>>>             public OfInt getBits(Shape sameShape) {
>>>>                 if (!shape.equals(sameShape)) {
>>>>                     throw new IllegalArgumentException();
>>>>                 }
>>>>                 return Arrays.stream(values).iterator();
>>>>             }
>>>>
>>>>             @Override
>>>>             public HashFunctionIdentity getHashFunctionIdentity() {
>>>>                 return shape.getHashFunctionIdentity();
>>>>             }
>>>>
>>>>             @Override
>>>>             public boolean isEmpty() {
>>>>                 return values.length == 0;
>>>>             }
>>>>         };
>>>>     }
>>>>
>>>>> It might make more sense to reimplement the StaticHasher using a long[] internally (like the BloomFilter.getBits() method returns). We have the shape so we know the maximum length and we could trim it after loading to remove trailing zero value longs. This would remove the necessity of converting the iterator into a set of some sort. We can do the limit checks quickly as we turn bits on.
>>>> For small numbers of indices the long[] format is likely to be less space efficient.
>>>>> Makes me think we might need to implement StandardBloomFilter to use long[] as well.
>>>> That is just the BitSetBloomFilter. So if you use this format in the StaticHasher then you have no advantage over actually creating a BitSetBloomFilter.
>>>>
>>>> The StaticHasher is returned by the BloomFilter interface. You also have a method to return the bits using the long[] format. So there is redundancy here.
>>>> To drop the redundancy I think you separate the uses for the canonical representation of the entire filter as a long[] and the uses for the set bits. If the latter is satisfied by the StaticHasher then it should be replaced by an iterator instead.
>>>>
>>>> Looking at the code the getHasher() method in BloomFilter is only used in HasherBloomFilter. This is to add more enabled bits to the current set it contains. It seems that this use case is better served by using a Set as the backing storage.
>>>>
>>>> This leaves the StaticHasher with no usage in the current library. It is not as fast as wrapping a provided array. It offers no expansion of storage for use in the HasherBloomFilter. It does however provide the smallest form for a compact BloomFilter representation. So it could be renamed to ImmutableHasher.
>>>>
>>>> I would think it best to consolidate all of this:
>>>>
>>>> - StaticHasher to stay as an immutable instance of sorted bit indices. The intention is an efficient compact form for a few bits of a BloomFilter that can be merged efficiently. It could be used with pre-existing indices but would have to eliminate duplicates and sort them.
>>>> - Remove the redundant StaticHasher constructor accepting a StaticHasher instance since they are immutable.
>>>> - Possibly add a method to StaticHasher to merge another StaticHasher to create a new instance.
>>>> - Possibly add a method to StaticHasher to get the maximum index in the set (it already has size()). This helps convert it to a long[] or any other class that will call getBits() and use the iterator output.
>>>>
>>>> - BloomFilter to remove the getHasher() method. The long[] getBits() method to be documented as returning the canonical format of the filter. The getHasher() method is just duplicating this information in another format.
>>>> - BloomFilter to have an iterator() method to get a PrimitiveIterator.OfInt. This replaces the getHasher() method which just provided access to the Shape and the PrimitiveIterator.OfInt of the Hasher. BloomFilter already has getShape() so it just needs an iterator.
>>>>
>>>> - HasherBloomFilter to use the iterator() method when merging a generic BloomFilter. It can detect instances of HasherBloomFilter and switch to a merge operation of its internal representation.
>>>> - HasherBloomFilter to use different backing storage (e.g. a Set<Integer>) or adapt StaticHasher to be more efficient for merge.
>>>>
>>>> - Add a new class akin to StaticHasher that allows duplicates. Document it as a fast way to create a few bits of a BloomFilter.
>>>>
>>>> You then have:
>>>>
>>>> Hasher
>>>>
>>>> - StaticHasher - immutable set of ordered non-duplicate bits. Can be merged with another instance.
>>>>
>>>> - DynamicHasher - generates bits on the fly using a hashing function
>>>>
>>>> - FixedHasher - an array of indices. Can wrap an input array or be created using constructors similar to StaticHasher but without the guarantees of non-duplicates.
>>>>
>>>> Lots to think about there.
>>>>
>>>>> Claude
>>>>>
>>>>> On Mon, Mar 2, 2020 at 1:12 PM Alex Herbert <alex.d.herb...@gmail.com> wrote:
>>>>>
>>>>>> On 02/03/2020 11:32, Claude Warren wrote:
>>>>>>> my thought on changing the BloomFilter.merge() to return a boolean is along the lines you had: return true on successful merge (even if there are no changes in the enabled bits).
>>>>>>> And yes, for most cases the standard Bloom filter will return true, but the change is really to support extensions to the Bloom filter so I think it is reasonable.
>>>>>>>
>>>>>>> As for the getCounts(). Suppose we split it into two methods:
>>>>>>>
>>>>>>>     // return the indexes that are enabled. This is equivalent to
>>>>>>>     // CountingBloomFilter.getHasher().getBits( CountingBloomFilter.getShape() );
>>>>>>>     PrimitiveIterator.OfInt getIndexes()
>>>>>>>     // get the count for the specific index.
>>>>>>>     int getCount( int index );
>>>>>>>
>>>>>>> With these methods it becomes possible to construct an iterator of int[] or Map.Entry<Integer,Integer> or whatever else the developer wants.
>>>>>> Having to call getCount() for every index is likely to be slower than all the garbage collection involved with iterating a disposable <index, count> pair. For an array backed storage the access will be order(1). But other storage may be a lot slower. This method ideally should traverse the storage once and make each <index, count> pair available. Ideally this would be without creating each <index, count> pair as a disposable object.
>>>>>>
>>>>>> Following on from previously, another disadvantage of the consumer approach is the lack of fast fail. You can remedy this by adding a boolean return from the consumer to indicate if you want to consume more items:
>>>>>>
>>>>>>     interface BitCountConsumer {
>>>>>>         boolean accept(int index, int count);
>>>>>>     }
>>>>>>
>>>>>> A CountingBloomFilter implementation then has something like:
>>>>>>
>>>>>>     void getCounts(BitCountConsumer consumer) {
>>>>>>         while (hasNext()) {
>>>>>>             next();
>>>>>>             if (!consumer.accept(index(), count())) {
>>>>>>                 break;
>>>>>>             }
>>>>>>         }
>>>>>>     }
>>>>>>
>>>>>> Looking at the getHasher().getBits(...) idea I noticed that StaticHasher is constructed using:
>>>>>>
>>>>>>     Iterator<Integer> iter
>>>>>>
>>>>>> not:
>>>>>>
>>>>>>     PrimitiveIterator.OfInt
>>>>>>
>>>>>> I'd prefer the primitive specialisation in this constructor. It would break the HasherBloomFilter merge function but that can be fixed. However it may be redundant. The StaticHasher has some oddities with the available constructors:
>>>>>>
>>>>>>     public StaticHasher(final StaticHasher hasher, final Shape shape)
>>>>>>
>>>>>> - why have a shape parameter? The constructor checks the shape is the same as that for the hasher. So why not just a plain copy constructor:
>>>>>>
>>>>>>     public StaticHasher(final StaticHasher hasher)
>>>>>>
>>>>>> This constructor is only used in the unit tests. Given the StaticHasher is immutable then a copy constructor seems odd. I think it should be dropped.
>>>>>>
>>>>>> Here is a constructor that is missing IMO:
>>>>>>
>>>>>>     public StaticHasher(final Set<Integer> indices, final Shape shape)
>>>>>>
>>>>>> - Could be used to generically create a static hasher. The indices are ensured to be unique by the Set but should be checked to be within the shape.
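(As a rough sketch, that constructor's body could be as simple as the following; the field names and the Shape.getNumberOfBits() accessor are assumptions for illustration, not the actual implementation.)

    public StaticHasher(final Set<Integer> indices, final Shape shape) {
        this.shape = shape;                    // assumed field
        this.values = new int[indices.size()]; // assumed field
        int i = 0;
        for (final int index : indices) {
            if (index < 0 || index >= shape.getNumberOfBits()) {
                throw new IllegalArgumentException("Index not in range for the shape: " + index);
            }
            values[i++] = index;
        }
        // The Set guarantees uniqueness but not order; sort to preserve
        // the class invariant of sorted, non-duplicate indices.
        Arrays.sort(values);
    }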
>>>>>> With these constructors the public StaticHasher(final Iterator<Integer> iter, final Shape shape) is only used by the BitSetBloomFilter:
>>>>>>
>>>>>>     public StaticHasher getHasher() {
>>>>>>         return new StaticHasher(bitSet.stream().iterator(), getShape());
>>>>>>     }
>>>>>>
>>>>>> Becomes:
>>>>>>
>>>>>>     public StaticHasher getHasher() {
>>>>>>         return new StaticHasher(bitSet.stream().boxed().collect(Collectors.toCollection(TreeSet::new)), getShape());
>>>>>>     }
>>>>>>
>>>>>> This is a very roundabout way to construct a StaticHasher. Perhaps a constructor accepting a BitSet would be useful?
>>>>>>
>>>>>> So the API only requires:
>>>>>>
>>>>>>     public StaticHasher(final Hasher hasher, final Shape shape)
>>>>>>     public StaticHasher(final Set<Integer> indices, final Shape shape)
>>>>>>     // Maybe
>>>>>>     public StaticHasher(final BitSet indices, final Shape shape)
>>>>>>
>>>>>> I wondered about a merge function for the StaticHasher:
>>>>>>
>>>>>>     public StaticHasher merge(final StaticHasher other)
>>>>>>
>>>>>> The merge function can make some assumptions about the two arrays to merge as the class is final and the values are verified to be sorted and within the Shape. However there is not a use for it other than in the merge of two HasherBloomFilters, which is not a recommended operation anyway.
>>>>>>
>>>>>>> Claude
>>>>>>>
>>>>>>> On Mon, Mar 2, 2020 at 10:48 AM Alex Herbert <alex.d.herb...@gmail.com> wrote:
>>>>>>>
>>>>>>>> On 02/03/2020 09:38, Claude Warren wrote:
>>>>>>>>> It is not too late to update the BloomFilter interface to have the merge return a boolean. The incorrect Shape would still throw an exception, so the return value would only come into play if the bits could not be set. Thoughts?
>>>>>>>> I don't see the harm in it. But what would the return value be for?
>>>>>>>>
>>>>>>>> For a standard collection it would be if the collection was changed by the operation:
>>>>>>>>
>>>>>>>> Collection.add/remove return "true if this collection changed as a result of the call"
>>>>>>>>
>>>>>>>> So here is the equivalent:
>>>>>>>>
>>>>>>>> return "true if this filter was changed as a result of the call"
>>>>>>>>
>>>>>>>> This is computationally slow to track. It is also confusing to return false when the filter was successfully merged but no bits were changed, since the filter was actually incorporated. So it would go along the lines that we discussed for the counting Bloom filter:
>>>>>>>>
>>>>>>>> return "true if this filter was successfully merged as a result of the call"
>>>>>>>>
>>>>>>>> For most cases in the current library it would be true when an exception is not thrown. However the merge of the counting Bloom filter may have reason to return false, e.g. overflow.
>>>>>>>>
>>>>>>>>> On Mon, Mar 2, 2020 at 7:56 AM Claude Warren <cla...@xenei.com> wrote:
>>>>>>>>>> for the remove(), add(), and subtract() methods I agree that void is not correct and it should be boolean and be the same as the value you would get from calling isValid().
>>>>>>>>>>
>>>>>>>>>> You are correct the getCounts() should return an iterator of some type on int[], I don't know why I thought long[]. I am happy with a plain Iterator<int[]> as the return.
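(To make the object-churn trade-off concrete: consuming a plain Iterator<int[]> of <index, count> pairs would look like the following; every pair is a short-lived two-element array, which is the overhead discussed in the next paragraph. The getCounts() accessor is the one proposed above.)

    Iterator<int[]> counts = filter.getCounts(); // pairs of {index, count}
    while (counts.hasNext()) {
        final int[] pair = counts.next(); // a disposable two-element array
        final int index = pair[0];
        final int count = pair[1];
        // ... serialise the pair, or add it to another filter ...
    }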
>>>>>>>> For the getCounts() method I am still looking for a way around having to create an <index, count> pair for everything in the filter. An alternative to an iterator is to use the consumer idea. Given there is no primitive specialisation of BiConsumer<T, U> in JDK 8 functions we define our own:
>>>>>>>>
>>>>>>>>     interface BitCountConsumer {
>>>>>>>>         void accept(int index, int count);
>>>>>>>>     }
>>>>>>>>
>>>>>>>> The CountingBloomFilter then has:
>>>>>>>>
>>>>>>>>     void forEachCount(BitCountConsumer consumer);
>>>>>>>>     // Or
>>>>>>>>     void getCounts(BitCountConsumer consumer);
>>>>>>>>
>>>>>>>> You can then directly pass the counts from the backing storage to the destination.
>>>>>>>>
>>>>>>>> Advantages:
>>>>>>>> - No object creation
>>>>>>>> Disadvantages:
>>>>>>>> - The counts cannot be streamed
>>>>>>>>
>>>>>>>> An alternative is to provide an Iterator of long with the index and count packed into a long with methods to extract them:
>>>>>>>>
>>>>>>>>     PrimitiveIterator.OfLong getCounts();
>>>>>>>>
>>>>>>>>     static int getIndex(long indexCountPair) {
>>>>>>>>         return (int) (indexCountPair >>> 32);
>>>>>>>>     }
>>>>>>>>
>>>>>>>>     static int getCount(long indexCountPair) {
>>>>>>>>         return (int) indexCountPair;
>>>>>>>>     }
>>>>>>>>
>>>>>>>> This will almost certainly be a cause for bugs/issues from users.
>>>>>>>>
>>>>>>>> I believe that the counts will be used for 2 main use cases:
>>>>>>>>
>>>>>>>> 1. Storage
>>>>>>>>
>>>>>>>> 2. Adding to another counting Bloom filter
>>>>>>>>
>>>>>>>> Both cases are likely to be done serially and not in parallel. So providing a consumer based API to receive the counts would work.
>>>>>>>>
>>>>>>>> WDYT?
>>>>>>>>
>>>>>>>>>> Claude
>>>>>>>>>>
>>>>>>>>>> On Mon, Mar 2, 2020 at 1:02 AM Alex Herbert <alex.d.herb...@gmail.com> wrote:
>>>>>>>>>>
>>>>>>>>>>>> On 1 Mar 2020, at 15:39, Claude Warren <cla...@xenei.com> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>> I think the CountingBloomFilter interface needs to extend BloomFilter.
>>>>>>>>>>> I said that but did not write it, sorry.
>>>>>>>>>>>
>>>>>>>>>>>> I think I am confused.
>>>>>>>>>>>>
>>>>>>>>>>>> I would expect CountingBloomFilter to have
>>>>>>>>>>>>
>>>>>>>>>>>> interface CountingBloomFilter extends BloomFilter {
>>>>>>>>>>>>     // these 2 methods are the reverse of merge()
>>>>>>>>>>>>     void remove( BloomFilter );
>>>>>>>>>>>>     void remove( Hasher );
>>>>>>>>>>> Fine. Same intention but different method names. But why void? It forces you to check if the remove was valid with a second call. On the plus side it matches the void merge(…) methods and in most cases a user would not care to check anyway. If they are controlling the filter then leave it to them to make sure they do not remove something they did not add.
>>>>>>>>>>>
>>>>>>>>>>>>     // these 2 methods are the addition/subtraction of counts
>>>>>>>>>>>>     void add( CountingBloomFilter )
>>>>>>>>>>>>     void subtract( CountingBloomFilter );
>>>>>>>>>>> Fine. But same comment as above with the void return.
>>>>>>>>>>>
>>>>>>>>>>>>     // 2 methods to retrieve data
>>>>>>>>>>>>     Stream<long[]> getCounts();
>>>>>>>>>>> I don’t like this use of long[].
>>>>>>>>>>> In my previous post I argued that if you were to ever want to store more than max integer items in a filter then the Bloom filter would have more bit indices than max integer. So we never have to support long counts. A filter that exceeds max integer for a count is highly likely to be saturated and no use as a filter anyway.
>>>>>>>>>>>
>>>>>>>>>>> For most backing implementations the object type of the stream will be different so you will have to write a Spliterator<T> implementation or wrap some iterator anyway. So why not return the Spliterator:
>>>>>>>>>>>
>>>>>>>>>>>     Spliterator<int[]> getCounts();
>>>>>>>>>>>
>>>>>>>>>>> Since the backing implementation will likely not store int[] pairs then this will have a lot of object creation and garbage collection overhead to go through the counts. This does not seem to be a big concern here if the purpose is the same as for the BloomFilter long[] getBits(), i.e. to get a canonical representation for storage.
>>>>>>>>>>>
>>>>>>>>>>> Note: The Spliterator may not have a known size (number of non-zero bits) at creation, for example if the counts are stored in a fixed size array. Thus efficient parallel traversal by binary tree splitting is limited by how evenly the counts are distributed. For a backing implementation using a collection then the size should be known. In this case a Spliterator would be of more use than a plain Iterator. You can convert one to the other using:
>>>>>>>>>>>
>>>>>>>>>>>     java.util.Spliterators:
>>>>>>>>>>>     public static <T> Iterator<T> iterator(Spliterator<? extends T> spliterator)
>>>>>>>>>>>     public static <T> Spliterator<T> spliterator(Iterator<? extends T> iterator,
>>>>>>>>>>>                                                  long size,
>>>>>>>>>>>                                                  int characteristics)
>>>>>>>>>>>
>>>>>>>>>>> So which to choose for the API?
>>>>>>>>>>>
>>>>>>>>>>> The Hasher currently uses an Iterator:
>>>>>>>>>>>
>>>>>>>>>>>     PrimitiveIterator.OfInt getBits(Shape shape);
>>>>>>>>>>>
>>>>>>>>>>> In the case of a StaticHasher this could return a spliterator. But the DynamicHasher needs a reworking of the internal Iterator class. It could be a Spliterator to use the new IntConsumer API but in most (all) cases splitting would not be possible for dynamic hashing as the parts are produced in order. It is likely that they will be consumed sequentially too.
>>>>>>>>>>>
>>>>>>>>>>> I would suggest that Spliterator is the more modern implementation, despite not always being applicable to parallelisation in a stream. Currently the iterator from the Hasher is used in forEachRemaining() and while loops in approximately equal measure. The while loops are for a fast exit and would be uglier if rewritten for a Spliterator.tryAdvance(IntConsumer) syntax.
>>>>>>>>>>>
>>>>>>>>>>> There is a use of the IteratorChain in HasherBloomFilter that would need a rethink for spliterators.
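(To illustrate that fast-exit point: the same hypothetical contains-style check written in both styles, with a BitSet standing in for the filter storage; the spliterator form has to carry the early-exit flag out of the consumer.)

    import java.util.BitSet;
    import java.util.PrimitiveIterator;
    import java.util.Spliterator;

    class FastExitDemo {
        // Iterator style: a plain early return.
        static boolean contains(PrimitiveIterator.OfInt bits, BitSet bitSet) {
            while (bits.hasNext()) {
                if (!bitSet.get(bits.nextInt())) {
                    return false; // fast exit on the first missing bit
                }
            }
            return true;
        }

        // Spliterator style: tryAdvance() only reports exhaustion, so the
        // early-exit flag must be captured by the IntConsumer.
        static boolean contains(Spliterator.OfInt bits, BitSet bitSet) {
            final boolean[] missing = {false};
            while (!missing[0] && bits.tryAdvance((int i) -> missing[0] = !bitSet.get(i))) {
                // loop until a bit is missing or the elements are exhausted
            }
            return !missing[0];
        }
    }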
>>>>>>>>>>> The path of least resistance is to use Iterator<int[]> for the API of CountingBloomFilter to be consistent with Hasher’s use of Iterator.
>>>>>>>>>>>
>>>>>>>>>>> WDYT?
>>>>>>>>>>>
>>>>>>>>>>>>     boolean isValid()
>>>>>>>>>>> Fine. Allows some level of feedback that the counts are corrupt.
>>>>>>>>>>>
>>>>>>>>>>>> }
>>>>>>>>>>>>
>>>>>>>>>>>> Claude
>>>>>>>>>>>>
>>>>>>>>>>>> On Sun, Mar 1, 2020 at 2:48 PM Alex Herbert <alex.d.herb...@gmail.com> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>>> On 1 Mar 2020, at 09:28, Claude Warren <cla...@xenei.com> wrote:
>>>>>>>>>>>>>> The idea of a backing array is fine and the only problem I see with it is in very large filters (on the order of 10^8 bits and larger) but document the size calculation and let the developer worry about it.
>>>>>>>>>>>>> Let us look at the use case where we max out the array. Using the Bloom filter calculator:
>>>>>>>>>>>>>
>>>>>>>>>>>>> n = 149,363,281
>>>>>>>>>>>>> p = 0.001000025 (1 in 1000)
>>>>>>>>>>>>> m = 2147483647 (256MiB)
>>>>>>>>>>>>> k = 10
>>>>>>>>>>>>>
>>>>>>>>>>>>> n = 74,681,641
>>>>>>>>>>>>> p = 0.000001 (1 in 999950)
>>>>>>>>>>>>> m = 2147483647 (256MiB)
>>>>>>>>>>>>> k = 20
>>>>>>>>>>>>>
>>>>>>>>>>>>> n = 49,787,761
>>>>>>>>>>>>> p = 0.000000001 (1 in 999924899)
>>>>>>>>>>>>> m = 2147483647 (256MiB)
>>>>>>>>>>>>> k = 30
>>>>>>>>>>>>>
>>>>>>>>>>>>> So you will be able to put somewhere in the order of 10^8 or 10^7 items into the filter. I would say that anyone putting more than that into the filter has an unusual use case. The CountingBloomFilter can throw an exception if m is too large and will throw an OutOfMemoryError if you cannot allocate an array large enough.
>>>>>>>>>>>>>
>>>>>>>>>>>>> One clear point here is that you cannot use a short as a 16-bit count would easily overflow. So you have to use an integer array for the counts. A maximum length int[] is roughly 8GB.
>>>>>>>>>>>>>
>>>>>>>>>>>>> What would another implementation cost in terms of memory? The TreeMap<Integer, Integer> was the most space efficient. In the previous e-mail the saturation of a Bloom filter's bits was approximately 50% when at the intended capacity. So we have to estimate the size of a TreeMap containing Integer.MAX_VALUE/2 indices ~ 2^30. The memory test shows the TreeMap memory scales linearly with entries:
>>>>>>>>>>>>>
>>>>>>>>>>>>>  32768 /  65536 (0.500) : TreeMap<Integer, Integer> = 1834061 bytes
>>>>>>>>>>>>>  65536 / 131072 (0.500) : TreeMap<Integer, Integer> = 3669080 bytes
>>>>>>>>>>>>> 131072 / 262144 (0.500) : TreeMap<Integer, Integer> = 7339090 bytes
>>>>>>>>>>>>>
>>>>>>>>>>>>> So what is the memory for a TreeMap with 2^30 indices? I make it about:
>>>>>>>>>>>>>
>>>>>>>>>>>>> (2^30 / 131,072) * 7,339,090 bytes ~ 6e10 bytes = 55.99 GB
>>>>>>>>>>>>>
>>>>>>>>>>>>> I would say that this amount of RAM is unusual. It is definitely not as efficient as using an array.
>>>>>>>>>>>>> So very large counting Bloom filters are going to require some thought as to the hardware they run on. This may not be the case in 10 years' time.
>>>>>>>>>>>>>
>>>>>>>>>>>>> I would say that we try an int[] backing array for the storage implementation and document its limitations. A different implementation could be provided in future if required.
>>>>>>>>>>>>>
>>>>>>>>>>>>> This could be done by making CountingBloomFilter an interface that extends BloomFilter with the methods:
>>>>>>>>>>>>>
>>>>>>>>>>>>>     subtract(BloomFilter filter)
>>>>>>>>>>>>>     subtract(Hasher filter)
>>>>>>>>>>>>>
>>>>>>>>>>>>> These will negate the effect of the corresponding merge(BloomFilter) operation.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Do we also need access to the counts and add/subtract of another CountingBloomFilter?:
>>>>>>>>>>>>>
>>>>>>>>>>>>>     add(CountingBloomFilter filter);
>>>>>>>>>>>>>     subtract(CountingBloomFilter filter);
>>>>>>>>>>>>>
>>>>>>>>>>>>>     Iterator<int[]> getCounts();
>>>>>>>>>>>>>     int getSize(); // Number of items added
>>>>>>>>>>>>>
>>>>>>>>>>>>> The CountingBloomFilter is then an interface that defines how to reverse the merge of some bits into the filter.
>>>>>>>>>>>>>
>>>>>>>>>>>>> My concern is the inefficiency of the creation of objects in any method that provides access to the counts (e.g. above using an iterator as for Hasher.getBits()). I presume this method would be to allow some type of storage/serialisation of the filter akin to the long[] getBits() method of BloomFilter. So it may be better done using a method:
>>>>>>>>>>>>>
>>>>>>>>>>>>>     int getCount(int index);
>>>>>>>>>>>>>
>>>>>>>>>>>>> The caller can then use long[] getBits() to get the indices set in the filter and then for each non-zero bit index call getCount(index). Or just not include the method as the counts are only of concern when storing the filter. This functionality is cleaner pushed into an implementation.
>>>>>>>>>>>>>
>>>>>>>>>>>>> In a previous post we discussed whether to throw an exception on overflow/underflow or raise an invalid flag. Using the invalid flag idea the interface would be:
>>>>>>>>>>>>>
>>>>>>>>>>>>>     interface CountingBloomFilter {
>>>>>>>>>>>>>         int add(CountingBloomFilter filter);
>>>>>>>>>>>>>         int subtract(BloomFilter filter);
>>>>>>>>>>>>>         int subtract(Hasher filter);
>>>>>>>>>>>>>         int subtract(CountingBloomFilter filter);
>>>>>>>>>>>>>         int getStatus();
>>>>>>>>>>>>>
>>>>>>>>>>>>>         // Maybe
>>>>>>>>>>>>>         int getSize();
>>>>>>>>>>>>>         int getCount(int index);
>>>>>>>>>>>>>     }
>>>>>>>>>>>>>
>>>>>>>>>>>>> The status would be negative if any operation overflowed/underflowed, or zero if OK. The current status is returned by the add/subtract methods.
>>>>>>>>>>>>>
>>>>>>>>>>>>> However I note that overflow may not be a concern.
>>>>>>>>>>>>> The number of items you would have to add to a filter to create an overflow implies a filter with a number of bits that is unrealistic to store in memory:
>>>>>>>>>>>>>
>>>>>>>>>>>>> n = 2147483647
>>>>>>>>>>>>> p = 0.001000025 (1 in 1000)
>>>>>>>>>>>>> m = 30875634182 (3.59GiB)
>>>>>>>>>>>>> k = 10
>>>>>>>>>>>>>
>>>>>>>>>>>>> If you want to add 2 billion items (and overflow an integer count) then your filter would be so big it would break the rest of the API that uses a 32-bit int for the bit index.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Thus only underflow is a realistic concern. This could be documented as handled in an implementation specific manner (i.e. throw or ignore). The API is then simplified to:
>>>>>>>>>>>>>
>>>>>>>>>>>>>     interface CountingBloomFilter {
>>>>>>>>>>>>>         boolean add(CountingBloomFilter filter);
>>>>>>>>>>>>>         boolean subtract(BloomFilter filter);
>>>>>>>>>>>>>         boolean subtract(Hasher filter);
>>>>>>>>>>>>>         boolean subtract(CountingBloomFilter filter);
>>>>>>>>>>>>>         int getStatus();
>>>>>>>>>>>>>
>>>>>>>>>>>>>         // Maybe
>>>>>>>>>>>>>         int getSize();
>>>>>>>>>>>>>         int getCount(int index);
>>>>>>>>>>>>>     }
>>>>>>>>>>>>>
>>>>>>>>>>>>> The boolean is used to state that add/subtract did not over/underflow. Implementations can throw if they require it.
>>>>>>>>>>>>>
>>>>>>>>>>>>> The question then becomes what does getSize() represent if an add/subtract did not execute cleanly. Under this scheme it would be the number of (add - subtract) operations. The status flag would be used to indicate if the size is valid, or any of the counts from getCount(). The simpler API is to not allow access to counts/size or adding/subtracting counts:
>>>>>>>>>>>>>
>>>>>>>>>>>>>     interface CountingBloomFilter {
>>>>>>>>>>>>>         boolean subtract(BloomFilter filter);
>>>>>>>>>>>>>         boolean subtract(Hasher filter);
>>>>>>>>>>>>>         int getStatus();
>>>>>>>>>>>>>         // Or something like ...
>>>>>>>>>>>>>         boolean isValid();
>>>>>>>>>>>>>     }
>>>>>>>>>>>>>
>>>>>>>>>>>>> This filter is then only concerned with reversing the merge of Bloom filters with a valid status flag to indicate that the current state is consistent (i.e. all filters have been cleanly merged/subtracted).
>>>>>>>>>>>>>
>>>>>>>>>>>>> WDYT?
>>>>>>>>>>>>>
>>>>>>>>>>>>>> As for the merge question. merge is a standard bloom filter operation. It is well defined in the literature. Merging a bloom filter into a counting bloom filter means incrementing the bit counts. I think that merge/remove should continue to operate as though the parameter were a standard bloom filter.
>>>>>>>>>>>>>
>>>>>>>>>>>>> OK. So the count is to represent the number of filters that had a bit set at that index. This makes it more clear.
>>>>>>>>>>>>>
>>>>>>>>>>>>>> We had spoken of implementing an add/delete method pair that would operate on CountingBloomFilters and would add/subtract the counts (e.g.
>>>>>>>>>>>>>> add(CountingBloomFilter) and subtract(CountingBloomFilter)).
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I disagree with your proposal for the merge(Hasher) implementation, and I am not certain that an add(Hasher) makes sense. First consider that the Hasher returns the bits that are to be enabled in the Bloom filter so collisions are expected. In the normal course of events a Hasher is used to create a normal Bloom filter where all the duplicates are removed. That filter is then merged into a CountingBloomFilter. So in some sense the Hasher and the normal Bloom filter are the same. So I would expect the merge of the Hasher and the merge of the normal Bloom filter created from that hasher into a CountingBloomFilter to yield the same result. If you wanted to add an add(Hasher)/delete(Hasher) pair of functions to a CountingBloomFilter you could implement them with duplicate counting, but I am not certain of the validity of such a count and I fear that it muddies the waters with respect to what the CountingBloomFilter is counting.
>>>>>>>>>>>>> Agreed.
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Claude
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Sat, Feb 29, 2020 at 2:13 PM Alex Herbert <alex.d.herb...@gmail.com> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On 29 Feb 2020, at 07:46, Claude Warren <cla...@xenei.com> wrote:
>>>>>>>>>>>>>>>> Alex would you take a look at pull request 131 [1]. It adds a new hasher implementation and makes the HashFunctionValidator available for public use.
>>>>>>>>>>>>>>>> [1] https://github.com/apache/commons-collections/pull/131
>>>>>>>>>>>>>>> OK. I’ll take a look.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I’ve been thinking about the counting Bloom filter and the backing storage. In summary:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> 1. The backing storage should be a fixed array.
>>>>>>>>>>>>>>> 2. Merging a Hasher should count duplicate indices, not flatten them all to a single count.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> For background I’ve used the standard formulas to estimate the number of indices that will be non-zero in a Bloom filter.
>>>>>>>>>>>>>>> The Wikipedia page gives this formula for the expected proportion of bits set to 0 (E(q)) if you have inserted i elements into a filter of size m using k hash functions:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> E(q) = (1 - 1/m)^(ki)
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> So a rough guess of the proportion of indices (bits) used by a filter is 1 - E(q).
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Here is a table of Bloom filters with different collision probabilities and the proportion of bits that will be set when 1%, 10%, 100% of the capacity of the filter has been met:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> n     p     m      k  i     E(q)   bits
>>>>>>>>>>>>>>> 1000  1E-04 19171  13 10    0.9932 0.0068
>>>>>>>>>>>>>>> 1000  1E-04 19171  13 100   0.9344 0.0656
>>>>>>>>>>>>>>> 1000  1E-04 19171  13 1000  0.5076 0.4924
>>>>>>>>>>>>>>> 1000  1E-05 23963  17 10    0.9929 0.0071
>>>>>>>>>>>>>>> 1000  1E-05 23963  17 100   0.9315 0.0685
>>>>>>>>>>>>>>> 1000  1E-05 23963  17 1000  0.4919 0.5081
>>>>>>>>>>>>>>> 1000  1E-06 28756  20 10    0.9931 0.0069
>>>>>>>>>>>>>>> 1000  1E-06 28756  20 100   0.9328 0.0672
>>>>>>>>>>>>>>> 1000  1E-06 28756  20 1000  0.4988 0.5012
>>>>>>>>>>>>>>> 10000 1E-06 287552 20 100   0.9931 0.0069
>>>>>>>>>>>>>>> 10000 1E-06 287552 20 1000  0.9328 0.0672
>>>>>>>>>>>>>>> 10000 1E-06 287552 20 10000 0.4988 0.5012
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> The point is that if you create a Bloom filter and fill it to 10% of the intended capacity the number of indices used will be about 6-7% of the filter bits.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> So how to store the counts? Currently the counting bloom filter uses a TreeMap<Integer, Integer>. I tried:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> TreeMap<Integer, Integer>
>>>>>>>>>>>>>>> HashMap<Integer, Integer>
>>>>>>>>>>>>>>> TreeSet<MutableCount>
>>>>>>>>>>>>>>> HashSet<MutableCount>
>>>>>>>>>>>>>>> int[]
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> The MutableCount is a custom object that stores the bit index, uses it for a hash code, and then has a mutable integer count field. It allows the count to be incremented/decremented if the object is in the set:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>     static final class MutableCount implements Comparable<MutableCount> {
>>>>>>>>>>>>>>>         final int key;
>>>>>>>>>>>>>>>         int count;
>>>>>>>>>>>>>>>         // etc
>>>>>>>>>>>>>>>     }
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> This is adapted from the Bag<T> collection which stores an item count with a MutableInteger. Here the mutable count is part of the same object T inserted into the Set. So you can find the object, change the count and not have to put it back into the set. This is more efficient than changing the Integer stored in a Map.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I’ve estimated memory usage using an idea based on this article from JavaWorld: Java Tip 130: Do you know your data size? [1].
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Basically you:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> - create an object and throw it away. All classes are then initialised.
>>>>>>>>>>>>>>> - Then you free memory (run garbage collection) and get the current memory usage
>>>>>>>>>>>>>>> - Then create a lot of your object (held in an array)
>>>>>>>>>>>>>>> - Then measure memory usage again
>>>>>>>>>>>>>>> - memory = (after - before) / count
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Here is some example output for n bits set in size m:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> 13107 / 262144 (0.050) : TreeMap<Integer, Integer> =  733947 bytes
>>>>>>>>>>>>>>> 26214 / 262144 (0.100) : TreeMap<Integer, Integer> = 1467866 bytes
>>>>>>>>>>>>>>> 13107 / 262144 (0.050) : TreeSet<MutableCount>     =  838928 bytes
>>>>>>>>>>>>>>> 26214 / 262144 (0.100) : TreeSet<MutableCount>     = 1677776 bytes
>>>>>>>>>>>>>>> 13107 / 262144 (0.050) : HashMap<Integer, Integer> = 1677712 bytes
>>>>>>>>>>>>>>> 26214 / 262144 (0.100) : HashMap<Integer, Integer> = 2306739 bytes
>>>>>>>>>>>>>>> 13107 / 262144 (0.050) : HashSet<MutableCount>     = 1782664 bytes
>>>>>>>>>>>>>>> 26214 / 262144 (0.100) : HashSet<MutableCount>     = 2516656 bytes
>>>>>>>>>>>>>>>     0 / 262144 (0.000) : int[]                     = 1048608 bytes
>>>>>>>>>>>>>>>     0 / 262144 (0.000) : short[]                   =  524320 bytes
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> The estimate is accurate to 0.0001% for the arrays so the method is working. The HashMap was created with the capacity set to the expected capacity of the filter (m).
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I’ve chosen these sizes because at 5% full a HashSet is less memory efficient than using a fixed size array, and at 10% the TreeSet is also less efficient.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Note that the java.util.Tree/HashSet versions just wrap a Map and insert a dummy object for all keys in the Map. So here a Set is not as efficient as a Map because in the Map test I always inserted the same Integer object representing 1. This would be the same as using a Set with an Integer key but here the Set had to contain the MutableCount which has an extra int field and is larger than an Integer.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> These data lead me to think that a counting Bloom filter should just use a fixed size backing array:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> - If created using the same Shape as a standard Bloom filter it uses a fixed size. This has a high memory cost when the filter is empty but when it exceeds 10% of the intended capacity it is more efficient than dynamic backing storage.
>>>>>>>>>>>>>>> - All operations with another filter with n indices will run in order(n) time. Each individual index count in the filter will have order(1) time for access/update. Performance will be limited by the memory cache of the entire array.
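(For the fixed array option, the core update operations reduce to something like this sketch, combined with the invalid flag idea from earlier in the thread; the class and method names are illustrative only.)

    // Sketch of a fixed-size int[] backing store: one slot per bit index.
    class ArrayCounts {
        private final int[] counts;
        private boolean valid = true; // cleared on overflow/underflow

        ArrayCounts(int numberOfBits) {
            counts = new int[numberOfBits];
        }

        void increment(int index) {
            if (++counts[index] < 0) { // wrapped past Integer.MAX_VALUE
                valid = false;
            }
        }

        void decrement(int index) {
            if (--counts[index] < 0) { // removed more than was added
                valid = false;
            }
        }

        boolean isValid() {
            return valid;
        }
    }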
>>>>>>>>>>>>>>> The issue is that a counting Bloom filter with a very low number of inserted items will be memory inefficient. But under what circumstance will such a filter be used in a short-term lifecycle? If it is simply to merge into another filter then this can be done using a merge with a Hasher. If counts are to be merged then perhaps we provide a method to merge counts using the same data structure returned by the CountingBloomFilter getCounts() method, e.g. using a stream of <index, count> pairs:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>     Stream<int[]> getCounts();
>>>>>>>>>>>>>>>     void add(Stream<int[]> counts);
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> The issue here is the Shape and HashFunctionIdentity of the origin of the merge cannot be validated. So just leave it out and use the merge with a Hasher.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Thus the next issue with the counting Bloom filter implementation. Currently when it merges with a Hasher it puts all the indices into a Set and so will only increment the count by 1 for each index identified by the Hasher. This appears to miss the entire point of the counting Bloom filter. If I hash an object to generate k indices I would hope that I do get k indices. But the hash may not be perfect and I may get [1, k] indices with some duplications. This is part of the signature of that object with the given hash. So surely a counting Bloom filter should accommodate this. If my Hasher generates the same index 20 times this should result in the count of that index incrementing by 20.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> The result is that if an object is added directly to a counting Bloom filter using a Hasher it will have a different result than if it is added to a standard Bloom filter and then that filter is added to the counting Bloom filter.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Opinions on this?
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Alex
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> [1] http://www.javaworld.com/javaworld/javatips/jw-javatip130.html
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On Tue, Feb 25, 2020 at 12:35 AM Alex Herbert <alex.d.herb...@gmail.com> wrote:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> I have created a PR that contains most of the changes discussed in this thread (see [1]).
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Please review the changes and comment here or on GitHub.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> I have left the CountingBloomFilter alone. I will reimplement the add/subtract functionality as discussed into another PR.
>>>>>>>>>>>>>>>>> Alex
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> [1] https://github.com/apache/commons-collections/pull/133
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> --
>>>>>>>>>>>>>>>> I like: Like Like - The likeliest place on the web
>>>>>>>>>>>>>>>> <http://like-like.xenei.com>
>>>>>>>>>>>>>>>> LinkedIn: http://www.linkedin.com/in/claudewarren
>
> --
> I like: Like Like - The likeliest place on the web
> <http://like-like.xenei.com>
> LinkedIn: http://www.linkedin.com/in/claudewarren