Hey, I am Manas, a 2nd-year computer engineering student. This is my first time applying to GSoC; could someone help me with a project idea?
On Mon, Mar 2, 2020 at 6:42 PM Alex Herbert <alex.d.herb...@gmail.com> wrote:
>
> On 02/03/2020 11:32, Claude Warren wrote:
> > my thought on changing the BloomFilter.merge() to return a boolean is
> > along the lines you had: return true on a successful merge (even if there
> > are no changes in the enabled bits). And yes, for most cases the standard
> > Bloom filter will return true, but the change is really to support
> > extensions to the Bloom filter, so I think it is reasonable.
> >
> > As for getCounts(): suppose we split it into two methods:
> >
> > // return the indexes that are enabled. This is equivalent to
> > // CountingBloomFilter.getHasher().getBits( CountingBloomFilter.getShape() );
> > Iterator<Integer> getIndexes();
> > // get the count for the specific index.
> > int getCount( int index );
> >
> > With these methods it becomes possible to construct an iterator of int[]
> > or Map.Entry<Integer,Integer> or whatever else the developer wants.
>
> Having to call getCount() for every index is likely to be slower than all
> the garbage collection involved in iterating a disposable <index, count>
> pair. For array-backed storage the access will be order(1), but other
> storage may be a lot slower. This method ideally should traverse the
> storage once and make each <index, count> pair available, ideally without
> creating each <index, count> pair as a disposable object.
>
> Following on from previously, another disadvantage of the consumer
> approach is the lack of fast fail. You can remedy this by adding a boolean
> return from the consumer to indicate whether you want to consume more items:
>
> interface BitCountConsumer {
>     boolean accept(int index, int count);
> }
>
> A CountingBloomFilter implementation then has something like:
>
> void getCounts(BitCountConsumer consumer) {
>     while (hasNext()) {
>         next();
>         if (!consumer.accept(index(), count())) {
>             break;
>         }
>     }
> }
>
> Looking at the getHasher().getBits(...) idea I noticed that StaticHasher
> is constructed using:
>
> Iterator<Integer> iter
>
> not:
>
> PrimitiveIterator.OfInt
>
> I'd prefer the primitive specialisation in this constructor. It would
> break the HasherBloomFilter merge function, but that can be fixed.
> However, it may be redundant. The StaticHasher has some oddities with the
> available constructors:
>
> public StaticHasher(final StaticHasher hasher, final Shape shape)
>
> - why have a shape parameter? The constructor checks the shape is the same
> as that of the hasher. So why not just a plain copy constructor:
>
> public StaticHasher(final StaticHasher hasher)
>
> This constructor is only used in the unit tests. Given the StaticHasher is
> immutable, a copy constructor seems odd. I think it should be dropped.
>
> Here is a constructor that is missing, IMO:
>
> public StaticHasher(final Set<Integer> indices, final Shape shape)
>
> - Could be used to generically create a static hasher. The indices are
> ensured to be unique by the Set but should be checked to be within the
> shape.
>
> With these constructors, the public StaticHasher(final Iterator<Integer>
> iter, final Shape shape) is only used by the BitSetBloomFilter:
>
> public StaticHasher getHasher() {
>     return new StaticHasher(bitSet.stream().iterator(), getShape());
> }
>
> Becomes:
>
> public StaticHasher getHasher() {
>     return new StaticHasher(
>         bitSet.stream().boxed().collect(Collectors.toCollection(TreeSet::new)),
>         getShape());
> }
>
> This is a very roundabout way to construct a StaticHasher.
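To see why it is roundabout, here is a small illustrative sketch (not library code): BitSet.stream() already yields the set bits in ascending order, and its iterator is a primitive int iterator, so boxing into a TreeSet only re-boxes and re-sorts values that are already unique and sorted.

    import java.util.BitSet;
    import java.util.PrimitiveIterator;

    // Walk the set bits of a BitSet directly via the primitive iterator.
    static void printSetBits(BitSet bitSet) {
        PrimitiveIterator.OfInt indices = bitSet.stream().iterator();
        while (indices.hasNext()) {
            System.out.println(indices.nextInt()); // ascending, no boxing needed
        }
    }
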
Perhaps a > constructor accepting a BitSet would be useful? > > So the API only requires: > > public StaticHasher(final Hasher hasher, final Shape shape) > public StaticHasher(final Set<Integer> indices, final Shape shape) > // Maybe > public StaticHasher(final BitSet indices, final Shape shape) > > > I wondered about a merge function for the StaticHasher: > > public StaticHasher merge(final StaticHasher other) > > The merge function can make some assumptions about the two arrays to > merge as the class is final and the values are verified to be sorted and > within the Shape. However there is not a use for it other than in the > merge of two HasherBloomFilters, which is not a recommended operation > anyway. > > > > Claude > > > > On Mon, Mar 2, 2020 at 10:48 AM Alex Herbert <alex.d.herb...@gmail.com> > > wrote: > > > >> On 02/03/2020 09:38, Claude Warren wrote: > >>> It is not too late to update the BloomFIlter interface to have the > merge > >>> return a boolean. The incorrect Shape would still throw an exception, > so > >>> the return value would only come into play if the bits could not be > set. > >>> > >>> thoughts? > >> I don't see the harm in it. But what would the return value be for? > >> > >> For a standard collection it would be if the collection was changed by > >> the operation: > >> > >> Collection.add/remove return "true if this collection changed as a > >> result of the call" > >> > >> So here is the equivalent: > >> > >> return "true if this filter was changed as a result of the call" > >> > >> This is computationally slow to track. It also is confusing if the > >> filter was successfully merged but no bits were changed to then return > >> false because the filter was actually incorporated. So it would go along > >> the lines that we discussed for the counting Bloom filter: > >> > >> return "true if this filter was successfully merged as a result of the > >> call" > >> > >> For most cases in the current library it would be true when an exception > >> is not thrown. However the merge of the counting Bloom filter may have > >> reason to return false, e.g. overflow. > >> > >>> On Mon, Mar 2, 2020 at 7:56 AM Claude Warren <cla...@xenei.com> wrote: > >>> > >>>> for the remove(), add(), and subtract() methods I agree that void is > not > >>>> correct and it should be boolean and be the same as the value you > would > >> get > >>>> from calling isValid(). > >>>> > >>>> You are correct the getCounts() should return an iterator of some type > >> on > >>>> int[], I don't know why I thought long[]. I am happy with a plain > >>>> Iterator<int[]> as the return. > >> For the getCounts() method I am still looking for a way around having to > >> create an <index, count> pair for everything in the filter. An > >> alternative to an iterator is to use the consumer idea. Given there is > >> no primitive specialisation of BiConsumer<T, U> in JDK 8 functions we > >> define our own: > >> > >> interface BitCountConsumer { > >> void accept(int index, int count); > >> } > >> > >> The CountingBloomFilter then has: > >> > >> void forEachCount(BitCountConsumer consumer); > >> // Or > >> void getCounts(BitCountConsumer consumer); > >> > >> > >> You can then directly pass the counts from the backing storage to the > >> destination. 
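To make "directly pass the counts from the backing storage to the destination" concrete, a minimal sketch using the void-returning consumer just described; the getCounts(BitCountConsumer) method and the receiving int[] are assumptions for illustration, not an agreed API:

    // Consumer and a minimal source abstraction as proposed above (illustrative).
    interface BitCountConsumer {
        void accept(int index, int count);
    }

    interface CountSource {
        void getCounts(BitCountConsumer consumer);
    }

    // Add the source's counts straight into a destination count array; no
    // intermediate <index, count> objects are created along the way.
    static void addCounts(CountSource source, int[] destinationCounts) {
        source.getCounts((index, count) -> destinationCounts[index] += count);
    }
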
> >>
> >> Advantages:
> >> - No object creation
> >> Disadvantages:
> >> - The counts cannot be streamed
> >>
> >> An alternative is to provide an Iterator of long with the index and count
> >> packed into a long, with methods to extract them:
> >>
> >> PrimitiveIterator.OfLong getCounts();
> >>
> >> static int getIndex(long indexCountPair) {
> >>     return (int) (indexCountPair >>> 32);
> >> }
> >>
> >> static int getCount(long indexCountPair) {
> >>     return (int) indexCountPair;
> >> }
> >>
> >> This will almost certainly be a cause of bugs/issues for users.
> >>
> >> I believe that the counts will be used for 2 main use cases:
> >>
> >> 1. Storage
> >>
> >> 2. Adding to another counting Bloom filter
> >>
> >> Both cases are likely to be done serially and not in parallel. So
> >> providing a consumer-based API to receive the counts would work.
> >>
> >> WDYT?
> >>
> >>>> Claude
> >>>>
> >>>> On Mon, Mar 2, 2020 at 1:02 AM Alex Herbert <alex.d.herb...@gmail.com>
> >>>> wrote:
> >>>>
> >>>>>> On 1 Mar 2020, at 15:39, Claude Warren <cla...@xenei.com> wrote:
> >>>>>>
> >>>>>> I think the CountingBloomFilter interface needs to extend BloomFilter.
> >>>>> I said that but did not write it, sorry.
> >>>>>
> >>>>>> I think I am confused.
> >>>>>>
> >>>>>> I would expect CountingBloomFilter to have
> >>>>>>
> >>>>>> interface CountingBloomFilter extends BloomFilter {
> >>>>>>     // these 2 methods are the reverse of merge()
> >>>>>>     void remove( BloomFilter );
> >>>>>>     void remove( Hasher );
> >>>>> Fine. Same intention but different method names. But why void? It forces
> >>>>> you to check if the remove was valid with a second call. On the plus side
> >>>>> it matches the void merge(…) methods and in most cases a user would not
> >>>>> care to check anyway. If they are controlling the filter then leave it to
> >>>>> them to make sure they do not remove something they did not add.
> >>>>>
> >>>>>>     // these 2 methods are the addition/subtraction of counts
> >>>>>>     void add( CountingBloomFilter );
> >>>>>>     void subtract( CountingBloomFilter );
> >>>>> Fine. But same comment as above with the void return.
> >>>>>
> >>>>>>     // 2 methods to retrieve data
> >>>>>>     Stream<long[]> getCounts();
> >>>>> I don’t like this use of long[]. In my previous post I argued that if you
> >>>>> were ever to want to store more than max integer items in a filter then
> >>>>> the Bloom filter would have more bit indices than max integer. So we never
> >>>>> have to support long counts. A filter that exceeds max integer for a count
> >>>>> is highly likely to be saturated and of no use as a filter anyway.
> >>>>>
> >>>>> For most backing implementations the object type of the stream will be
> >>>>> different, so you will have to write a Spliterator<T> implementation or
> >>>>> wrap some iterator anyway. So why not return the Spliterator:
> >>>>>
> >>>>> Spliterator<int[]> getCounts();
> >>>>>
> >>>>> Since the backing implementation will likely not store int[] pairs, this
> >>>>> will have a lot of object creation and garbage collection overhead to go
> >>>>> through the counts. This does not seem to be a big concern here if the
> >>>>> purpose is the same as for the BloomFilter's long[] getBits(), i.e. to
> >>>>> get a canonical representation for storage.
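For reference, a minimal sketch of the packed-long idea above. Note the iterator type is java.util.PrimitiveIterator.OfLong, and in a Java interface a method can be default or static but not both, so the helpers below are plain static; the class and its name are illustrative only, not a proposed final API:

    // Pack a bit index and its count into one long so a PrimitiveIterator.OfLong
    // can expose <index, count> pairs without boxing.
    final class IndexCountPair {
        private IndexCountPair() {}

        static long pack(int index, int count) {
            return ((long) index << 32) | (count & 0xffffffffL);
        }

        static int getIndex(long indexCountPair) {
            // The high 32 bits hold the index.
            return (int) (indexCountPair >>> 32);
        }

        static int getCount(long indexCountPair) {
            // The low 32 bits hold the count.
            return (int) indexCountPair;
        }
    }

As noted, callers that forget to unpack the value, or mix up index and count, get silently wrong results, which is the main argument against this encoding.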
> >>>>> > >>>>> Note: The Spliterator may not have a known size (number of non zero > >> bits) > >>>>> at creation, for example if the counts are stored in a fixed size > >> array. > >>>>> Thus efficient parallel traversal by binary tree splitting is limited > >> by > >>>>> how evenly the counts are distributed. For a backing implementation > >> using a > >>>>> collection then the size should be known. In this case a Spliterator > >> would > >>>>> be of more use than a plain Iterator. You can convert one to the > other > >>>>> using: > >>>>> > >>>>> java.util.Spliterators: > >>>>> public static<T> Iterator<T> iterator(Spliterator<? extends T> > >>>>> spliterator) > >>>>> public static <T> Spliterator<T> spliterator(Iterator<? extends T> > >>>>> iterator, > >>>>> long size, > >>>>> int > characteristics) > >>>>> > >>>>> So which to choose for the API? > >>>>> > >>>>> The Hasher currently uses an Iterator: > >>>>> > >>>>> PrimitiveIterator.OfInt getBits(Shape shape); > >>>>> > >>>>> In the case of a StaticHasher this could return a spliterator. But > the > >>>>> DynamicHasher needs a reworking of the internal Iterator class. It > >> could be > >>>>> a Spliterator to use the new IntConsumer API but in most (all) cases > >>>>> splitting would not be possible for dynamic hashing as the parts are > >>>>> produced in order. It is likely that they will be consumed > >> sequentially too. > >>>>> I would suggest that Spliterator is the more modern implementation, > >>>>> despite not always being applicable to parallelisation in a stream. > >>>>> Currently the iterator from the Hasher is used in forEachRemaining() > >> and > >>>>> while loop is approximately equal measure. The while loops are for a > >> fast > >>>>> exit and would be uglier if rewritten for a > >>>>> Spliterator.tryAdvance(IntConsumer) syntax. > >>>>> > >>>>> There is a use of the IteratorChain in HasherBloomFilter that would > >> need > >>>>> a rethink for spliterators. > >>>>> > >>>>> The path of least resistance is to use Iterator<int[]> for the API of > >>>>> CountingBloomFilter to be consistent with Hasher’s use of Iterator. > >>>>> > >>>>> WDYT? > >>>>> > >>>>> > >>>>>> boolean isValid() > >>>>> Fine. Allows some level of feedback that the counts are corrupt. > >>>>> > >>>>>> } > >>>>>> > >>>>>> Claude > >>>>>> > >>>>>> > >>>>>> On Sun, Mar 1, 2020 at 2:48 PM Alex Herbert < > alex.d.herb...@gmail.com > >>>>> <mailto:alex.d.herb...@gmail.com>> > >>>>>> wrote: > >>>>>> > >>>>>>>> On 1 Mar 2020, at 09:28, Claude Warren <cla...@xenei.com> wrote: > >>>>>>>> > >>>>>>>> The idea of a backing array is fine and the only problem I see > with > >>>>> it is > >>>>>>>> in very large filters (on the order of 10^8 bits and larger) but > >>>>> document > >>>>>>>> the size calculation and let the developer worry about it. > >>>>>>> Let us look at the use case where we max out the array. Using the > >> Bloom > >>>>>>> filter calculator: > >>>>>>> > >>>>>>> n = 149,363,281 > >>>>>>> p = 0.001000025 (1 in 1000) > >>>>>>> m = 2147483647 (256MiB) > >>>>>>> k = 10 > >>>>>>> > >>>>>>> n = 74,681,641 > >>>>>>> p = 0.000001 (1 in 999950) > >>>>>>> m = 2147483647 (256MiB) > >>>>>>> k = 20 > >>>>>>> > >>>>>>> n = 49,787,761 > >>>>>>> p = 0.000000001 (1 in 999924899) > >>>>>>> m = 2147483647 (256MiB) > >>>>>>> k = 30 > >>>>>>> > >>>>>>> So you will be able to put somewhere in the order of 10^8 or 10^7 > >> items > >>>>>>> into the filter. I would say that anyone putting more than that > into > >>>>> the > >>>>>>> filter has an unusual use case. 
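Those capacity figures can be reproduced from the usual approximation p = (1 - e^(-kn/m))^k, rearranged to n = -(m / k) * ln(1 - p^(1/k)); a small illustrative helper (not library code), with the calculator rows above shown as comments:

    // Maximum number of items n that a filter of m bits with k hash functions can
    // hold at false-positive probability p.
    static long maxItems(long m, int k, double p) {
        return (long) Math.floor(-((double) m / k) * Math.log(1.0 - Math.pow(p, 1.0 / k)));
    }

    // maxItems(2147483647L, 10, 0.001) ~ 1.49e8 (matches n = 149,363,281 above)
    // maxItems(2147483647L, 20, 1e-6)  ~ 7.47e7 (matches n = 74,681,641 above)
    // maxItems(2147483647L, 30, 1e-9)  ~ 4.98e7 (matches n = 49,787,761 above)
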
The CountingBloomFilter can throw > an > >>>>>>> exception if m is too large and will throw an OutOfMemoryError if > you > >>>>>>> cannot allocate an array large enough. > >>>>>>> > >>>>>>> One clear point here is that you cannot use a short as a 16-bit > count > >>>>>>> would easily overflow. So you have to use an integer array for the > >>>>> counts. > >>>>>>> A maximum length int[] is roughly 8GB. > >>>>>>> > >>>>>>> What would another implementation cost in terms of memory? The > >>>>>>> TreeMap<Integer, Integer> was the most space efficient. In the > >> previous > >>>>>>> e-mail the saturation of a Bloom filter bits was approximately 50% > >>>>> when at > >>>>>>> the intended capacity. So we have to estimate the size of a TreeMap > >>>>>>> containing Integer.MAX_VALUE/2 indices ~ 2^30. The memory test > shows > >>>>> the > >>>>>>> TreeMap memory scales linearly with entries: > >>>>>>> > >>>>>>> 32768 / 65536 (0.500) : TreeMap<Integer, Integer> = 1834061 > >>>>> bytes > >>>>>>> 65536 / 131072 (0.500) : TreeMap<Integer, Integer> = 3669080 > >>>>> bytes > >>>>>>> 131072 / 262144 (0.500) : TreeMap<Integer, Integer> = 7339090 > >>>>> bytes > >>>>>>> So what is the memory for a TreeMap with 2^30 indices. I make it > >> about: > >>>>>>> (2^30 / 131,072) * 7,339,090 bytes ~ 6e10 bytes = 55.99 GB > >>>>>>> > >>>>>>> I would say that this amount of RAM is unusual. It is definitely > not > >> as > >>>>>>> efficient as using an array. So very large counting Bloom filters > are > >>>>> going > >>>>>>> to require some thought as to the hardware they run on. This may > not > >>>>> be the > >>>>>>> case in 10 years time. > >>>>>>> > >>>>>>> I would say that we try an int[] backing array for the storage > >>>>>>> implementation and document it’s limitations. A different > >>>>> implementation > >>>>>>> could be provided in future if required. > >>>>>>> > >>>>>>> This could be done by making CountingBloomFilter an interface that > >>>>> extends > >>>>>>> BloomFilter with the methods: > >>>>>>> > >>>>>>> subtract(BloomFilter filter) > >>>>>>> subtract(Hasher filter) > >>>>>>> > >>>>>>> These will negate the effect of the corresponding > merge(BloomFilter) > >>>>>>> operation. > >>>>>>> > >>>>>>> Do we also need access to the counts and add/subtract of another > >>>>>>> CountingBloomFilter?: > >>>>>>> > >>>>>>> add(CountingBloomFilter filter); > >>>>>>> subtract(CountingBloomFilter filter); > >>>>>>> > >>>>>>> Iterator<int[]> getCounts(); > >>>>>>> int getSize(); // Number of items added > >>>>>>> > >>>>>>> The CountingBloomFilter is then an interface that defines how to > >>>>> reverse > >>>>>>> the merge of some bits into the filter. > >>>>>>> > >>>>>>> My concern is the inefficiency of the creation of objects in any > >> method > >>>>>>> that provides access to the counts (e.g. above using an iterator as > >> for > >>>>>>> Hasher.getBits()). I presume this method would be to allow some > type > >> of > >>>>>>> storage/serialisation of the filter akin to the long[] getBits() > >>>>> method of > >>>>>>> BloomFilter. So it may be better done using a method: > >>>>>>> > >>>>>>> int getCount(int index); > >>>>>>> > >>>>>>> The caller can then use long[] getBits() to get the indices set in > >> the > >>>>>>> filter and then for each non-zero bit index call getCount(index). > Or > >>>>> just > >>>>>>> not include the method as the counts are only of concern when > storing > >>>>> the > >>>>>>> filter. This functionality is cleaner pushed into an > implementation. 
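As an illustration of the storage use case just mentioned, a sketch that serialises <index, count> pairs by walking long[] getBits() and calling the proposed int getCount(int index); the CountingBits interface below is a stand-in for the API being discussed, not an existing type:

    import java.io.DataOutputStream;
    import java.io.IOException;

    // Stand-in for the API under discussion: the bit words plus a count accessor.
    interface CountingBits {
        long[] getBits();         // as on BloomFilter
        int getCount(int index);  // proposed accessor
    }

    // Write each <index, count> pair in ascending index order, giving a
    // canonical form suitable for storage.
    static void writeCounts(CountingBits filter, DataOutputStream out) throws IOException {
        long[] bits = filter.getBits();
        for (int i = 0; i < bits.length; i++) {
            long word = bits[i];
            while (word != 0) {
                int index = i * Long.SIZE + Long.numberOfTrailingZeros(word);
                out.writeInt(index);
                out.writeInt(filter.getCount(index));
                word &= word - 1; // clear the lowest set bit
            }
        }
    }
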
> >>>>>>> > >>>>>>> In a previous post we discussed whether to throw an exception on > >>>>>>> overflow/underflow or raise in an invalid flag. Using the invalid > >> flag > >>>>> idea > >>>>>>> the interface would be: > >>>>>>> > >>>>>>> interface CountingBloomFilter { > >>>>>>> int add(CountingBloomFilter filter); > >>>>>>> int subtract(BloomFilter filter); > >>>>>>> int subtract(Hasher filter); > >>>>>>> int subtract(CountingBloomFilter filter); > >>>>>>> int getStatus(); > >>>>>>> > >>>>>>> // Maybe > >>>>>>> int getSize(); > >>>>>>> int getCount(int index); > >>>>>>> } > >>>>>>> > >>>>>>> The status would be negative if any operation > overflowed/underflowed, > >>>>> or > >>>>>>> zero if OK. The current status is returned by the add/subtract > >> methods. > >>>>>>> However I note that overflow may not be a concern. The number of > >> items > >>>>> to > >>>>>>> add to a filter to create overflow would be using a filter with a > >>>>> number of > >>>>>>> bits that is unrealistic to store in memory: > >>>>>>> > >>>>>>> n = 2147483647 > >>>>>>> p = 0.001000025 (1 in 1000) > >>>>>>> m = 30875634182 (3.59GiB) > >>>>>>> k = 10 > >>>>>>> > >>>>>>> If you want to add 2 billion items (and overflow an integer count) > >> then > >>>>>>> your filter would be so big it would break the rest of the API that > >>>>> uses a > >>>>>>> 32-bit int for the bit index. > >>>>>>> > >>>>>>> Thus only underflow is a realistic concern. This could be > documented > >> as > >>>>>>> handled in an implementation specific manner (i.e. throw or > ignore). > >>>>> The > >>>>>>> API is then simplified to: > >>>>>>> > >>>>>>> interface CountingBloomFilter { > >>>>>>> boolean add(CountingBloomFilter filter); > >>>>>>> boolean subtract(BloomFilter filter); > >>>>>>> boolean subtract(Hasher filter); > >>>>>>> boolean subtract(CountingBloomFilter filter); > >>>>>>> int getStatus(); > >>>>>>> > >>>>>>> // Maybe > >>>>>>> int getSize(); > >>>>>>> int getCount(int index); > >>>>>>> } > >>>>>>> > >>>>>>> The boolean is used to state that add/subtract did not > >> over/underflow. > >>>>>>> Implementations can throw if they require it. > >>>>>>> > >>>>>>> The question then becomes what does getSize() represent if an > >>>>> add/subtract > >>>>>>> did not execute cleanly. Under this scheme it would be the number > of > >>>>> (add - > >>>>>>> subtract) operations. The status flag would be used to indicate if > >> the > >>>>> size > >>>>>>> is valid, or any of the counts from getCount(). The simpler API is > to > >>>>> not > >>>>>>> allow access to counts/size or adding/subtracting counts: > >>>>>>> > >>>>>>> interface CountingBloomFilter { > >>>>>>> boolean subtract(BloomFilter filter); > >>>>>>> boolean subtract(Hasher filter); > >>>>>>> int getStatus(); > >>>>>>> // Or something like ... > >>>>>>> boolean isValid(); > >>>>>>> } > >>>>>>> > >>>>>>> This filter is then only concerned with reversing the merge of > Bloom > >>>>>>> filters with a valid status flag to indicate that the current state > >> is > >>>>>>> consistent (i.e. all filters have been cleanly merged/subtracted). > >>>>>>> > >>>>>>> WDYT? > >>>>>>> > >>>>>>>> As for the merge question. merge is a standard bloom filter > >>>>> operation. > >>>>>>> It > >>>>>>>> is well defined in the literature. Merging a bloom filter into a > >>>>>>> counting > >>>>>>>> bloom filter means incrementing the bit counts. I think that > >>>>>>> merge/remove > >>>>>>>> should continue to operate as though the parameter were a standard > >>>>> bloom > >>>>>>>> filter. 
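A sketch of that behaviour: merge and remove treat the argument as a plain set of bits, incrementing or decrementing one count per set bit, with underflow recorded in a validity flag rather than thrown. All names here are illustrative, not the library API:

    // Hypothetical int[]-backed counting filter: merge increments, remove decrements.
    class SimpleCountingFilter {
        private final int[] counts;
        private boolean valid = true;

        SimpleCountingFilter(int numberOfBits) {
            counts = new int[numberOfBits];
        }

        // Increment the count of every index set in the other filter's bit words.
        void merge(long[] bits) {
            forEachSetBit(bits, index -> counts[index]++);
        }

        // Decrement the counts; a decrement below zero marks the filter invalid.
        void remove(long[] bits) {
            forEachSetBit(bits, index -> {
                if (--counts[index] < 0) {
                    valid = false;
                }
            });
        }

        boolean isValid() {
            return valid;
        }

        private static void forEachSetBit(long[] bits, java.util.function.IntConsumer action) {
            for (int i = 0; i < bits.length; i++) {
                long word = bits[i];
                while (word != 0) {
                    action.accept(i * Long.SIZE + Long.numberOfTrailingZeros(word));
                    word &= word - 1; // clear the lowest set bit
                }
            }
        }
    }
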
> >>>>>>>> > >>>>>>> OK. So the count is to represent the number of filters that had a > bit > >>>>> set > >>>>>>> at that index. This makes it more clear. > >>>>>>> > >>>>>>>> We had spoken of implementing and adding/deleting method pair that > >>>>> would > >>>>>>>> operate on CountingBloomFilters and would add/subtract the counts. > >>>>> (e.g. > >>>>>>>> add(CountingBloomFilter) and subtract(CountingBloomFilter)) > >>>>>>>> > >>>>>>>> I disagree with your proposal for the merge(Hasher) > implementation, > >>>>> and I > >>>>>>>> am not certain that an add(Hasher) makes sense. First consider > that > >>>>> the > >>>>>>>> Hasher returns the bits that are to be enabled in the Bloom filter > >> so > >>>>>>>> collisions are expected. In the normal course of events a Hasher > is > >>>>> used > >>>>>>>> to create a normal Bloom filter where all the duplicates are > >> removed. > >>>>>>> That > >>>>>>>> filter is then merged into a CountingBloomFilter. So in some > sense > >>>>> the > >>>>>>>> Hasher and the normal Bloom filter are the same. So I would > expect > >>>>> the > >>>>>>>> merge of the Hasher and the merge of the normal Bloom filter > created > >>>>> from > >>>>>>>> that hasher into a CountingBloomFilter to yield the same result. > If > >>>>> you > >>>>>>>> wanted to add an add(Hasher)/delete(Hasher) pair of functions to a > >>>>>>>> CountingBloomFilter you could implement with duplicate counting, > but > >>>>> I am > >>>>>>>> not certain of the validity of such a count and I fear that it > >> muddies > >>>>>>> the > >>>>>>>> waters with respect to what the CountingBloomFilter is counting. > >>>>>>> Agreed. > >>>>>>> > >>>>>>>> Claude > >>>>>>>> > >>>>>>>> On Sat, Feb 29, 2020 at 2:13 PM Alex Herbert < > >>>>> alex.d.herb...@gmail.com > >>>>>>> <mailto:alex.d.herb...@gmail.com <mailto:alex.d.herb...@gmail.com > >>> > >>>>>>>> wrote: > >>>>>>>> > >>>>>>>>>> On 29 Feb 2020, at 07:46, Claude Warren <cla...@xenei.com > >> <mailto: > >>>>> cla...@xenei.com> <mailto: > >>>>>>> cla...@xenei.com <mailto:cla...@xenei.com>>> wrote: > >>>>>>>>>> Alex would you take a look at pull request 131 [1]. it adds a > new > >>>>>>> hasher > >>>>>>>>>> implementation and makes the HashFunctionValidator available for > >>>>> public > >>>>>>>>> use. > >>>>>>>>>> https://github.com/apache/commons-collections/pull/131 < > >>>>> https://github.com/apache/commons-collections/pull/131> < > >>>>>>> https://github.com/apache/commons-collections/pull/131 < > >>>>> https://github.com/apache/commons-collections/pull/131>> < > >>>>>>>>> https://github.com/apache/commons-collections/pull/131 < > >>>>> https://github.com/apache/commons-collections/pull/131> < > >>>>>>> https://github.com/apache/commons-collections/pull/131 < > >>>>> https://github.com/apache/commons-collections/pull/131>>> > >>>>>>>>> OK. I’ll take a look. > >>>>>>>>> > >>>>>>>>> I’ve been thinking about the counting Bloom filter and the > backing > >>>>>>>>> storage. In summary: > >>>>>>>>> > >>>>>>>>> 1. The backing storage should be a fixed array. > >>>>>>>>> 2. Merging a Hasher should count duplicate indices, not flatten > >> them > >>>>> all > >>>>>>>>> to a single count. > >>>>>>>>> > >>>>>>>>> For background I’ve used the standard formulas to estimate the > >>>>> number of > >>>>>>>>> indices that will be non-zero in a Bloom filter. 
The wikipedia > page > >>>>>>> gives > >>>>>>>>> this formula for the expected number of bits set to 0 (E(q)) if > you > >>>>> have > >>>>>>>>> inserted i elements into a filter of size m using k hash > functions: > >>>>>>>>> > >>>>>>>>> E(q) = (1 - 1/m)^ki" > >>>>>>>>> > >>>>>>>>> So a rough guess of the number of indices (bits) used by a filter > >> is > >>>>>>>>> 1-E(q). > >>>>>>>>> > >>>>>>>>> Here is a table of Bloom filters with different collision > >>>>> probabilities > >>>>>>>>> and the proportion of bits that will be set when 1%, 10%, 100% of > >> the > >>>>>>>>> capacity of the filter has been met: > >>>>>>>>> > >>>>>>>>> n p m k I E(q) bits > >>>>>>>>> 1000 1E-04 19171 13 10 0.9932 0.0068 > >>>>>>>>> 1000 1E-04 19171 13 100 0.9344 0.0656 > >>>>>>>>> 1000 1E-04 19171 13 1000 0.5076 0.4924 > >>>>>>>>> 1000 1E-05 23963 17 10 0.9929 0.0071 > >>>>>>>>> 1000 1E-05 23963 17 100 0.9315 0.0685 > >>>>>>>>> 1000 1E-05 23963 17 1000 0.4919 0.5081 > >>>>>>>>> 1000 1E-06 28756 20 10 0.9931 0.0069 > >>>>>>>>> 1000 1E-06 28756 20 100 0.9328 0.0672 > >>>>>>>>> 1000 1E-06 28756 20 1000 0.4988 0.5012 > >>>>>>>>> 10000 1E-06 287552 20 100 0.9931 0.0069 > >>>>>>>>> 10000 1E-06 287552 20 1000 0.9328 0.0672 > >>>>>>>>> 10000 1E-06 287552 20 10000 0.4988 0.5012 > >>>>>>>>> > >>>>>>>>> The point is that if you create a Bloom filter and fill it to 10% > >> of > >>>>> the > >>>>>>>>> intended capacity the number of indices used will be about 6-7% > of > >>>>> the > >>>>>>>>> filter bits. > >>>>>>>>> > >>>>>>>>> So how to store the counts? Currently the counting bloom filter > >> uses > >>>>> a > >>>>>>>>> TreeMap<Integer, Integer>. I tried: > >>>>>>>>> > >>>>>>>>> TreeMap<Integer, Integer> > >>>>>>>>> HashMap<Integer, Integer> > >>>>>>>>> TreeSet<MutableCount> > >>>>>>>>> HashSet<MutableCount> > >>>>>>>>> int[] > >>>>>>>>> > >>>>>>>>> The MutableCount is a custom object that stores the bit index and > >>>>> uses > >>>>>>> it > >>>>>>>>> for a hash code and then has a mutable integer count field. It > >> allows > >>>>>>> the > >>>>>>>>> count to be incremented/decremented if the object is in the set: > >>>>>>>>> > >>>>>>>>> static final class MutableCount implements > >>>>> Comparable<MutableCount> { > >>>>>>>>> final int key; > >>>>>>>>> int count; > >>>>>>>>> // etc > >>>>>>>>> } > >>>>>>>>> > >>>>>>>>> This is adapted from the Bag<T> collection which stores an item > >> count > >>>>>>> with > >>>>>>>>> a MutableInteger. Here the mutable count is part of the same > >> object T > >>>>>>>>> inserted into the Set. So you can find the object, change the > count > >>>>> and > >>>>>>> not > >>>>>>>>> have to put it back into the set. This is more efficient than > >>>>> changing > >>>>>>> the > >>>>>>>>> Integer stored in a Map. > >>>>>>>>> > >>>>>>>>> I’ve estimated memory usage using an idea based on this article > >> from > >>>>>>>>> JavaWorld: Java Tip 130: Do you know your data size? [1]. > >>>>>>>>> > >>>>>>>>> Basically you: > >>>>>>>>> > >>>>>>>>> - create an object and throw it away. All classes are then > >>>>> initialised. 
> >>>>>>>>> - Then you free memory (run garbage collection) and get the > current > >>>>>>> memory > >>>>>>>>> usage > >>>>>>>>> - Then create a lot of your object (held in an array) > >>>>>>>>> - Then measure memory usage again > >>>>>>>>> - memory = (after - before) / count > >>>>>>>>> > >>>>>>>>> Here is some example output for n bits set in size m: > >>>>>>>>> > >>>>>>>>> 13107 / 262144 (0.050) : TreeMap<Integer, Integer> = > 733947 > >>>>> bytes > >>>>>>>>> 26214 / 262144 (0.100) : TreeMap<Integer, Integer> = > 1467866 > >>>>> bytes > >>>>>>>>> 13107 / 262144 (0.050) : TreeSet<MutableCount> = > 838928 > >>>>> bytes > >>>>>>>>> 26214 / 262144 (0.100) : TreeSet<MutableCount> = > 1677776 > >>>>> bytes > >>>>>>>>> 13107 / 262144 (0.050) : HashMap<Integer, Integer> = > 1677712 > >>>>> bytes > >>>>>>>>> 26214 / 262144 (0.100) : HashMap<Integer, Integer> = > 2306739 > >>>>> bytes > >>>>>>>>> 13107 / 262144 (0.050) : HashSet<MutableCount> = > 1782664 > >>>>> bytes > >>>>>>>>> 26214 / 262144 (0.100) : HashSet<MutableCount> = > 2516656 > >>>>> bytes > >>>>>>>>> 0 / 262144 (0.000) : int[] = > 1048608 > >>>>> bytes > >>>>>>>>> 0 / 262144 (0.000) : short[] = > 524320 > >>>>> bytes > >>>>>>>>> The estimate is accurate to 0.0001% for the arrays so the method > is > >>>>>>>>> working. The HashMap was created with the capacity set to the > >>>>> expected > >>>>>>>>> capacity of the filter (m). > >>>>>>>>> > >>>>>>>>> I’ve chosen these sizes because at 5% full a HashSet is less > memory > >>>>>>>>> efficient than using a fixed size array, and at 10% the TreeSet > is > >>>>> also > >>>>>>>>> less efficient. > >>>>>>>>> > >>>>>>>>> Note that the java.util.Tree/HashSet versions just wrap a Map and > >>>>>>> insert a > >>>>>>>>> dummy object for all keys in the Map. So here a Set is not as > >>>>> efficient > >>>>>>> as > >>>>>>>>> a Map because in the Map test I always inserted the same Integer > >>>>> object > >>>>>>>>> representing 1. This would be the same as using a Set with an > >> Integer > >>>>>>> key > >>>>>>>>> but here the Set had to contain the MutableCount which has an > extra > >>>>> int > >>>>>>>>> field and is larger than an Integer. > >>>>>>>>> > >>>>>>>>> These data lead me to think that a counting Bloom filter should > >> just > >>>>>>> use a > >>>>>>>>> fixed size backing array: > >>>>>>>>> > >>>>>>>>> - If created using the same Shape as a standard Bloom filter it > >> uses > >>>>> a > >>>>>>>>> fixed size. This has high memory cost when the filter is empty > but > >>>>> when > >>>>>>> it > >>>>>>>>> exceeds 10% of the intended capacity it is more efficient than a > >>>>> dynamic > >>>>>>>>> backing storage. > >>>>>>>>> - All operations will operate in order(n) time for an operation > >> with > >>>>>>>>> another filter with n indices. Each individual index count in the > >>>>> filter > >>>>>>>>> will have order(1) time for access/update. Performance will be > >>>>> limited > >>>>>>> by > >>>>>>>>> the memory cache of the entire array. > >>>>>>>>> > >>>>>>>>> The issue is that a counting Bloom filter with a very low number > of > >>>>>>>>> inserted items will be memory inefficient. But under what > >>>>> circumstance > >>>>>>> will > >>>>>>>>> such a filter be used in a short-term lifecycle? If it is simply > to > >>>>>>> merge > >>>>>>>>> into another filter then this can be done using a merge with a > >>>>> Hasher. 
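For reference, a rough sketch of the measurement procedure described above (allocate many instances and difference the used heap, as in the JavaWorld tip cited below); the numbers are estimates that depend on the JVM, and the factory/class being measured is a stand-in:

    // Estimate the per-instance memory cost of objects produced by the factory.
    static long estimateBytesPerInstance(int count, java.util.function.Supplier<Object> factory) {
        Runtime rt = Runtime.getRuntime();
        Object[] hold = new Object[count];   // allocate the holder before measuring
        runGc(rt);
        long before = rt.totalMemory() - rt.freeMemory();
        for (int i = 0; i < count; i++) {
            hold[i] = factory.get();         // keep strong references
        }
        runGc(rt);
        long after = rt.totalMemory() - rt.freeMemory();
        if (hold[0] == null) {               // never true for a non-null factory;
            throw new AssertionError();      // prevents the allocations being optimised away
        }
        return (after - before) / count;
    }

    private static void runGc(Runtime rt) {
        for (int i = 0; i < 4; i++) {
            rt.gc();
            try {
                Thread.sleep(100);           // give the collector a moment to run
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }
    }
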
> >>>>>>>>> If counts are to be merged then perhaps we provide a method to merge
> >>>>>>>>> counts using the same data structure returned by the CountingBloomFilter
> >>>>>>>>> getCounts() method, e.g. using a stream of <index, count> pairs:
> >>>>>>>>>
> >>>>>>>>> Stream<int[]> getCounts();
> >>>>>>>>> void add(Stream<int[]> counts);
> >>>>>>>>>
> >>>>>>>>> The issue here is that the Shape and HashFunctionIdentity of the origin of
> >>>>>>>>> the merge cannot be validated. So just leave it out and use the merge with
> >>>>>>>>> a Hasher.
> >>>>>>>>>
> >>>>>>>>> Thus the next issue with the counting Bloom filter implementation.
> >>>>>>>>> Currently when it merges with a Hasher it puts all the indices into a Set
> >>>>>>>>> and so will only increment the count by 1 for each index identified by the
> >>>>>>>>> Hasher. This appears to miss the entire point of the counting Bloom filter.
> >>>>>>>>> If I hash an object to generate k indices I would hope that I do get k
> >>>>>>>>> indices. But the hash may not be perfect and I may get [1, k] indices with
> >>>>>>>>> some duplications. This is part of the signature of that object with the
> >>>>>>>>> given hash. So surely a counting Bloom filter should accommodate this. If
> >>>>>>>>> my Hasher generates the same index 20 times this should result in the
> >>>>>>>>> count of that index incrementing by 20.
> >>>>>>>>>
> >>>>>>>>> The result is that if an object is added directly to a counting Bloom
> >>>>>>>>> filter using a Hasher it will have a different result than if it is added
> >>>>>>>>> to a standard Bloom filter and that filter is then added to the counting
> >>>>>>>>> Bloom filter.
> >>>>>>>>>
> >>>>>>>>> Opinions on this?
> >>>>>>>>>
> >>>>>>>>> Alex
> >>>>>>>>>
> >>>>>>>>> [1] http://www.javaworld.com/javaworld/javatips/jw-javatip130.html
> >>>>>>>>>
> >>>>>>>>>> On Tue, Feb 25, 2020 at 12:35 AM Alex Herbert <alex.d.herb...@gmail.com>
> >>>>>>>>>> wrote:
> >>>>>>>>>>
> >>>>>>>>>>> I have created a PR that contains most of the changes discussed in this
> >>>>>>>>>>> thread (see [1]).
> >>>>>>>>>>>
> >>>>>>>>>>> Please review the changes and comment here or on GitHub.
> >>>>>>>>>>>
> >>>>>>>>>>> I have left the CountingBloomFilter alone. I will reimplement the
> >>>>>>>>>>> add/subtract functionality as discussed in another PR.
> >>>>>>>>>>>
> >>>>>>>>>>> Alex
> >>>>>>>>>>>
> >>>>>>>>>>> [1] https://github.com/apache/commons-collections/pull/133
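Returning to the duplicate-counting question raised above, a small sketch contrasting a merge that counts every index produced by a Hasher with one that flattens the indices through a Set first. The int[] of counts stands in for the filter's backing storage, and the PrimitiveIterator.OfInt mirrors the values produced by Hasher.getBits(Shape):

    import java.util.PrimitiveIterator;

    // Each index produced by the hasher increments its count, so an index that is
    // produced 20 times adds 20 to that count.
    static void mergeCountingDuplicates(int[] counts, PrimitiveIterator.OfInt indices) {
        while (indices.hasNext()) {
            counts[indices.nextInt()]++;
        }
    }

    // Flattening through a Set, by contrast, increments each distinct index once,
    // which is the current behaviour being questioned.
    static void mergeDistinct(int[] counts, PrimitiveIterator.OfInt indices) {
        java.util.Set<Integer> distinct = new java.util.TreeSet<>();
        while (indices.hasNext()) {
            distinct.add(indices.nextInt());
        }
        for (int index : distinct) {
            counts[index]++;
        }
    }
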