I think the CountingBloomFilter interface needs to extend BloomFilter, but perhaps I am confused.
I would expect CountingBloomFilter to have:

    interface CountingBloomFilter extends BloomFilter {
        // these 2 methods are the reverse of merge()
        void remove( BloomFilter filter );
        void remove( Hasher hasher );

        // these 2 methods are the addition/subtraction of counts
        void add( CountingBloomFilter filter );
        void subtract( CountingBloomFilter filter );

        // 2 methods to retrieve data
        Stream<long[]> getCounts();
        boolean isValid();
    }

Claude

On Sun, Mar 1, 2020 at 2:48 PM Alex Herbert <alex.d.herb...@gmail.com> wrote:
>
>
> > On 1 Mar 2020, at 09:28, Claude Warren <cla...@xenei.com> wrote:
> >
> > The idea of a backing array is fine and the only problem I see with it is
> > in very large filters (on the order of 10^8 bits and larger), but document
> > the size calculation and let the developer worry about it.
>
> Let us look at the use case where we max out the array. Using the Bloom
> filter calculator:
>
> n = 149,363,281
> p = 0.001000025 (1 in 1000)
> m = 2147483647 (256MiB)
> k = 10
>
> n = 74,681,641
> p = 0.000001 (1 in 999950)
> m = 2147483647 (256MiB)
> k = 20
>
> n = 49,787,761
> p = 0.000000001 (1 in 999924899)
> m = 2147483647 (256MiB)
> k = 30
>
> So you will be able to put somewhere in the order of 10^7 to 10^8 items
> into the filter. I would say that anyone putting more than that into the
> filter has an unusual use case. The CountingBloomFilter can throw an
> exception if m is too large, and will throw an OutOfMemoryError if you
> cannot allocate an array large enough.
>
> One clear point here is that you cannot use a short, as a 16-bit count
> would easily overflow. So you have to use an integer array for the counts.
>
> A maximum length int[] is roughly 8GB.
>
> What would another implementation cost in terms of memory? The
> TreeMap<Integer, Integer> was the most space efficient. In the previous
> e-mail the saturation of the Bloom filter's bits was approximately 50% when
> at the intended capacity. So we have to estimate the size of a TreeMap
> containing Integer.MAX_VALUE/2 indices ~ 2^30. The memory test shows that
> TreeMap memory scales linearly with entries:
>
> 32768 / 65536 (0.500) : TreeMap<Integer, Integer> = 1834061 bytes
> 65536 / 131072 (0.500) : TreeMap<Integer, Integer> = 3669080 bytes
> 131072 / 262144 (0.500) : TreeMap<Integer, Integer> = 7339090 bytes
>
> So what is the memory for a TreeMap with 2^30 indices? I make it about:
>
> (2^30 / 131,072) * 7,339,090 bytes ~ 6e10 bytes = 55.99 GiB
>
> I would say that this amount of RAM is unusual. It is definitely not as
> efficient as using an array. So very large counting Bloom filters are going
> to require some thought as to the hardware they run on. This may not be the
> case in 10 years time.
>
> I would say that we try an int[] backing array for the storage
> implementation and document its limitations. A different implementation
> could be provided in future if required.
>
> This could be done by making CountingBloomFilter an interface that extends
> BloomFilter with the methods:
>
> subtract(BloomFilter filter)
> subtract(Hasher filter)
>
> These will negate the effect of the corresponding merge(BloomFilter)
> operation.
>
> Do we also need access to the counts and add/subtract of another
> CountingBloomFilter?:
>
> add(CountingBloomFilter filter);
> subtract(CountingBloomFilter filter);
>
> Iterator<int[]> getCounts();
> int getSize(); // Number of items added
>
> The CountingBloomFilter is then an interface that defines how to reverse
> the merge of some bits into the filter.
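To put those two storage costs side by side, a back-of-envelope snippet
(purely illustrative, not part of any proposed API):

    // Raw storage for an int[] of counts at the maximum bit index,
    // versus the TreeMap estimate extrapolated from the table above.
    long m = Integer.MAX_VALUE;                   // 2147483647 bit indices
    long intArrayBytes = m * 4L;                  // ~8 GiB of count storage
    long treeMapBytes = (1L << 30) / 131072L * 7339090L;
    // treeMapBytes ~ 6e10 bytes, i.e. ~56 GiB at 50% saturation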
>
> My concern is the inefficiency of creating objects in any method
> that provides access to the counts (e.g. above, using an iterator as for
> Hasher.getBits()). I presume this method would be to allow some type of
> storage/serialisation of the filter akin to the long[] getBits() method of
> BloomFilter. So it may be better done using a method:
>
> int getCount(int index);
>
> The caller can then use long[] getBits() to get the indices set in the
> filter and then, for each non-zero bit index, call getCount(index). Or just
> not include the method, as the counts are only of concern when storing the
> filter. This functionality is cleaner pushed into an implementation.
>
> In a previous post we discussed whether to throw an exception on
> overflow/underflow or raise an invalid flag. Using the invalid flag idea
> the interface would be:
>
> interface CountingBloomFilter {
>     int add(CountingBloomFilter filter);
>     int subtract(BloomFilter filter);
>     int subtract(Hasher filter);
>     int subtract(CountingBloomFilter filter);
>     int getStatus();
>
>     // Maybe
>     int getSize();
>     int getCount(int index);
> }
>
> The status would be negative if any operation overflowed/underflowed, or
> zero if OK. The current status is returned by the add/subtract methods.
>
> However, I note that overflow may not be a concern. Creating overflow
> would require adding so many items that the filter would need a number of
> bits that is unrealistic to store in memory:
>
> n = 2147483647
> p = 0.001000025 (1 in 1000)
> m = 30875634182 (3.59GiB)
> k = 10
>
> If you want to add 2 billion items (and overflow an integer count) then
> your filter would be so big it would break the rest of the API, which uses
> a 32-bit int for the bit index.
>
> Thus only underflow is a realistic concern. This could be documented as
> handled in an implementation-specific manner (i.e. throw or ignore). The
> API is then simplified to:
>
> interface CountingBloomFilter {
>     boolean add(CountingBloomFilter filter);
>     boolean subtract(BloomFilter filter);
>     boolean subtract(Hasher filter);
>     boolean subtract(CountingBloomFilter filter);
>     int getStatus();
>
>     // Maybe
>     int getSize();
>     int getCount(int index);
> }
>
> The boolean is used to state that add/subtract did not over/underflow.
> Implementations can throw if they require it.
>
> The question then becomes what getSize() represents if an add/subtract
> did not execute cleanly. Under this scheme it would be the number of
> (add - subtract) operations. The status flag would be used to indicate
> whether the size, or any of the counts from getCount(), is valid. The
> simpler API is to not allow access to counts/size or adding/subtracting
> counts:
>
> interface CountingBloomFilter {
>     boolean subtract(BloomFilter filter);
>     boolean subtract(Hasher filter);
>     int getStatus();
>     // Or something like ...
>     boolean isValid();
> }
>
> This filter is then only concerned with reversing the merge of Bloom
> filters, with a valid status flag to indicate that the current state is
> consistent (i.e. all filters have been cleanly merged/subtracted).
>
> WDYT?
>
> > As for the merge question: merge is a standard Bloom filter operation. It
> > is well defined in the literature. Merging a Bloom filter into a counting
> > Bloom filter means incrementing the bit counts. I think that merge/remove
> > should continue to operate as though the parameter were a standard Bloom
> > filter.
>
> OK. So the count is to represent the number of filters that had a bit set
> at that index. This makes it clearer.
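A minimal sketch of how an int[]-backed implementation might reverse a merge
under the invalid-flag scheme above (the method shape and names are assumed
for illustration, not library code):

    // Decrement the count at each index of a previously merged filter.
    // A negative count means this subtraction does not correspond to an
    // earlier merge, so the result is flagged invalid (an implementation
    // could equally choose to throw here).
    static boolean subtract(final int[] counts, final int[] indices) {
        boolean valid = true;
        for (final int index : indices) {
            if (--counts[index] < 0) {
                valid = false;
            }
        }
        return valid;
    }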
>
> > We had spoken of implementing an add/delete method pair that would
> > operate on CountingBloomFilters and would add/subtract the counts (e.g.
> > add(CountingBloomFilter) and subtract(CountingBloomFilter)).
> >
> > I disagree with your proposal for the merge(Hasher) implementation, and I
> > am not certain that an add(Hasher) makes sense. First consider that the
> > Hasher returns the bits that are to be enabled in the Bloom filter, so
> > collisions are expected. In the normal course of events a Hasher is used
> > to create a normal Bloom filter where all the duplicates are removed. That
> > filter is then merged into a CountingBloomFilter. So in some sense the
> > Hasher and the normal Bloom filter are the same, and I would expect the
> > merge of the Hasher and the merge of the normal Bloom filter created from
> > that hasher into a CountingBloomFilter to yield the same result. If you
> > wanted to add an add(Hasher)/delete(Hasher) pair of functions to a
> > CountingBloomFilter you could implement them with duplicate counting, but
> > I am not certain of the validity of such a count and I fear that it
> > muddies the waters with respect to what the CountingBloomFilter is
> > counting.
>
> Agreed.
>
> > Claude
> >
> > On Sat, Feb 29, 2020 at 2:13 PM Alex Herbert <alex.d.herb...@gmail.com>
> > wrote:
> >
> >>> On 29 Feb 2020, at 07:46, Claude Warren <cla...@xenei.com> wrote:
> >>>
> >>> Alex, would you take a look at pull request 131 [1]? It adds a new
> >>> hasher implementation and makes the HashFunctionValidator available
> >>> for public use.
> >>>
> >>> https://github.com/apache/commons-collections/pull/131
> >>
> >> OK. I'll take a look.
> >>
> >> I've been thinking about the counting Bloom filter and the backing
> >> storage. In summary:
> >>
> >> 1. The backing storage should be a fixed array.
> >> 2. Merging a Hasher should count duplicate indices, not flatten them all
> >> to a single count.
> >>
> >> For background I've used the standard formulas to estimate the number of
> >> indices that will be non-zero in a Bloom filter. The wikipedia page gives
> >> this formula for the expected proportion of bits set to 0, E(q), if you
> >> have inserted i elements into a filter of size m using k hash functions:
> >>
> >> E(q) = (1 - 1/m)^(ki)
> >>
> >> So a rough guess of the proportion of indices (bits) used by a filter is
> >> 1 - E(q).
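As a quick sanity check of the formula, one row of the table that follows
can be reproduced in a few lines:

    // n=1000, p=1e-4 gives m=19171 bits and k=13 hash functions; after
    // i=100 insertions the expected proportion of zero bits is ~0.9344.
    double m = 19171, k = 13, i = 100;
    double eq = Math.pow(1.0 - 1.0 / m, k * i);
    System.out.printf("E(q) = %.4f, bits set = %.4f%n", eq, 1.0 - eq);
    // prints: E(q) = 0.9344, bits set = 0.0656 (matching the table row)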
> >>
> >> Here is a table of Bloom filters with different collision probabilities
> >> and the proportion of bits that will be set when 1%, 10%, and 100% of
> >> the capacity of the filter has been met:
> >>
> >> n      p      m       k   i      E(q)    bits
> >> 1000   1E-04  19171   13  10     0.9932  0.0068
> >> 1000   1E-04  19171   13  100    0.9344  0.0656
> >> 1000   1E-04  19171   13  1000   0.5076  0.4924
> >> 1000   1E-05  23963   17  10     0.9929  0.0071
> >> 1000   1E-05  23963   17  100    0.9315  0.0685
> >> 1000   1E-05  23963   17  1000   0.4919  0.5081
> >> 1000   1E-06  28756   20  10     0.9931  0.0069
> >> 1000   1E-06  28756   20  100    0.9328  0.0672
> >> 1000   1E-06  28756   20  1000   0.4988  0.5012
> >> 10000  1E-06  287552  20  100    0.9931  0.0069
> >> 10000  1E-06  287552  20  1000   0.9328  0.0672
> >> 10000  1E-06  287552  20  10000  0.4988  0.5012
> >>
> >> The point is that if you create a Bloom filter and fill it to 10% of the
> >> intended capacity, the number of indices used will be about 6-7% of the
> >> filter bits.
> >>
> >> So how to store the counts? Currently the counting Bloom filter uses a
> >> TreeMap<Integer, Integer>. I tried:
> >>
> >> TreeMap<Integer, Integer>
> >> HashMap<Integer, Integer>
> >> TreeSet<MutableCount>
> >> HashSet<MutableCount>
> >> int[]
> >>
> >> The MutableCount is a custom object that stores the bit index, uses it
> >> for a hash code, and then has a mutable integer count field. It allows
> >> the count to be incremented/decremented if the object is in the set:
> >>
> >> static final class MutableCount implements Comparable<MutableCount> {
> >>     final int key;
> >>     int count;
> >>     // etc
> >> }
> >>
> >> This is adapted from the Bag<T> collection which stores an item count
> >> with a MutableInteger. Here the mutable count is part of the same object
> >> T inserted into the Set. So you can find the object, change the count
> >> and not have to put it back into the set. This is more efficient than
> >> changing the Integer stored in a Map.
> >>
> >> I've estimated memory usage using an idea based on this article from
> >> JavaWorld: Java Tip 130: Do you know your data size? [1].
> >>
> >> Basically you:
> >>
> >> - create an object and throw it away, so all classes are initialised
> >> - free memory (run garbage collection) and get the current memory usage
> >> - create a lot of your object (held in an array)
> >> - measure memory usage again
> >> - memory = (after - before) / count
> >>
> >> Here is some example output for n bits set in size m:
> >>
> >> 13107 / 262144 (0.050) : TreeMap<Integer, Integer> =  733947 bytes
> >> 26214 / 262144 (0.100) : TreeMap<Integer, Integer> = 1467866 bytes
> >> 13107 / 262144 (0.050) : TreeSet<MutableCount>     =  838928 bytes
> >> 26214 / 262144 (0.100) : TreeSet<MutableCount>     = 1677776 bytes
> >> 13107 / 262144 (0.050) : HashMap<Integer, Integer> = 1677712 bytes
> >> 26214 / 262144 (0.100) : HashMap<Integer, Integer> = 2306739 bytes
> >> 13107 / 262144 (0.050) : HashSet<MutableCount>     = 1782664 bytes
> >> 26214 / 262144 (0.100) : HashSet<MutableCount>     = 2516656 bytes
> >>     0 / 262144 (0.000) : int[]                     = 1048608 bytes
> >>     0 / 262144 (0.000) : short[]                   =  524320 bytes
> >>
> >> The estimate is accurate to 0.0001% for the arrays so the method is
> >> working. The HashMap was created with the capacity set to the expected
> >> capacity of the filter (m).
> >>
> >> I've chosen these sizes because at 5% full a HashSet is less memory
> >> efficient than using a fixed size array, and at 10% the TreeSet is also
> >> less efficient.
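The measurement loop described above is roughly the following (a sketch
that assumes System.gc() actually reclaims memory; the JavaWorld tip adds
more safeguards than shown here):

    import java.util.TreeMap;

    public class MemoryEstimate {
        public static void main(String[] args) {
            Runtime rt = Runtime.getRuntime();
            new TreeMap<Integer, Integer>();      // initialise classes first
            System.gc();
            long before = rt.totalMemory() - rt.freeMemory();
            int count = 10000;
            Object[] objects = new Object[count]; // hold strong references
            for (int j = 0; j < count; j++) {
                objects[j] = new TreeMap<Integer, Integer>();
            }
            System.gc();
            long after = rt.totalMemory() - rt.freeMemory();
            System.out.println((after - before) / count + " bytes per instance");
        }
    }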
> >>
> >> Note that the java.util.Tree/HashSet versions just wrap a Map and insert
> >> a dummy object for all keys in the Map. So here a Set is not as efficient
> >> as a Map, because in the Map test I always inserted the same Integer
> >> object representing 1. This would be the same as using a Set with an
> >> Integer key, but here the Set had to contain the MutableCount which has
> >> an extra int field and is larger than an Integer.
> >>
> >> These data lead me to think that a counting Bloom filter should just use
> >> a fixed size backing array:
> >>
> >> - If created using the same Shape as a standard Bloom filter it uses a
> >> fixed size. This has a high memory cost when the filter is empty, but
> >> when it exceeds 10% of the intended capacity it is more efficient than a
> >> dynamic backing storage.
> >> - All operations will run in order(n) time for an operation with another
> >> filter with n indices. Each individual index count in the filter will
> >> have order(1) time for access/update. Performance will be limited by the
> >> memory cache of the entire array.
> >>
> >> The issue is that a counting Bloom filter with a very low number of
> >> inserted items will be memory inefficient. But under what circumstance
> >> will such a filter be used in a short-term lifecycle? If it is simply to
> >> merge into another filter then this can be done using a merge with a
> >> Hasher. If counts are to be merged then perhaps we provide a method to
> >> merge counts using the same data structure returned by the
> >> CountingBloomFilter getCounts() method, e.g. using a stream of
> >> <index,count> pairs:
> >>
> >> Stream<int[]> getCounts();
> >> void add(Stream<int[]> counts);
> >>
> >> The issue here is that the Shape and HashFunctionIdentity of the origin
> >> of the merge cannot be validated. So just leave it out and use the merge
> >> with a Hasher.
> >>
> >> On to the next issue with the counting Bloom filter implementation.
> >> Currently when it merges with a Hasher it puts all the indices into a
> >> Set and so will only increment the count by 1 for each index identified
> >> by the Hasher. This appears to miss the entire point of the counting
> >> Bloom filter. If I hash an object to generate k indices I would hope
> >> that I do get k indices. But the hash may not be perfect and I may get
> >> [1, k] indices with some duplications. This is part of the signature of
> >> that object with the given hash. So surely a counting Bloom filter
> >> should accommodate this. If my Hasher generates the same index 20 times
> >> this should result in the count of that index incrementing by 20.
> >>
> >> The result is that if an object is added directly to a counting Bloom
> >> filter using a Hasher it will have a different result than if added to a
> >> standard Bloom filter and then that filter added to the counting Bloom
> >> filter.
> >>
> >> Opinions on this?
> >>
> >> Alex
> >>
> >> [1] http://www.javaworld.com/javaworld/javatips/jw-javatip130.html
> >>
> >>> On Tue, Feb 25, 2020 at 12:35 AM Alex Herbert <alex.d.herb...@gmail.com>
> >>> wrote:
> >>>
> >>>> I have created a PR that contains most of the changes discussed in
> >>>> this thread (see [1]).
> >>>>
> >>>> Please review the changes and comment here or on GitHub.
> >>>>
> >>>> I have left the CountingBloomFilter alone. I will reimplement the
> >>>> add/subtract functionality as discussed in another PR.
> >>>>
> >>>> Alex
> >>>>
> >>>> [1] https://github.com/apache/commons-collections/pull/133
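To make the two merge behaviours debated in the thread concrete, a sketch
(method names and the IntStream source of hasher indices are assumed for
illustration):

    import java.util.stream.IntStream;

    class MergeSketch {
        // Counting duplicates: the same index produced twice by the hasher
        // increments its count by 2, preserving all k generated hashes.
        static void mergeCountingDuplicates(int[] counts, IntStream indices) {
            indices.forEach(i -> counts[i]++);
        }

        // Current behaviour: duplicates are flattened into a set, so each
        // distinct index increments by exactly 1, as a standard Bloom
        // filter built from the same hasher would record it.
        static void mergeFlattened(int[] counts, IntStream indices) {
            indices.distinct().forEach(i -> counts[i]++);
        }
    }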
--
I like: Like Like - The likeliest place on the web
<http://like-like.xenei.com>
LinkedIn: http://www.linkedin.com/in/claudewarren