> On 29 Feb 2020, at 07:46, Claude Warren <cla...@xenei.com> wrote:
>
> Alex, would you take a look at pull request 131 [1]? It adds a new hasher
> implementation and makes the HashFunctionValidator available for public use.
>
> [1] https://github.com/apache/commons-collections/pull/131
OK. I’ll take a look.

I’ve been thinking about the counting Bloom filter and the backing storage. In summary:

1. The backing storage should be a fixed size array.
2. Merging a Hasher should count duplicate indices, not flatten them all to a single count.

For background I’ve used the standard formulas to estimate the number of indices that will be non-zero in a Bloom filter. The Wikipedia page gives this formula for the expected proportion of bits set to 0 (E(q)) if you have inserted i elements into a filter of size m using k hash functions:

E(q) = (1 - 1/m)^(ki)

So a rough guess of the proportion of indices (bits) used by a filter is 1 - E(q). Here is a table for Bloom filters with different intended capacities (n) and collision probabilities (p), showing the proportion of bits that will be set when 1%, 10% and 100% of the capacity of the filter has been met (a small sketch that reproduces these numbers follows the memory results below):

n      p      m       k   i      E(q)    1-E(q)
1000   1E-04  19171   13  10     0.9932  0.0068
1000   1E-04  19171   13  100    0.9344  0.0656
1000   1E-04  19171   13  1000   0.5076  0.4924
1000   1E-05  23963   17  10     0.9929  0.0071
1000   1E-05  23963   17  100    0.9315  0.0685
1000   1E-05  23963   17  1000   0.4919  0.5081
1000   1E-06  28756   20  10     0.9931  0.0069
1000   1E-06  28756   20  100    0.9328  0.0672
1000   1E-06  28756   20  1000   0.4988  0.5012
10000  1E-06  287552  20  100    0.9931  0.0069
10000  1E-06  287552  20  1000   0.9328  0.0672
10000  1E-06  287552  20  10000  0.4988  0.5012

The point is that if you create a Bloom filter and fill it to 10% of the intended capacity, the number of indices used will be about 6-7% of the filter bits.

So how to store the counts? Currently the counting Bloom filter uses a TreeMap<Integer, Integer>. I tried:

TreeMap<Integer, Integer>
HashMap<Integer, Integer>
TreeSet<MutableCount>
HashSet<MutableCount>
int[]

The MutableCount is a custom object that stores the bit index, uses it for the hash code, and has a mutable integer count field. It allows the count to be incremented/decremented while the object is in the set:

static final class MutableCount implements Comparable<MutableCount> {
    final int key;
    int count;

    MutableCount(int key) {
        this.key = key;
    }
    // hashCode() and equals() use only the key;
    // compareTo orders by the key for use in a TreeSet
}

This is adapted from the Bag<T> collection, which stores an item count with a MutableInteger. Here the mutable count is part of the same object T inserted into the Set, so you can find the object and change the count without having to put it back into the set. This is more efficient than replacing the immutable Integer stored in a Map.

I’ve estimated memory usage using an idea based on this article from JavaWorld: Java Tip 130: Do you know your data size? [1]. Basically you:

- create an object and throw it away, so that all classes are initialised
- free memory (run garbage collection) and record the current memory usage
- create a lot of your objects (held in an array)
- measure memory usage again
- memory per object = (after - before) / count

(A rough sketch of this procedure also follows the results below.) Here is some example output for n bits set in a filter of size m:

13107 / 262144 (0.050) : TreeMap<Integer, Integer>  =  733947 bytes
26214 / 262144 (0.100) : TreeMap<Integer, Integer>  = 1467866 bytes
13107 / 262144 (0.050) : TreeSet<MutableCount>      =  838928 bytes
26214 / 262144 (0.100) : TreeSet<MutableCount>      = 1677776 bytes
13107 / 262144 (0.050) : HashMap<Integer, Integer>  = 1677712 bytes
26214 / 262144 (0.100) : HashMap<Integer, Integer>  = 2306739 bytes
13107 / 262144 (0.050) : HashSet<MutableCount>      = 1782664 bytes
26214 / 262144 (0.100) : HashSet<MutableCount>      = 2516656 bytes
    0 / 262144 (0.000) : int[]                      = 1048608 bytes
    0 / 262144 (0.000) : short[]                    =  524320 bytes

The estimate is accurate to within 0.0001% for the arrays, so the method is working. The HashMap was created with its capacity set to the expected capacity of the filter (m).
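For reference, here is a minimal standalone sketch that reproduces the E(q) and 1-E(q) columns for the first few rows of the table above. This is just an illustration of the formula, not code from the PR:

public class ExpectedBitsDemo {
    public static void main(String[] args) {
        // Rows from the first table: m = filter bits, k = hash functions,
        // i = number of inserted elements.
        int[][] rows = { {19171, 13, 10}, {19171, 13, 100}, {19171, 13, 1000} };
        for (int[] r : rows) {
            int m = r[0], k = r[1], i = r[2];
            // E(q) = (1 - 1/m)^(ki), the expected proportion of zero bits
            double eq = Math.pow(1.0 - 1.0 / m, (double) k * i);
            System.out.printf("m=%d k=%d i=%d E(q)=%.4f set=%.4f%n",
                              m, k, i, eq, 1.0 - eq);
        }
    }
}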
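And a rough sketch of the measurement procedure itself (the class and method names here are mine, and the JavaWorld tip has a more careful version):

public class MemoryEstimator {
    public static void main(String[] args) throws Exception {
        final int count = 100;
        // Create one instance and discard it so class loading is done.
        createInstance();

        Object[] hold = new Object[count];
        long before = usedMemory();
        for (int j = 0; j < count; j++) {
            hold[j] = createInstance();
        }
        long after = usedMemory();
        System.out.println("bytes per instance = " + (after - before) / count);
    }

    // Example: a TreeMap with 13107 <index, count> entries spread over
    // 262144 bits, reusing the same Integer for the count as in the test.
    private static Object createInstance() {
        java.util.TreeMap<Integer, Integer> map = new java.util.TreeMap<>();
        Integer one = 1;
        for (int index = 0; index < 13107; index++) {
            map.put(index * 20, one);
        }
        return map;
    }

    private static long usedMemory() throws InterruptedException {
        // Ask for garbage collection a few times and let it settle.
        for (int j = 0; j < 4; j++) {
            System.gc();
            Thread.sleep(50);
        }
        Runtime rt = Runtime.getRuntime();
        return rt.totalMemory() - rt.freeMemory();
    }
}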
I’ve chosen these sizes because at 5% full a HashSet is less memory efficient than using a fixed size array, and at 10% full the TreeSet is also less efficient.

Note that the java.util.TreeSet/HashSet versions just wrap a Map and insert a dummy object for all keys in the Map. So here a Set is not as efficient as a Map, because in the Map test I always inserted the same Integer object representing 1. This would be the same as using a Set with an Integer key, but here the Set had to contain the MutableCount, which has an extra int field and is larger than an Integer.

These data lead me to think that a counting Bloom filter should just use a fixed size backing array:

- If created using the same Shape as a standard Bloom filter it uses a fixed size. This has a high memory cost when the filter is empty, but once it exceeds about 10% of the intended capacity it is more efficient than a dynamic backing store.

- All operations with another filter containing n indices will run in O(n) time. Each individual index count in the filter will have O(1) access/update. Performance will be limited by how well the entire array fits in the memory cache.

The drawback is that a counting Bloom filter with a very low number of inserted items will be memory inefficient. But under what circumstance would such a filter be used with a short-term lifecycle? If it is simply to be merged into another filter then this can be done using a merge with a Hasher. If counts are to be merged then perhaps we provide a method to merge counts using the same data structure returned by the CountingBloomFilter getCounts() method, e.g. using a stream of <index,count> pairs:

Stream<int[]> getCounts();
void add(Stream<int[]> counts);

The issue here is that the Shape and HashFunctionIdentity of the origin of the merge cannot be validated. So just leave it out and use the merge with a Hasher.

That brings me to the next issue with the counting Bloom filter implementation. Currently when it merges with a Hasher it puts all the indices into a Set, and so will only increment the count by 1 for each index identified by the Hasher. This appears to miss the entire point of the counting Bloom filter. If I hash an object to generate k indices I would hope to get k distinct indices. But the hash may not be perfect, and I may get anywhere from 1 to k distinct indices, with some duplicated. This is part of the signature of that object with the given hash, so surely a counting Bloom filter should accommodate it. If my Hasher generates the same index 20 times, this should result in the count of that index incrementing by 20. The result is that an object added directly to a counting Bloom filter using a Hasher can produce a different state than if it were added to a standard Bloom filter and that filter then added to the counting Bloom filter. (A small sketch of this duplicate-counting merge is at the end of this message, below the quoted text.)

Opinions on this?

Alex

[1] http://www.javaworld.com/javaworld/javatips/jw-javatip130.html

> On Tue, Feb 25, 2020 at 12:35 AM Alex Herbert <alex.d.herb...@gmail.com>
> wrote:
>
>> I have created a PR that contains most of the changes discussed in this
>> thread (see [1]).
>>
>> Please review the changes and comment here or on GitHub.
>>
>> I have left the CountingBloomFilter alone. I will reimplement the
>> add/subtract functionality as discussed into another PR.
>>
>> Alex
>>
>> [1] https://github.com/apache/commons-collections/pull/133
>
> --
> I like: Like Like - The likeliest place on the web
> <http://like-like.xenei.com>
> LinkedIn: http://www.linkedin.com/in/claudewarren
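To make the two summary points concrete, here is the sketch referred to above: a minimal, untested illustration of a counting filter backed by a fixed-size int[] whose merge increments a count for every index produced by a hasher, duplicates included. The names are mine for illustration, not the actual commons-collections API:

import java.util.PrimitiveIterator;

// Sketch: fixed-size backing array + duplicate-counting merge.
public class CountingSketch {
    private final int[] counts;

    CountingSketch(int numberOfBits) {
        // Fixed-size storage sized from the Shape, as proposed above.
        counts = new int[numberOfBits];
    }

    // 'indices' stands in for the iterator a Hasher produces for a Shape;
    // it may repeat an index.
    void merge(PrimitiveIterator.OfInt indices) {
        while (indices.hasNext()) {
            // Do NOT collect into a Set first: a duplicate index from the
            // hash is part of the object's signature, so count it each time.
            counts[indices.nextInt()]++;
        }
    }

    public static void main(String[] args) {
        CountingSketch filter = new CountingSketch(8);
        // A hasher that produces index 3 twice:
        filter.merge(java.util.Arrays.stream(new int[] {1, 3, 3, 5}).iterator());
        System.out.println(java.util.Arrays.toString(filter.counts));
        // prints [0, 1, 0, 2, 0, 1, 0, 0] -> index 3 was counted twice
    }
}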