On a slightly different note: CountingBloomFilters have no way to perform a reload. For all of the other Bloom filters you can dump the bits and reload them (trivial), but if you preserve the counts from a counting Bloom filter and want to reload them, you can't. We need a constructor that takes the (index, count) pairs somehow.
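For illustration, a minimal sketch of what such a dump/reload path could look like, assuming the counts are exported as (index, count) pairs; the class and method names here are hypothetical and not the existing commons-collections API:

    import java.util.Map;
    import java.util.TreeMap;

    // Sketch only: a counting filter that can be persisted and rebuilt
    // from previously dumped (index, count) pairs.
    class ReloadableCountingFilter {
        private final int[] counts;

        // Empty filter for the given number of bits.
        ReloadableCountingFilter(int numberOfBits) {
            this.counts = new int[numberOfBits];
        }

        // Reload constructor: restores state from dumped (index, count) pairs.
        ReloadableCountingFilter(int numberOfBits, Map<Integer, Integer> indexCounts) {
            this(numberOfBits);
            indexCounts.forEach((index, count) -> counts[index] = count);
        }

        // Dump the non-zero counts so the filter can be reloaded later.
        Map<Integer, Integer> dumpCounts() {
            final Map<Integer, Integer> dump = new TreeMap<>();
            for (int i = 0; i < counts.length; i++) {
                if (counts[i] != 0) {
                    dump.put(i, counts[i]);
                }
            }
            return dump;
        }
    }

A producer-style callback over the pairs would avoid materialising the intermediate map, but a map keeps the sketch simple.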
On Tue, Mar 17, 2020 at 10:34 PM Claude Warren <cla...@xenei.com> wrote:

> Builder discussion:
>
> Let's go with
>
> >> Builder with(CharSequence, Charset);
> >> Builder withUnencoded(CharSequence);
>
> Shape Discussion:
>
> As for getNumberOfBytes(), it should return the maximum number of bytes returned by a getBits() call to a filter with this shape. So yes, if there is a compressed internal representation, no, it won't be that. It is a method on Shape, so it should literally be Math.ceil( getNumberOfBits() / 8.0 ).
>
> Basically, if you want to create an array that will fit all the bits returned by BloomFilter.iterator() you need an array of Shape.getNumberOfBytes(). And that is actually what I use it for.
>
> Bloom Filter Discussion:
>
> I have not used probability() on a single filter, just on the Shape. However, I can see the value of it. It seems to me that the difference between Shape.probability() and BloomFilter.probability() would give you an indication of how many collisions have already occurred.
>
> In the SetOperations class there is an "estimateSize()" method. It is the only method in the class that takes a single Bloom filter argument, and I think it should be moved into BloomFilter so that we have 2 new methods:
>
> probability()
> estimateSize()
>
> Counting Filter Discussion:
>
> As for counting filters, we could implement several:
>
> int
> short
> byte
>
> Currently they would all have to return int counts, but realistically that is what would be used in code anyway. Also, once primitives can be used in generics this will be easier.
>
> As for contains( Hasher, int ), I think we would need to add contains( BloomFilter, int ). If I understand correctly, contains( BloomFilter, X ) would test that a BloomFilter has been added X times, or rather that there are enough counts in the right places to make it appear that the BloomFilter has been added X times. When used with a Hasher, remove the duplicates and perform the same test.
>
> I see no reason not to add them.
>
> On Tue, Mar 17, 2020 at 6:23 PM Alex Herbert <alex.d.herb...@gmail.com> wrote:
>
>> > On 17 Mar 2020, at 17:06, Claude Warren <cla...@xenei.com> wrote:
>> >
>> > On Tue, Mar 17, 2020 at 4:38 PM Alex Herbert <alex.d.herb...@gmail.com> wrote:
>> >
>> >>> On 17 Mar 2020, at 15:41, Claude Warren <cla...@xenei.com> wrote:
>> >>>
>> >>> I agree with the HashFunction changes.
>> >>
>> >> OK, but which ones?
>> >
>> > DOH! this one...
>> >
>> >> Changing HashFunction to have two methods:
>> >>
>> >> long hash(byte[])
>> >> long increment(int seed)
>>
>> OK. I'll update.
>>
>> >>> I think Builder should have
>> >>> with(byte[])
>> >>> with(byte[], int offset, int len )
>> >>
>> >> Not convinced here. The HashFunction requires a byte[] and cannot operate on a range. This change should be made in conjunction with a similar change to HashFunction. So should we update HashFunction to:
>> >
>> > Given the depth of the change, let's just leave the with( byte[] ).
>> >
>> >>> with(String)
>> >>>
>> >>> I find that I use with(String) more than any other with() method.
>> >>
>> >> That may be so, but String.getBytes(Charset) is trivial for the user to call. Then they get to decide on the encoding and not leave it to the Hasher. I would use UTF-16 because it would be fast, but UTF-8 is nice as a cross-language standard.
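As an aside on the encoding point, a small sketch showing why the charset has to be fixed by the caller (or by the API): the same string produces different byte arrays, and therefore different hashes, under UTF-8 and UTF-16LE. The class name is just for the example:

    import java.nio.charset.StandardCharsets;
    import java.util.Arrays;

    public class CharsetBytesDemo {
        public static void main(String[] args) {
            final String item = "café";
            // UTF-8: 5 bytes here, because 'é' needs two bytes.
            final byte[] utf8 = item.getBytes(StandardCharsets.UTF_8);
            // UTF-16LE: 8 bytes, two per char, no byte-order mark.
            final byte[] utf16 = item.getBytes(StandardCharsets.UTF_16LE);
            System.out.println(Arrays.toString(utf8));   // [99, 97, 102, -61, -87]
            System.out.println(Arrays.toString(utf16));  // [99, 0, 97, 0, 102, 0, -23, 0]
            // Different bytes mean different hashes, so filters built with
            // different charsets cannot be queried against each other.
        }
    }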
>> >> Leave it out of the API for now, or add both:
>> >>
>> >> Builder with(CharSequence, Charset);
>> >> Builder withUnencoded(CharSequence);
>> >
>> > CharSequence has no easy method to convert to a byte[]. While it could be done, it looks to be more of a streaming interface. Let's leave that out.
>>
>> I was thinking:
>>
>>     /**
>>      * Adds a character sequence item to the hasher using the specified encoding.
>>      *
>>      * @param item the item to add
>>      * @param charset the character set
>>      * @return a reference to this object
>>      */
>>     default Builder with(CharSequence item, Charset charset) {
>>         return with(item.toString().getBytes(charset));
>>     }
>>
>>     /**
>>      * Adds a character sequence item to the hasher. Each 16-bit character is
>>      * converted to 2 bytes using little-endian order.
>>      *
>>      * @param item the item to add
>>      * @return a reference to this object
>>      */
>>     default Builder withUnencoded(CharSequence item) {
>>         final int length = item.length();
>>         final byte[] bytes = new byte[length * 2];
>>         for (int i = 0; i < length; i++) {
>>             final char ch = item.charAt(i);
>>             bytes[i * 2] = (byte) ch;
>>             bytes[i * 2 + 1] = (byte) (ch >>> 8);
>>         }
>>         return with(bytes);
>>     }
>>
>> >> I would argue that you may use BloomFilters for Strings, but if we see a BloomFilter as a collection then we should really support all Objects (with a decorator or by typing the Builder) or not support Objects. Currently we are not supporting any Object, so for now I would drop this, and the Hasher.Builder then becomes a very simple API that specifies that you put in items represented as a byte[] and call build to create a Hasher containing those items, and reset it for further use.
>> >
>> > I have code examples in several places where I hash GeoCode entities. Since they are comprised of strings, for the most part, building a hasher for them simply requires hashing the Strings. Many web services use JSON and most JSON is string based. I disagree with removing with(String) because it is so convenient in so many cases. It also makes the code cleaner/easier to read. But if you feel strongly about removing it then OK.
>>
>> So with(String) is replaced by a default with(CharSequence, Charset) method.
>>
>> > The only other thing that is really bothersome is the lack of Shape.getNumberOfBytes(). Yes, it is easy to call Math.ceil( Shape.getNumberOfBits() / 8.0 ), but getNumberOfBytes() is much more readable in the code.
>>
>> I removed this as the number of bytes is implementation dependent, for example if using a compressed BloomFilter.
>>
>> If the number of bytes is intended to be for the uncompressed representation then this does not match up with the canonical format of the BloomFilter using a long[].
>>
>> Q. Why do you need the number of bytes?
>>
>> One thing we do not have is the computation of the false positive probability using the current cardinality (this is in the Guava code):
>>
>>     p(false positive) = (cardinality() / m) ^ k
>>
>> which is just the probability that you will choose all k indexes as ones that are already set. This is not on the wikipedia page. Is this something you have used?
>>
>> Also on the wikipedia page is a section on counting Bloom filters where each bucket is stated as typically having 3-4 bits. Using 4 bits would allow counting up to 15 for each index. It would be more space efficient than the current use of 32-bit integers.
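To make the 3-4 bit bucket idea concrete, here is a minimal sketch of 4-bit counters packed two per byte, assuming counts saturate at 15; the names are hypothetical and this is not the existing ArrayCountingBloomFilter:

    // Sketch only: 4-bit ("nibble") counters, two per byte.
    class NibbleCounters {
        private final byte[] packed;

        NibbleCounters(int numberOfCounters) {
            this.packed = new byte[(numberOfCounters + 1) / 2];
        }

        int get(int index) {
            final int shift = (index & 1) * 4;
            return (packed[index >> 1] >>> shift) & 0xF;
        }

        // Returns false on overflow: 4 bits cannot count past 15.
        boolean increment(int index) {
            final int value = get(index);
            if (value == 0xF) {
                return false;
            }
            final int shift = (index & 1) * 4;
            final int cleared = packed[index >> 1] & ~(0xF << shift);
            packed[index >> 1] = (byte) (cleared | ((value + 1) << shift));
            return true;
        }
    }

The masking and shifting is exactly the fiddly part that makes a plain byte[] attractive.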
>> Changing to 8-bit bytes is a simple compromise. There is also a main page on counting Bloom filters [1].
>>
>> For a typical shape with p = 1e-6 and n items we have:
>>
>> n = 1000,  m = 28756,  k = 20
>> n = 10000, m = 287552, k = 20
>>
>> The worst case scenario is that every item added sets the same index, and the count could then be equal to n. But if the hash is uniform then the count for each bit will be much lower. The count for a bit is taken from the binomial distribution with number of trials = nk and probability of success = 1/m. Here are the expected counts (l) for an index when the filter is filled with n items (p is just used to set the optimal k and m using n):
>>
>>     n      k   m       p         l    binom(l, nk, 1/m)
>>     1000   20  28756   1.00E-06  0    0.498815437
>>     1000   20  28756   1.00E-06  1    0.346941705
>>     1000   20  28756   1.00E-06  2    0.12064836
>>     1000   20  28756   1.00E-06  4    0.004862559
>>     1000   20  28756   1.00E-06  8    6.76619E-07
>>     1000   20  28756   1.00E-06  16   7.10853E-17
>>     1000   20  28756   1.00E-06  32   1.66389E-41
>>     1000   20  28756   1.00E-06  64   2.8772E-100
>>     1000   20  28756   1.00E-06  127  1.0449E-234
>>     10000  20  287552  1.00E-06  0    0.498811214
>>     10000  20  287552  1.00E-06  1    0.346937562
>>     10000  20  287552  1.00E-06  2    0.120651928
>>     10000  20  287552  1.00E-06  4    0.004863763
>>     10000  20  287552  1.00E-06  8    6.77448E-07
>>     10000  20  287552  1.00E-06  16   7.14657E-17
>>     10000  20  287552  1.00E-06  32   1.70126E-41
>>     10000  20  287552  1.00E-06  64   3.15E-100
>>     10000  20  287552  1.00E-06  127  1.4984E-234
>>
>> So being able to store counts up to the int max value is total overkill. We could use 4 bits per counter, but this is harder to do with the current design that uses the sign of the entries as a guard digit for overflow or negative counts. A simple switch is to drop to a byte[] and allow counts up to 127. Any overflow would mark the state invalid. This makes the ArrayCountingBloomFilter 4 times more space efficient.
>>
>> Note that a counting Bloom filter is also described as having functionality we do not currently have. We can test for membership of a Hasher (e.g. an item) within the CountingBloomFilter, but not query for a set number of the same item. Should this be added to the interface:
>>
>>     boolean contains(Hasher hasher, int count)
>>
>> Here the Hasher is an issue, as it can represent multiple items which could overlap indexes. In this case the 'contains a specified count' method is invalid, as the expected count for those indexes that are overlapping would be higher.
>>
>> [1] https://en.wikipedia.org/wiki/Counting_Bloom_filter
>> >
>> > Claude

--
I like: Like Like - The likeliest place on the web <http://like-like.xenei.com>
LinkedIn: http://www.linkedin.com/in/claudewarren
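To round off the counting discussion above, a minimal sketch of the byte[] variant where each counter holds 0..127 and any overflow or underflow simply marks the whole filter invalid; the names are hypothetical, not the ArrayCountingBloomFilter implementation:

    // Sketch only: byte-backed counters with an invalid-state flag.
    class ByteCounters {
        private final byte[] counts;
        private boolean valid = true;

        ByteCounters(int numberOfBits) {
            this.counts = new byte[numberOfBits];
        }

        // Add one to each index of an item; passing 127 invalidates the filter.
        void add(int[] indexes) {
            for (final int i : indexes) {
                if (counts[i] == Byte.MAX_VALUE) {
                    valid = false;   // overflow: counts can no longer be trusted
                } else {
                    counts[i]++;
                }
            }
        }

        // Subtract one from each index; going below zero invalidates the filter.
        void remove(int[] indexes) {
            for (final int i : indexes) {
                if (counts[i] == 0) {
                    valid = false;   // removing an item that was never added
                } else {
                    counts[i]--;
                }
            }
        }

        boolean isValid() {
            return valid;
        }
    }

Backing the counts with a byte[] rather than an int[] is what gives the fourfold space saving mentioned above.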