> On 17 Mar 2020, at 17:06, Claude Warren <cla...@xenei.com> wrote:
>
> On Tue, Mar 17, 2020 at 4:38 PM Alex Herbert <alex.d.herb...@gmail.com>
> wrote:
>
>>
>>
>>> On 17 Mar 2020, at 15:41, Claude Warren <cla...@xenei.com> wrote:
>>>
>>> I agree with the HashFunction changes.
>>
>> OK, but which ones?
>>
>
> DOH! this one...
>
>>
>> Changing HashFunction to have two methods:
>>
>> long hash(byte[])
>> long increment(int seed)
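
For reference, I read that as something along these lines (a sketch only: the
two signatures are the ones proposed above, the javadoc wording is mine and the
exact contract for the seed still needs to be agreed):

public interface HashFunction {

    /**
     * Generates a hash value from the item represented as a byte array.
     *
     * @param bytes the item bytes
     * @return the hash
     */
    long hash(byte[] bytes);

    /**
     * Generates a value used to modify the hash for each subsequent index.
     * (Exact contract for the seed is still to be documented.)
     *
     * @param seed the seed, for example the index of the hash being generated
     * @return the increment
     */
    long increment(int seed);
}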
OK. I’ll update.

>>
>>> I think Builder should have
>>> with(byte[])
>>> with(byte[], int offset, int len )
>>
>> Not convinced here. The HashFunction requires a byte[] and cannot operate
>> on a range. This change should be made in conjunction with a similar change
>> to HashFunction. So should we update HashFunction to:
>>
>
> Given the depth of the change let's just leave the with( byte[] )
>
>
>>> with(String)
>>>
>>> I find that I use with(String) more than any other with() method.
>>
>> That may be so but String.getBytes(Charset) is trivial to call for the
>> user. Then they get to decide on the encoding and not leave it to the
>> Hasher. I would use UTF-16 because it would be fast. But UTF-8 is nice as a
>> cross-language standard. Leave it out of the API for now, or add both:
>>
>> Builder with(CharSequence, Charset);
>> Builder withUnencoded(CharSequence);
>>
>
> CharSequence has no easy method to convert to a byte[]. While it could be
> done, it looks to be more of a streaming interface. Let's leave that out.

I was thinking:

/**
 * Adds a character sequence item to the hasher using the specified encoding.
 *
 * @param item the item to add
 * @param charset the character set
 * @return a reference to this object
 */
default Builder with(CharSequence item, Charset charset) {
    return with(item.toString().getBytes(charset));
}

/**
 * Adds a character sequence item to the hasher. Each 16-bit character is
 * converted to 2 bytes using little-endian order.
 *
 * @param item the item to add
 * @return a reference to this object
 */
default Builder withUnencoded(CharSequence item) {
    final int length = item.length();
    final byte[] bytes = new byte[length * 2];
    for (int i = 0; i < length; i++) {
        final char ch = item.charAt(i);
        bytes[i * 2] = (byte) ch;
        bytes[i * 2 + 1] = (byte) (ch >>> 8);
    }
    return with(bytes);
}

>
>
>> I would argue that you may use BloomFilters for Strings but if we see a
>> BloomFilter as a collection then we should really support all Objects (with
>> a decorator or by typing the Builder) or not support Objects. Currently we
>> are not supporting any Object so for now would drop this and the
>> Hasher.Builder then becomes a very simple API that specifies that you put
>> in items represented as a byte[] and call build to create a Hasher
>> containing those items and reset for further use.
>>
>
> I have code examples in several places where I hash GeoCode entities. Since
> they are comprised of strings, for the most part, building a hasher for
> them simply requires hashing the Strings. Many web services use JSON and
> most JSON is string based. I disagree with removing with(String) because
> it is so convenient in so many cases. It also makes the code
> cleaner/easier to read. But if you feel strongly about removing it then OK.

So with(String) is replaced by a default with(CharSequence, Charset) method.

>
> The only other thing that is really bothersome is the lack of
> Shape.getNumberOfBytes(). Yes it is easy to call
> Math.ceil( Shape.getNumberOfBits() / 8.0 ). But getNumberOfBytes() is much
> more readable in the code.

I removed this as the number of bytes is implementation dependent, for example
if using a compressed BloomFilter. If the number of bytes is intended to be for
the uncompressed representation then this does not match up with the canonical
format of the BloomFilter using a long[].

Q. Why do you need the number of bytes?
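
If it is only needed to size a byte[] for the uncompressed bit representation,
a caller can compute it from the existing Shape.getNumberOfBits() with integer
arithmetic, e.g. (a hypothetical caller-side helper, not part of the API):

static int numberOfBytes(Shape shape) {
    // Round the bit count up to the next whole byte without floating point.
    return (shape.getNumberOfBits() + 7) / 8;
}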
One thing we do not have is the computation of the false positive probability
using the current cardinality (this is in the Guava code):

p(false positive) = (cardinality() / m) ^ k

where m is the number of bits and k the number of hash functions. This is just
the probability that you will choose all k indexes as ones that are already
set. It is not on the wikipedia page. Is this something you have used?

Also on the wikipedia page is a section on counting Bloom filters where each
bucket is stated as typically having 3-4 bits. Using 4 bits would allow counts
up to 15 for each index. It would be more space efficient than the current use
of 32-bit integers. Changing to 8-bit bytes is a simple compromise. There is
also a main page on counting Bloom filters [1].

For a typical shape with p = 1e-6 and n items we have:

n = 1000, m = 28756, k = 20
n = 10000, m = 287552, k = 20

The worst case scenario is that each item added sets the same index and the
count could be equal to n. But if the hash is uniform then the count for each
bit will be much lower. The expected count is taken from the binomial
distribution using number of trials = nk and probability of success = 1/m.
Here are the expected counts (l) for an index when the filter is filled with
n items (p is just used to set the optimal k and m using n):

n      k   m       p         l    binom(l, nk, 1/m)
1000   20  28756   1.00E-06  0    0.498815437
1000   20  28756   1.00E-06  1    0.346941705
1000   20  28756   1.00E-06  2    0.12064836
1000   20  28756   1.00E-06  4    0.004862559
1000   20  28756   1.00E-06  8    6.76619E-07
1000   20  28756   1.00E-06  16   7.10853E-17
1000   20  28756   1.00E-06  32   1.66389E-41
1000   20  28756   1.00E-06  64   2.8772E-100
1000   20  28756   1.00E-06  127  1.0449E-234
10000  20  287552  1.00E-06  0    0.498811214
10000  20  287552  1.00E-06  1    0.346937562
10000  20  287552  1.00E-06  2    0.120651928
10000  20  287552  1.00E-06  4    0.004863763
10000  20  287552  1.00E-06  8    6.77448E-07
10000  20  287552  1.00E-06  16   7.14657E-17
10000  20  287552  1.00E-06  32   1.70126E-41
10000  20  287552  1.00E-06  64   3.15E-100
10000  20  287552  1.00E-06  127  1.4984E-234

So being able to store counts up to the int max value is total overkill. We
could use 4 bits per counter. But this is harder to do with the current design
that uses the sign of the entries as a guard digit for overflow or negative
counts. A simple switch is to drop to a byte[] and allow counts up to 127. Any
overflow would mark the state invalid (a rough sketch is at the end of this
mail). This makes the ArrayCountingBloomFilter 4 times more space efficient.

Note that a counting Bloom filter is also described as having functionality we
do not currently have. We can test for membership of a Hasher (e.g. an item)
within the CountingBloomFilter but not query for a set number of the same item.
Should this be added to the interface:

boolean contains(Hasher hasher, int count)

Here the Hasher is an issue as it can represent multiple items which could
overlap indexes. In that case the 'contains a specified count' query is invalid
as the expected count for the overlapping indexes would be higher.

[1] https://en.wikipedia.org/wiki/Counting_Bloom_filter

>
> Claude
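
P.S. To illustrate the byte[] counter idea above, a rough sketch (the class and
member names are made up for illustration, this is not the current
ArrayCountingBloomFilter code):

/**
 * Rough sketch: one signed byte per bit index, using the sign bit as the
 * guard for overflow (count above 127) or underflow (count below 0).
 */
class ByteCounters {
    /** One signed byte per bit index; valid counts are 0..127. */
    private final byte[] counts;
    /** Set when any counter overflows or underflows. */
    private boolean invalid;

    ByteCounters(int numberOfBits) {
        counts = new byte[numberOfBits];
    }

    void increment(int index) {
        counts[index]++;
        // Wrapping past 127 produces a negative value.
        invalid |= counts[index] < 0;
    }

    void decrement(int index) {
        counts[index]--;
        // Decrementing a zero count produces a negative value.
        invalid |= counts[index] < 0;
    }

    /** @return true if no counter has ever overflowed or underflowed */
    boolean isValid() {
        return !invalid;
    }
}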