> On 17 Mar 2020, at 17:06, Claude Warren <cla...@xenei.com> wrote:
> 
> On Tue, Mar 17, 2020 at 4:38 PM Alex Herbert <alex.d.herb...@gmail.com>
> wrote:
> 
>> 
>> 
>>> On 17 Mar 2020, at 15:41, Claude Warren <cla...@xenei.com> wrote:
>>> 
>>> I agree with the HashFunction changes.
>> 
>> OK, but which ones?
>> 
> 
> DOH! this one...
> 
>> 
>> Changing HashFunction to have two methods:
>> 
>> long hash(byte[])
>> long increment(int seed)

OK. I’ll update.
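A sketch of the resulting interface (the demo implementation is illustrative only, not the real Murmur-based code):

```java
import java.util.Arrays;

/** Sketch of the two-method HashFunction discussed above. */
interface HashFunction {
    /** Computes the hash of the item bytes. */
    long hash(byte[] bytes);

    /** Computes the next hash in the series from the given seed. */
    long increment(int seed);
}

/** Illustrative implementation only; a real one would wrap e.g. Murmur128. */
class DemoHashFunction implements HashFunction {
    @Override
    public long hash(byte[] bytes) {
        return Arrays.hashCode(bytes); // stand-in for a proper 64-bit hash
    }

    @Override
    public long increment(int seed) {
        return seed * 0x9E3779B97F4A7C15L; // golden-ratio multiplier, illustrative
    }
}
```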

>> 
>>> I think Builder should have
>>> with(byte[])
>>> with(byte[], int offset, int len )
>> 
>> Not convinced here. The HashFunction requires a byte[] and cannot operate
>> on a range. This change should be made in conjunction with a similar change
>> to HashFunction. So should we update HashFunction to:
>> 
>> 
> Given the depth of the change let's just leave the with( byte[] )
> 
> 
>>> with(String)
>>> 
>>> I find that I use with(String) more than any other with() method.
>> 
>> That may be so but String.getBytes(Charset) is trivial to call for the
>> user. Then they get to decide on the encoding and not leave it to the
>> Hasher. I would use UTF-16 because it would be fast. But UTF-8 is nice as a
>> cross-language standard. Leave it out of the API for now, or add both:
>> 
>> Builder with(CharSequence, Charset);
>> Builder withUnencoded(CharSequence);
>> 
> 
> CharSequence has no easy method to convert to a byte[]. While it could be
> done, it looks to be more of a streaming interface.  Let's leave that out.

I was thinking:

        /**
         * Adds a character sequence item to the hasher using the specified
         * encoding.
         *
         * @param item the item to add
         * @param charset the character set
         * @return a reference to this object
         */
        default Builder with(CharSequence item, Charset charset) {
            return with(item.toString().getBytes(charset));
        }

        /**
         * Adds a character sequence item to the hasher. Each 16-bit character
         * is converted to 2 bytes using little-endian order.
         *
         * @param item the item to add
         * @return a reference to this object
         */
        default Builder withUnencoded(CharSequence item) {
            final int length = item.length();
            final byte[] bytes = new byte[length * 2];
            for (int i = 0; i < length; i++) {
                final char ch = item.charAt(i);
                bytes[i * 2] = (byte) ch;
                bytes[i * 2 + 1] = (byte) (ch >>> 8);
            }
            return with(bytes);
        }

> 
> 
>> I would argue that you may use BloomFilters for Strings but if we see a
>> BloomFilter as a collection then we should really support all Objects (with
>> a decorator or by typing the Builder) or not support Objects. Currently we
>> are not supporting any Object so for now would drop this and the
>> Hasher.Builder then becomes a very simple API that specifies that you put
>> in items represented as a byte[] and call build to create a Hasher
>> containing those items and reset for further use.
>> 
> 
> I have code examples in several places where I hash GeoCode entities.  Since
> they are composed of strings, for the most part, building a hasher for
> them simply requires hashing the Strings.  Many web services use JSON and
> most JSON is string based.  I disagree with removing with(String) because
> it is so convenient in so many cases.  It also makes the code
> cleaner/easier to read.  But if you feel strongly about removing it then OK.

So with(String) is replaced by a default with(CharSequence, Charset) method.
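For illustration, the two forms produce these bytes for the same text (standalone sketch; `unencoded` mirrors the withUnencoded conversion above):

```java
import java.nio.charset.StandardCharsets;

public class WithCharsetDemo {
    /** Mirrors withUnencoded: 2 bytes per char, little-endian. */
    static byte[] unencoded(CharSequence item) {
        final int length = item.length();
        final byte[] bytes = new byte[length * 2];
        for (int i = 0; i < length; i++) {
            final char ch = item.charAt(i);
            bytes[i * 2] = (byte) ch;
            bytes[i * 2 + 1] = (byte) (ch >>> 8);
        }
        return bytes;
    }

    public static void main(String[] args) {
        // The caller now chooses the encoding explicitly:
        final byte[] viaCharset = "GeoCode".getBytes(StandardCharsets.UTF_8);
        final byte[] raw = unencoded("GeoCode");
        // For ASCII text UTF-8 is half the size of the unencoded form:
        System.out.println(viaCharset.length); // 7
        System.out.println(raw.length);        // 14
    }
}
```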

> 
> The only other thing that is really bothersome is the lack of
> Shape.getNumberOfBytes().  Yes it is easy to call
> Math.ceil(Shape.getNumberOfBits() / 8.0).  But getNumberOfBytes() is much
> more readable in the code.

I removed this as the number of bytes is implementation dependent, for example 
if using a compressed BloomFilter.

If the number of bytes is intended to be for the uncompressed representation 
then this does not match up with the canonical format of the BloomFilter using 
a long[].

Q. Why do you need the number of bytes?
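For what it's worth, if the method were kept for the uncompressed representation, integer arithmetic avoids the floating-point round trip (sketch):

```java
public class NumberOfBytesDemo {
    /** Ceiling division without Math.ceil and the double conversion. */
    static int numberOfBytes(int numberOfBits) {
        return (numberOfBits + 7) / 8;
    }

    public static void main(String[] args) {
        // Same result as (int) Math.ceil(28756 / 8.0):
        System.out.println(numberOfBytes(28756)); // 3595
    }
}
```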


One thing we do not have is the computation of the false positive probability 
using the current cardinality (this is in the Guava code):

p(false positive) = (cardinality() / m) ^ k

This is just the probability that you will choose all k indexes as ones that 
are already set. It is not on the wikipedia page. Is this something you have 
used?
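A sketch of that computation (method name assumed; cardinality is the number of set bits):

```java
public class CurrentFalsePositiveDemo {
    /**
     * Probability that all k chosen indexes hit bits that are already set,
     * given the current cardinality of the filter.
     */
    static double currentFalsePositive(int cardinality, int m, int k) {
        return Math.pow((double) cardinality / m, k);
    }

    public static void main(String[] args) {
        // Example: a half-full filter with k = 20 gives (0.5)^20 ~ 9.5e-7
        System.out.println(currentFalsePositive(14378, 28756, 20));
    }
}
```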


Also on the wikipedia page is a section on counting Bloom filters where each 
bucket is stated as typically having 3-4 bits. Using 4 bits would allow counts 
up to 15 for each index. That would be more space efficient than the current 
use of 32-bit integers. Changing to 8-bit bytes is a simple compromise. There 
is also a main page on counting Bloom filters [1].

For a typical shape with p = 1e-6 and n items we have:

n = 1000, m = 28756, k = 20
n = 10000, m = 287552, k = 20

The worst case scenario is that each item added sets the same index, so the 
count could equal n. But if the hash is uniform then the count for each bit 
will be much lower. The count for an index follows the binomial distribution 
with number of trials = nk and probability of success = 1/m. Here are the 
probabilities for the count (l) of an index when the filter is filled with n 
items (p is just used to set the optimal k and m using n):

n       k     m        p           l    binom(l, nk, 1/m)
1000    20    28756    1.00E-06    0    0.498815437
1000    20    28756    1.00E-06    1    0.346941705
1000    20    28756    1.00E-06    2    0.12064836
1000    20    28756    1.00E-06    4    0.004862559
1000    20    28756    1.00E-06    8    6.76619E-07
1000    20    28756    1.00E-06    16   7.10853E-17
1000    20    28756    1.00E-06    32   1.66389E-41
1000    20    28756    1.00E-06    64   2.8772E-100
1000    20    28756    1.00E-06    127  1.0449E-234
10000   20    287552   1.00E-06    0    0.498811214
10000   20    287552   1.00E-06    1    0.346937562
10000   20    287552   1.00E-06    2    0.120651928
10000   20    287552   1.00E-06    4    0.004863763
10000   20    287552   1.00E-06    8    6.77448E-07
10000   20    287552   1.00E-06    16   7.14657E-17
10000   20    287552   1.00E-06    32   1.70126E-41
10000   20    287552   1.00E-06    64   3.15E-100
10000   20    287552   1.00E-06    127  1.4984E-234
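These values can be cross-checked with a log-space binomial pmf (sketch):

```java
public class BinomialPmfDemo {
    /** Binomial pmf P(X = l) for X ~ Binomial(trials, p), computed in log space. */
    static double binomialPmf(int l, int trials, double p) {
        double logPmf = l * Math.log(p) + (trials - l) * Math.log1p(-p);
        for (int i = 1; i <= l; i++) {
            logPmf += Math.log((double) (trials - l + i) / i); // adds log C(trials, l)
        }
        return Math.exp(logPmf);
    }

    public static void main(String[] args) {
        final int n = 1000, k = 20, m = 28756;
        // Probability of a given count l at one index after n insertions:
        System.out.println(binomialPmf(0, n * k, 1.0 / m)); // ~0.4988
        System.out.println(binomialPmf(1, n * k, 1.0 / m)); // ~0.3469
        System.out.println(binomialPmf(8, n * k, 1.0 / m)); // ~6.8e-7
    }
}
```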

So being able to store counts up to Integer.MAX_VALUE is total overkill. We 
could use 4 bits per counter, but this is harder to do with the current design, 
which uses the sign of entries as a guard digit for overflow or negative 
counts. A simple switch is to drop to a byte[] and allow counts up to 127. Any 
overflow would mark the state invalid. This makes the ArrayCountingBloomFilter 
4 times more space efficient.
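A sketch of the byte[] approach (hypothetical names; overflow past 127 marks the state invalid as described):

```java
public class ByteCounterDemo {
    private final byte[] counts;
    private boolean valid = true;

    ByteCounterDemo(int m) {
        counts = new byte[m];
    }

    /** Increments the count at each index; overflow invalidates the filter state. */
    void add(int[] indexes) {
        for (final int i : indexes) {
            if (counts[i] == Byte.MAX_VALUE) {
                valid = false; // count would exceed 127
            } else {
                counts[i]++;
            }
        }
    }

    boolean isValid() {
        return valid;
    }

    public static void main(String[] args) {
        final ByteCounterDemo filter = new ByteCounterDemo(16);
        for (int i = 0; i < 127; i++) {
            filter.add(new int[] {3});
        }
        System.out.println(filter.isValid()); // true: exactly at 127
        filter.add(new int[] {3});
        System.out.println(filter.isValid()); // false: the 128th increment overflows
    }
}
```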

Note that a counting Bloom filter is also described to have functionality we do 
not currently have. We can test for membership of a Hasher (e.g. an item) 
within the CountingBloomFilter but not query for a set number of the same item. 
Should this be added to the interface:

boolean contains(Hasher hasher, int count)

Here the Hasher is an issue as it can represent multiple items whose indexes 
could overlap. In that case the 'contains a specified count' method is invalid, 
as the expected count for the overlapping indexes would be higher.
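For the single-item case the check is straightforward (sketch with plain arrays in place of the Hasher and filter types):

```java
public class ContainsCountDemo {
    /** Returns true if every index has a stored count of at least the given count. */
    static boolean contains(int[] indexes, byte[] counts, int count) {
        for (final int i : indexes) {
            if (counts[i] < count) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        final byte[] counts = new byte[8];
        // One item with indexes {1, 3} added twice:
        counts[1] = 2;
        counts[3] = 2;
        System.out.println(contains(new int[] {1, 3}, counts, 2)); // true
        System.out.println(contains(new int[] {1, 3}, counts, 3)); // false
    }
}
```

A multi-item Hasher breaks this: a shared index accumulates counts from several items, so a per-index threshold no longer corresponds to the count of any single item.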

[1] https://en.wikipedia.org/wiki/Counting_Bloom_filter
> 
> Claude
