On a slightly different note: CountingBloomFilters have no way to perform a reload. For all of the other Bloom filters you can dump the bits and reload them (trivial), but if you preserve the counts from a counting Bloom filter and want to reload them, you can't. We need a constructor that takes the (index, count) pairs somehow.
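For illustration, a minimal sketch of what such a dump/reload path could look like, assuming the counts are exported as (index, count) pairs; the class and method names here are hypothetical and not the existing commons-collections API:

    import java.util.Map;
    import java.util.TreeMap;

    // Sketch only: a counting filter that can be persisted and rebuilt
    // from previously dumped (index, count) pairs.
    class ReloadableCountingFilter {
        private final int[] counts;

        // Empty filter for the given number of bits.
        ReloadableCountingFilter(int numberOfBits) {
            this.counts = new int[numberOfBits];
        }

        // Reload constructor: restores state from dumped (index, count) pairs.
        ReloadableCountingFilter(int numberOfBits, Map<Integer, Integer> indexCounts) {
            this(numberOfBits);
            indexCounts.forEach((index, count) -> counts[index] = count);
        }

        // Dump the non-zero counts so the filter can be reloaded later.
        Map<Integer, Integer> dumpCounts() {
            final Map<Integer, Integer> dump = new TreeMap<>();
            for (int i = 0; i < counts.length; i++) {
                if (counts[i] != 0) {
                    dump.put(i, counts[i]);
                }
            }
            return dump;
        }
    }

A producer-style callback over the pairs would avoid materialising the intermediate map, but a map keeps the sketch simple.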
On Tue, Mar 17, 2020 at 10:34 PM Claude Warren <cla...@xenei.com> wrote:

> Builder discussion:
>
> Let's go with
>
> >> Builder with(CharSequence, Charset);
> >> Builder withUnencoded(CharSequence);
>
> Shape Discussion:
>
> As for getNumberOfBytes(), it should return the maximum number of bytes returned by a getBits() call to a filter with this shape. So yes, if there is a compressed internal representation, no, it won't be that. It is a method on Shape, so it should literally be Math.ceil( getNumberOfBits() / 8.0 ).
>
> Basically, if you want to create an array that will fit all the bits returned by BloomFilter.iterator() you need an array of Shape.getNumberOfBytes(). And that is actually what I use it for.
>
> Bloom Filter Discussion:
>
> I have not used probability() on a single filter, just on the Shape. However, I can see the value of it. It seems to me that the difference between Shape.probability() and BloomFilter.probability() would give you an indication of how many collisions have already occurred.
>
> In the SetOperations class there is an "estimateSize()" method. It is the only method in the class that takes a single Bloom filter argument, and I think it should be moved into BloomFilter so that we have 2 new methods:
>
> probability()
> estimateSize()
>
> Counting Filter Discussion:
>
> As for counting filters, we could implement several:
>
> int
> short
> byte
>
> Currently they would all have to return int counts, but realistically that is what would be used in code anyway. Also, once primitives can be used in generics this will be easier.
>
> As for contains( Hasher, int ), I think we would need to add contains( BloomFilter, int ). If I understand correctly, contains( BloomFilter, X ) would test that a BloomFilter has been added X times, or rather that there are enough counts in the right places to make it appear that the BloomFilter has been added X times. When used with a Hasher, remove the duplicates and perform the same test.
>
> I see no reason not to add them.
>
> On Tue, Mar 17, 2020 at 6:23 PM Alex Herbert <alex.d.herb...@gmail.com> wrote:
>
>> > On 17 Mar 2020, at 17:06, Claude Warren <cla...@xenei.com> wrote:
>> >
>> > On Tue, Mar 17, 2020 at 4:38 PM Alex Herbert <alex.d.herb...@gmail.com> wrote:
>> >
>> >>> On 17 Mar 2020, at 15:41, Claude Warren <cla...@xenei.com> wrote:
>> >>>
>> >>> I agree with the HashFunction changes.
>> >>
>> >> OK, but which ones?
>> >
>> > DOH! this one...
>> >
>> >> Changing HashFunction to have two methods:
>> >>
>> >> long hash(byte[])
>> >> long increment(int seed)
>>
>> OK. I'll update.
>>
>> >>> I think Builder should have
>> >>> with(byte[])
>> >>> with(byte[], int offset, int len )
>> >>
>> >> Not convinced here. The HashFunction requires a byte[] and cannot operate on a range. This change should be made in conjunction with a similar change to HashFunction. So should we update HashFunction to:
>> >
>> > Given the depth of the change, let's just leave the with( byte[] ).
>> >
>> >>> with(String)
>> >>>
>> >>> I find that I use with(String) more than any other with() method.
>> >>
>> >> That may be so, but String.getBytes(Charset) is trivial for the user to call. Then they get to decide on the encoding and not leave it to the Hasher. I would use UTF-16 because it would be fast, but UTF-8 is nice as a cross-language standard.
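As an aside on the encoding point, a small sketch showing why the charset has to be fixed by the caller (or by the API): the same string produces different byte arrays, and therefore different hashes, under UTF-8 and UTF-16LE. The class name is just for the example:

    import java.nio.charset.StandardCharsets;
    import java.util.Arrays;

    public class CharsetBytesDemo {
        public static void main(String[] args) {
            final String item = "café";
            // UTF-8: 5 bytes here, because 'é' needs two bytes.
            final byte[] utf8 = item.getBytes(StandardCharsets.UTF_8);
            // UTF-16LE: 8 bytes, two per char, no byte-order mark.
            final byte[] utf16 = item.getBytes(StandardCharsets.UTF_16LE);
            System.out.println(Arrays.toString(utf8));   // [99, 97, 102, -61, -87]
            System.out.println(Arrays.toString(utf16));  // [99, 0, 97, 0, 102, 0, -23, 0]
            // Different bytes mean different hashes, so filters built with
            // different charsets cannot be queried against each other.
        }
    }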
>> >> Leave it out of the API for now, or add both:
>> >>
>> >> Builder with(CharSequence, Charset);
>> >> Builder withUnencoded(CharSequence);
>> >
>> > CharSequence has no easy method to convert to a byte[]. While it could be done, it looks to be more of a streaming interface. Let's leave that out.
>>
>> I was thinking:
>>
>>     /**
>>      * Adds a character sequence item to the hasher using the specified encoding.
>>      *
>>      * @param item the item to add
>>      * @param charset the character set
>>      * @return a reference to this object
>>      */
>>     default Builder with(CharSequence item, Charset charset) {
>>         return with(item.toString().getBytes(charset));
>>     }
>>
>>     /**
>>      * Adds a character sequence item to the hasher. Each 16-bit character is
>>      * converted to 2 bytes using little-endian order.
>>      *
>>      * @param item the item to add
>>      * @return a reference to this object
>>      */
>>     default Builder withUnencoded(CharSequence item) {
>>         final int length = item.length();
>>         final byte[] bytes = new byte[length * 2];
>>         for (int i = 0; i < length; i++) {
>>             final char ch = item.charAt(i);
>>             bytes[i * 2] = (byte) ch;
>>             bytes[i * 2 + 1] = (byte) (ch >>> 8);
>>         }
>>         return with(bytes);
>>     }
>>
>> >> I would argue that you may use BloomFilters for Strings, but if we see a BloomFilter as a collection then we should really support all Objects (with a decorator or by typing the Builder) or not support Objects. Currently we are not supporting any Object, so for now I would drop this, and the Hasher.Builder then becomes a very simple API that specifies that you put in items represented as a byte[] and call build to create a Hasher containing those items, and reset it for further use.
>> >
>> > I have code examples in several places where I hash GeoCode entities. Since they are comprised of strings, for the most part, building a hasher for them simply requires hashing the Strings. Many web services use JSON and most JSON is string based. I disagree with removing with(String) because it is so convenient in so many cases. It also makes the code cleaner/easier to read. But if you feel strongly about removing it then OK.
>>
>> So with(String) is replaced by a default with(CharSequence, Charset) method.
>>
>> > The only other thing that is really bothersome is the lack of Shape.getNumberOfBytes(). Yes, it is easy to call Math.ceil( Shape.getNumberOfBits() / 8.0 ), but getNumberOfBytes() is much more readable in the code.
>>
>> I removed this as the number of bytes is implementation dependent, for example if using a compressed BloomFilter.
>>
>> If the number of bytes is intended to be for the uncompressed representation then this does not match up with the canonical format of the BloomFilter using a long[].
>>
>> Q. Why do you need the number of bytes?
>>
>> One thing we do not have is the computation of the false positive probability using the current cardinality (this is in the Guava code):
>>
>>     p(false positive) = (cardinality() / m) ^ k
>>
>> which is just the probability that you will choose all k indexes as ones that are already set. This is not on the wikipedia page. Is this something you have used?
>>
>> Also on the wikipedia page is a section on counting Bloom filters where each bucket is stated as typically having 3-4 bits. Using 4 bits would allow counting up to 15 for each index. It would be more space efficient than the current use of 32-bit integers.
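To make the 3-4 bit bucket idea concrete, here is a minimal sketch of 4-bit counters packed two per byte, assuming counts saturate at 15; the names are hypothetical and this is not the existing ArrayCountingBloomFilter:

    // Sketch only: 4-bit ("nibble") counters, two per byte.
    class NibbleCounters {
        private final byte[] packed;

        NibbleCounters(int numberOfCounters) {
            this.packed = new byte[(numberOfCounters + 1) / 2];
        }

        int get(int index) {
            final int shift = (index & 1) * 4;
            return (packed[index >> 1] >>> shift) & 0xF;
        }

        // Returns false on overflow: 4 bits cannot count past 15.
        boolean increment(int index) {
            final int value = get(index);
            if (value == 0xF) {
                return false;
            }
            final int shift = (index & 1) * 4;
            final int cleared = packed[index >> 1] & ~(0xF << shift);
            packed[index >> 1] = (byte) (cleared | ((value + 1) << shift));
            return true;
        }
    }

The masking and shifting is exactly the fiddly part that makes a plain byte[] attractive.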
>> Changing to 8-bit bytes is a simple compromise. There is also a main page on counting Bloom filters [1].
>>
>> For a typical shape with p = 1e-6 and n items we have:
>>
>> n = 1000,  m = 28756,  k = 20
>> n = 10000, m = 287552, k = 20
>>
>> The worst case scenario is that every item added sets the same index, and the count could then be equal to n. But if the hash is uniform then the count for each bit will be much lower. The count for a bit is taken from the binomial distribution with number of trials = nk and probability of success = 1/m. Here are the expected counts (l) for an index when the filter is filled with n items (p is just used to set the optimal k and m using n):
>>
>>     n      k   m       p         l    binom(l, nk, 1/m)
>>     1000   20  28756   1.00E-06  0    0.498815437
>>     1000   20  28756   1.00E-06  1    0.346941705
>>     1000   20  28756   1.00E-06  2    0.12064836
>>     1000   20  28756   1.00E-06  4    0.004862559
>>     1000   20  28756   1.00E-06  8    6.76619E-07
>>     1000   20  28756   1.00E-06  16   7.10853E-17
>>     1000   20  28756   1.00E-06  32   1.66389E-41
>>     1000   20  28756   1.00E-06  64   2.8772E-100
>>     1000   20  28756   1.00E-06  127  1.0449E-234
>>     10000  20  287552  1.00E-06  0    0.498811214
>>     10000  20  287552  1.00E-06  1    0.346937562
>>     10000  20  287552  1.00E-06  2    0.120651928
>>     10000  20  287552  1.00E-06  4    0.004863763
>>     10000  20  287552  1.00E-06  8    6.77448E-07
>>     10000  20  287552  1.00E-06  16   7.14657E-17
>>     10000  20  287552  1.00E-06  32   1.70126E-41
>>     10000  20  287552  1.00E-06  64   3.15E-100
>>     10000  20  287552  1.00E-06  127  1.4984E-234
>>
>> So being able to store counts up to the int max value is total overkill. We could use 4 bits per counter, but this is harder to do with the current design that uses the sign of the entries as a guard digit for overflow or negative counts. A simple switch is to drop to a byte[] and allow counts up to 127. Any overflow would mark the state invalid. This makes the ArrayCountingBloomFilter 4 times more space efficient.
>>
>> Note that a counting Bloom filter is also described as having functionality we do not currently have. We can test for membership of a Hasher (e.g. an item) within the CountingBloomFilter, but not query for a set number of the same item. Should this be added to the interface:
>>
>>     boolean contains(Hasher hasher, int count)
>>
>> Here the Hasher is an issue, as it can represent multiple items which could overlap indexes. In this case the 'contains a specified count' method is invalid, as the expected count for those indexes that are overlapping would be higher.
>>
>> [1] https://en.wikipedia.org/wiki/Counting_Bloom_filter
>> >
>> > Claude

--
I like: Like Like - The likeliest place on the web <http://like-like.xenei.com>
LinkedIn: http://www.linkedin.com/in/claudewarren
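To round off the counting discussion above, a minimal sketch of the byte[] variant where each counter holds 0..127 and any overflow or underflow simply marks the whole filter invalid; the names are hypothetical, not the ArrayCountingBloomFilter implementation:

    // Sketch only: byte-backed counters with an invalid-state flag.
    class ByteCounters {
        private final byte[] counts;
        private boolean valid = true;

        ByteCounters(int numberOfBits) {
            this.counts = new byte[numberOfBits];
        }

        // Add one to each index of an item; passing 127 invalidates the filter.
        void add(int[] indexes) {
            for (final int i : indexes) {
                if (counts[i] == Byte.MAX_VALUE) {
                    valid = false;   // overflow: counts can no longer be trusted
                } else {
                    counts[i]++;
                }
            }
        }

        // Subtract one from each index; going below zero invalidates the filter.
        void remove(int[] indexes) {
            for (final int i : indexes) {
                if (counts[i] == 0) {
                    valid = false;   // removing an item that was never added
                } else {
                    counts[i]--;
                }
            }
        }

        boolean isValid() {
            return valid;
        }
    }

Backing the counts with a byte[] rather than an int[] is what gives the fourfold space saving mentioned above.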