Re: [collections] Bloom filters

Claude Warren Tue, 18 Feb 2020 14:32:49 -0800

Last one first, why a tree map?  I think it is a holdover from an earlier
implementation.  It can be any reasonable Map (e.g. HashMap).


on the remove() issue.  You are absolutely correct, there is a bug.  I
looks like a partial edit got restored back into the code.  The first block
should be removed.  In fact it should probably read

           if (val == null || val == 0) {
                throw new IllegalStateException("Underflow on index " +
idx);
            } else if (val == 1) {
                counts.remove(idx);
            } else {
                counts.put(idx, val - 1);
            }

The question of merging/removing counting bloom filters is an interesting
one.

Merging a "normal" Bloom filter A into a counting Bloom filter B simply
increments all the bits specified  by A.getBits().

I took the approach that a counting bloom filter was just another Bloom
filter.  So merging counting Bloom filter A into counting Bloom filter B
simply incremented all the bits specified by A.getBits().

However the source has moved on since the original implementation was laid
down, and it might make sense that merging bloom filters adds the counts
from A into B.  Then if the developer wanted to perform a simple merge for
a Bloom filter the developer could use B.merge( A.getHasher() ).  However,
I felt that since a counting Bloom filter is an instance of Bloom filter
the merge should be driven by getBits().

Perhaps it makes sense for counting Bloom filters to have another method
add( CountingBloomFilter a) that would add the counts from A into B.  This
would necessitate a subtract( CountingBloomFilter a) as well.

the getCounts() method returns a Map.Entry because it was more semantic in
nature (e.g. it is obvious that it is a key-value pair) but I am note
adverse to to changing it to an int[].  If the question is why the method
at all  then the answers are 1. it is needed for testing (bad answer), 2.
it is the only way to retrieve the counts for each bit which can be
required by some applications.

Using a Bag might make sense, but my reading of the AbstractMapBag is that
it tracks the total of all counts in the bag, so we would overflow that
counter before we overflowed a single count.   This definition of size()
differs from the definition of size() for a map and so we would have to
rewrite the cardinality() method.  However, it would be nice if we could
lift the MutableInteger class.

In addition, we do need to check for overflow and underflow.  An underflow
indicates that a filter was removed that was not originally added.
Overflow is possible when processing streams as in a "stable Bloom filter" (
https://en.wikipedia.org/wiki/Bloom_filter#Stable_Bloom_filters).

Claude


On Tue, Feb 18, 2020 at 5:31 PM Alex Herbert <alex.d.herb...@gmail.com>
wrote:

> I've updated master with some of the fixes discussed.
>
> I've been looking through the rest of the BloomFilter code and the
> CountingBloomFilter appears to be broken:
>
> 1. When another CountingBloomFiler is merged or removed the counts of
> the other filter are ignored. This is not as documented.
>
> 2. When a remove operation is performed and the count for an index
> decrements past zero it throws an exception. This is not documented but
> enforced in the unit tests. This is very unfriendly. Digging into this
> part of the code and there is a duplication of logic for remove. The
> first part will not throw if the decrement passes zero. Then the same
> index is decremented again (with the same new value) and it will throw
> this time for underflow. So maybe at some point it didn't throw and now
> it does.
>
> I commented the code where I think the problems are and added tests to
> master that fail with the current implementation.
>
> I thought this type of logic is ideally suited to a Bag. So I rewrote
> the class using Bag<Integer>. This passes the same tests but has two
> differences:
>
> a) It will not throw when remove would underflow. The bit index just
> gets set to zero (and that item is removed from the Bag).
>
> b) Bag has no support for overflow in the count of each item. It
> actually has a weird feature that adding 2 to Integer.MAX_VALUE will
> overflow to negative and then subtracting 1 would reset the count to
> zero because it was still negative after subtracting 1. In Bag this is
> not an issue as it is limited by Collections.size() returning an int for
> the count of everything in the Bag. So if you add that many items the
> collection will break elsewhere too.
>
> I think the change (a) is an improvement. But are there situations where
> you want to know that you cannot remove a filter exactly from another
> one? If so perhaps a removeExactly(...) method for this case. I'd prefer
> to just drop the throwing of exceptions. After all if you do an
> andCardinality(BloomFilter) you do not get errors thrown when one cannot
> be "subtracted" from the other.
>
> Change (b) is faster but could lead to errors. So how likely is overflow
> in a counting Bloom filter?
>
> Both can be supported but it would have to use a custom Bag
> implementation. This is pretty trivial and would actually speed up other
> parts of the filter by having direct access to the Map underlying the
> Bag for faster merge and remove operations. This would then be a change
> to the current Map<Integer, Integer> to a Map<Integer, MutableInteger>
> to store counts.
>
> I can update the code to fix the implementation but will only do so
> after a decision on overflow is made. We could even have a custom
> implementation that stores counts as a long for very high counting
> filters. Is this likely to be used? Only in a situation where the
> exceptions from overflow are an issue.
>
>
> Other questions about this:
>
> - Why public Stream<Map.Entry<Integer, Integer>> getCounts()?
>
> Perhaps this is better served raw as arrays of length 2. This has lower
> memory overhead:
>
> public Stream<int[]> getCounts()
>
> I've ruled out an accessor method as there is no equivalent for boolean
> getBit(int index) in BloomFilter, e.g:
>
> public int getCount(int index)
>
> - Why a TreeMap? Is the order of the indices important?
>
> Or do you have some benchmarks to show that the TreeMap handles lots of
> growth and shrinkage better than a HashMap. There are situations where
> each one would be a better choice and so perhaps this is a case for
> having a CountingBloomFilter with a choice for the backing Map. This
> could be done using an abstract class with implementations. This is one
> to leave until some benchmarks are in place in the main code.
>
>
> On 18/02/2020 09:54, Claude Warren wrote:
> > On Tue, Feb 18, 2020 at 9:12 AM Alex Herbert <alex.d.herb...@gmail.com>
> > wrote:
> >
> >
> >>> My maths is rusty.  If A=0xF000ABCD as interpreted as an unsigned  and
> >>> B=0xF000ABCD but interpreted as a signed does (A mod N) = (B mod N) for
> >> all
> >>> positive values of N?  If so then you are correct and Signedness does
> not
> >>> matter and can be removed.  I thought that the statement was false.
> >>>
> >> Yes, you are correct. Addition and multiplication use the bits in the
> same
> >> way. Modulus does not.
> >>
> >> So are the Murmur3 hash functions actually unsigned? I thought the
> >> original c code returned unsigned 64-bit values (uint64_t).
> >>
> >> IIUC the purpose of the Signedness enum is to specify that a negative
> >> value returned from the hash function should be used as a numerical
> >> magnitude if then used in downstream computations since sign extension
> on
> >> negatives will set leading bits to 1s. I.e. using the negative as if it
> >> were unsigned (and all bits are random) is not valid. Perhaps you can
> give
> >> an example of how signed or unsigned would be handled differently.
> >>
> >>
> > The murmur hash is not a good example here as it returns the resulting
> > buffer as 2 signed longs.  The MD5 is probably a better example as the
> > messageDigest.digest() returns a byte[128].  The code then interprets
> that
> > as 2 signed longs.  A C implementation could interpret that as 2 unsigned
> > longs.  Additionally, there are hashes that return smaller buffers such
> as
> > with murmur2 which returns a 32bit unsigned int in the C code.  In java
> > that could be represented as a long as per the Integer.asUnsignedlong()
> > method or it could just be cast to a long and thus signed.
> >
> > The values returned are then passed to floorMod( value, numberOfBits ) to
> > turn on the appropriate bit in the resulting bit vector.  As noted above
> > this will yield different values depending upon how the byte[] was
> > represented.
> >
> > Claude
> >
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: dev-unsubscr...@commons.apache.org
> For additional commands, e-mail: dev-h...@commons.apache.org
>
>

-- 
I like: Like Like - The likeliest place on the web
<http://like-like.xenei.com>
LinkedIn: http://www.linkedin.com/in/claudewarren

Re: [collections] Bloom filters

Reply via email to