> On 18 Feb 2020, at 22:34, Gary Gregory <garydgreg...@gmail.com> wrote:
>
> On Tue, Feb 18, 2020 at 5:32 PM Claude Warren <cla...@xenei.com> wrote:
>
>> Last one first: why a TreeMap? I think it is a holdover from an earlier
>> implementation. It can be any reasonable Map (e.g. HashMap).
>>
>> On the remove() issue: you are absolutely correct, there is a bug. It
>>
>
> Could you please add a test to validate whatever fix you guys implement? I'm
> not following the details here, but I want to stress the importance of tests
> to help future maintainers validate any future changes and protect us from
> regression bugs. IOW, while the current code base is in your head, more
> tests! ;-)
>
I added a test showing that the current implementation does not do what I
thought it should do based on the javadoc:
"For each bit that is turned on in the other filter; if the other filter is
also a CountingBloomFilter the count is added to this filter, otherwise the
count is incremented by one."
My test checks that the counts are added. However, it does what Claude thought
it should do: the original test checks that the indices are incremented by 1.
So what to do about counting? If overflow or underflow occurs on addition or
subtraction you can throw an exception to signal the error, but if a partial
operation has already been applied this leaves the filter in an undetermined
state: the filter no longer corresponds to a sum of filters, so it is invalid.
Should any further operations on the filter be prevented? Throwing an
IllegalStateException for all further operations seems extreme. It may be
better to have the addition/subtraction return true/false to indicate whether
the operation was clean, i.e. similar to a standard Set collection that
returns true if adding/removing an item changes the state, false otherwise.
The filter could also maintain a state flag to indicate that an
overflow/underflow has occurred and the filter no longer represents a sum of
filters.
Given this idea, we change merge to increment the indices by 1 for any
BloomFilter, including a counting filter. It will not throw if it overflows
but will set an invalid-state flag. The logic here is that the method is
inherited from BloomFilter and so should function in the same way as in other
filters, by accumulating set bits (indices).
Then add some methods:
boolean add(BloomFilter)
boolean remove(BloomFilter)
boolean isInvalid()
These update the counts using the counts of the other filter if it is a
CountingBloomFilter, otherwise adding or subtracting 1. The methods return
true if the operation added or removed the filter entirely, i.e. it is now
part of the sum or was removed from the sum. Otherwise they return false and
the filter is marked as an invalid sum of filters. This leaves the counting
filter in an undetermined state, but it can still be used as a non-counting
filter since the indices will be correctly maintained. A sketch of these
semantics follows below.
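A minimal sketch of these semantics, assuming a Map<Integer, Integer> backing
store (only the method names add/remove/isInvalid come from the proposal
above; the class name and the int[] of set bit indices are illustrative):

    import java.util.HashMap;
    import java.util.Map;

    // Sketch only: a counting filter that flags over/underflow rather than
    // throwing. The other filter is simplified to an array of its distinct
    // set bit indices.
    class CountingSketch {
        private final Map<Integer, Integer> counts = new HashMap<>();
        private boolean invalid;

        // Add 1 to each index; returns false (and marks invalid) on overflow.
        boolean add(int[] indices) {
            boolean clean = true;
            for (int index : indices) {
                int count = counts.getOrDefault(index, 0);
                if (count == Integer.MAX_VALUE) {
                    invalid = true;   // a partial operation has occurred
                    clean = false;
                } else {
                    counts.put(index, count + 1);
                }
            }
            return clean;
        }

        // Subtract 1 from each index; returns false (and marks invalid) on
        // underflow.
        boolean remove(int[] indices) {
            boolean clean = true;
            for (int index : indices) {
                Integer count = counts.get(index);
                if (count == null) {
                    invalid = true;   // removing something never added
                    clean = false;
                } else if (count == 1) {
                    counts.remove(index);
                } else {
                    counts.put(index, count - 1);
                }
            }
            return clean;
        }

        // True once any over/underflow has occurred; the filter is then no
        // longer an exact sum of the filters added to it.
        boolean isInvalid() {
            return invalid;
        }
    }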
It is a compromise, but there does not seem to be a nice way to do this
without a lot of overhead: checking that the add/remove can be done entirely
before actually performing the operation. If the operation would
under/overflow then an exception can be thrown and the filter left in a usable
state. This could be achieved with some efficiency if the filter stored the
maximum count: any add would then be able to check for potential overflow
quickly and do a safe add. However, the subtraction cannot know which indices
may underflow using a max value (since 0 - x underflows) and would always have
to do a safe removal by checking all decrements before actually doing the
entire removal.
One idea here is to keep a reusable list storing all the indices that are to
be decremented. Stage 1 finds all the indices and checks the subtraction for
underflow; if it would underflow then throw, otherwise traverse the list and
actually do the subtraction. The overhead of the second list traversal is
likely to be much lower than the initial identification of the indices in the
backing map. This all becomes easy to do if the backing data structure stores
both the index and a mutable count field, e.g.:
    Set<MutableIndex> counts;

    class MutableIndex {
        int index;
        int count;

        @Override
        public int hashCode() {
            return index;
        }

        // equals() likewise compares only the index, so the Set is
        // effectively keyed by index.
    }
You then find all the MutableIndex objects in the Set and put them in a
reusable temp List (with the new count stored in a parallel list). If all can
be incremented/decremented without under/overflow then just go over the list
and update each count using the precomputed new count. The entire operation
then works as a single transaction. A sketch of the two-stage removal is
below.
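A rough sketch of the check-then-commit removal, assuming the mutable-count
structure above (class and method names are illustrative):

    import java.util.ArrayList;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    // Two-stage removal: stage 1 validates every decrement, stage 2 commits.
    // The filter is never left partially updated.
    class TransactionalCounts {
        static final class MutableIndex {
            final int index;
            int count;
            MutableIndex(int index) { this.index = index; }
        }

        private final Map<Integer, MutableIndex> counts = new HashMap<>();
        // Reusable scratch list so stage 2 does not search the map again.
        private final List<MutableIndex> staged = new ArrayList<>();

        // indices are the distinct set bits of the filter being removed.
        void remove(int[] indices) {
            staged.clear();
            // Stage 1: locate each entry and check for underflow.
            for (int index : indices) {
                MutableIndex entry = counts.get(index);
                if (entry == null) {
                    throw new IllegalStateException("Underflow on index " + index);
                }
                staged.add(entry);
            }
            // Stage 2: all decrements are safe; commit them.
            for (MutableIndex entry : staged) {
                if (--entry.count == 0) {
                    counts.remove(entry.index);
                }
            }
        }
    }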
This transactional behaviour may be better placed in a
TransactionCountingBloomFilter, leaving CountingBloomFilter to be faster and
not care about over/underflow.
Thoughts on this approach?
> Gary
>
>
>> looks like a partial edit got restored back into the code. The first block
>> should be removed. In fact it should probably read
>>
>> if (val == null || val == 0) {
>>     throw new IllegalStateException("Underflow on index " + idx);
>> } else if (val == 1) {
>>     counts.remove(idx);
>> } else {
>>     counts.put(idx, val - 1);
>> }
>>
>> The question of merging/removing counting Bloom filters is an interesting
>> one.
>>
>> Merging a "normal" Bloom filter A into a counting Bloom filter B simply
>> increments all the bits specified by A.getBits().
>>
>> I took the approach that a counting Bloom filter is just another Bloom
>> filter. So merging counting Bloom filter A into counting Bloom filter B
>> simply incremented all the bits specified by A.getBits(), as sketched below.
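>>
>> As an illustrative sketch (indicesOf is a hypothetical helper yielding each
>> set bit index of the other filter once; the real getBits() signature may
>> differ):
>>
>>     // Merge driven by set bits only: each set index is incremented by 1,
>>     // regardless of the other filter's counts.
>>     // counts: Map<Integer, Integer>
>>     public void merge(BloomFilter other) {
>>         for (int idx : indicesOf(other)) {   // hypothetical helper
>>             counts.merge(idx, 1, Integer::sum);
>>         }
>>     }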
>>
>> However, the source has moved on since the original implementation was laid
>> down, and it might make sense that merging Bloom filters adds the counts
>> from A into B. Then, if the developer wanted to perform a simple merge of
>> a Bloom filter, they could use B.merge( A.getHasher() ). However, I felt
>> that since a counting Bloom filter is an instance of Bloom filter, the
>> merge should be driven by getBits().
>>
>> Perhaps it makes sense for counting Bloom filters to have another method
>> add( CountingBloomFilter a) that would add the counts from A into B. This
>> would necessitate a subtract( CountingBloomFilter a) as well.
>>
>> The getCounts() method returns a Map.Entry because it was more semantic in
>> nature (e.g. it is obvious that it is a key-value pair), but I am not
>> averse to changing it to an int[]. If the question is why have the method
>> at all, then the answers are: 1. it is needed for testing (a bad answer);
>> 2. it is the only way to retrieve the counts for each bit, which some
>> applications require.
>>
>> Using a Bag might make sense, but my reading of AbstractMapBag is that
>> it tracks the total of all counts in the bag, so we would overflow that
>> counter before we overflowed a single count. This definition of size()
>> differs from the definition of size() for a Map, so we would have to
>> rewrite the cardinality() method. However, it would be nice if we could
>> lift the MutableInteger class.
>>
>> In addition, we do need to check for overflow and underflow. An underflow
>> indicates that a filter was removed that was not originally added.
>> Overflow is possible when processing streams as in a "stable Bloom filter"
>> (
>> https://en.wikipedia.org/wiki/Bloom_filter#Stable_Bloom_filters).
>>
>> Claude
>>
>>
>> On Tue, Feb 18, 2020 at 5:31 PM Alex Herbert <alex.d.herb...@gmail.com>
>> wrote:
>>
>>> I've updated master with some of the fixes discussed.
>>>
>>> I've been looking through the rest of the BloomFilter code and the
>>> CountingBloomFilter appears to be broken:
>>>
>>> 1. When another CountingBloomFilter is merged or removed, the counts of
>>> the other filter are ignored. This is not as documented.
>>>
>>> 2. When a remove operation is performed and the count for an index
>>> decrements past zero, it throws an exception. This is not documented but
>>> is enforced in the unit tests. This is very unfriendly. Digging into this
>>> part of the code, there is duplicated logic for remove: the first part
>>> will not throw if the decrement passes zero, then the same index is
>>> decremented again (with the same new value) and this time it throws for
>>> underflow. So maybe at some point it didn't throw and now it does.
>>>
>>> I commented the code where I think the problems are and added tests to
>>> master that fail with the current implementation.
>>>
>>> I thought this type of logic is ideally suited to a Bag. So I rewrote
>>> the class using Bag<Integer>. This passes the same tests but has two
>>> differences:
>>>
>>> a) It will not throw when remove would underflow. The bit index just
>>> gets set to zero (and that item is removed from the Bag).
>>>
>>> b) Bag has no support for overflow in the count of each item. It
>>> actually has a weird feature: adding 2 to a count of Integer.MAX_VALUE
>>> will overflow to negative, and then subtracting 1 resets the count to
>>> zero because it is still negative after the subtraction. In Bag this is
>>> not an issue as it is limited by Collection.size() returning an int for
>>> the count of everything in the Bag; if you add that many items the
>>> collection will break elsewhere too. A short demonstration is below.
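>>>
>>> For example (a small demonstration with HashBag; the counts behave as
>>> described above):
>>>
>>>     import org.apache.commons.collections4.Bag;
>>>     import org.apache.commons.collections4.bag.HashBag;
>>>
>>>     public class BagOverflowDemo {
>>>         public static void main(String[] args) {
>>>             Bag<Integer> bag = new HashBag<>();
>>>             bag.add(7, Integer.MAX_VALUE);  // count = Integer.MAX_VALUE
>>>             bag.add(7, 2);                  // count overflows to negative
>>>             bag.remove(7, 1);               // count is still negative after
>>>                                             // subtracting, so the entry is
>>>                                             // dropped and the count is 0
>>>             System.out.println(bag.getCount(7));  // prints 0
>>>         }
>>>     }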
>>>
>>> I think the change (a) is an improvement. But are there situations where
>>> you want to know that you cannot remove a filter exactly from another
>>> one? If so, perhaps a removeExactly(...) method for this case. I'd prefer
>>> to just drop the throwing of exceptions. After all, if you do an
>>> andCardinality(BloomFilter) you do not get errors thrown when one cannot
>>> be "subtracted" from the other.
>>>
>>> Change (b) is faster but could lead to errors. So how likely is overflow
>>> in a counting Bloom filter?
>>>
>>> Both can be supported but it would have to use a custom Bag
>>> implementation. This is pretty trivial and would actually speed up other
>>> parts of the filter by having direct access to the Map underlying the
>>> Bag for faster merge and remove operations. This would then be a change
>>> from the current Map<Integer, Integer> to a Map<Integer, MutableInteger>
>>> to store counts, e.g. as sketched below.
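>>>
>>> For example (all names illustrative):
>>>
>>>     import java.util.HashMap;
>>>     import java.util.Map;
>>>
>>>     // Sketch: a mutable count avoids creating a new boxed Integer on
>>>     // every increment.
>>>     class MutableCounts {
>>>         static final class MutableInteger {
>>>             int value;
>>>         }
>>>
>>>         private final Map<Integer, MutableInteger> counts = new HashMap<>();
>>>
>>>         void increment(int idx) {
>>>             counts.computeIfAbsent(idx, k -> new MutableInteger()).value++;
>>>         }
>>>     }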
>>>
>>> I can update the code to fix the implementation but will only do so
>>> after a decision on overflow is made. We could even have a custom
>>> implementation that stores counts as a long for very high counting
>>> filters. Is this likely to be used? Only in a situation where the
>>> exceptions from overflow are an issue.
>>>
>>>
>>> Other questions about this:
>>>
>>> - Why public Stream<Map.Entry<Integer, Integer>> getCounts()?
>>>
>>> Perhaps this is better served raw as arrays of length 2. This has lower
>>> memory overhead:
>>>
>>> public Stream<int[]> getCounts()
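>>>
>>> For illustration, this could look like the following (a sketch only, with
>>> counts a Map<Integer, Integer>):
>>>
>>>     public Stream<int[]> getCounts() {
>>>         return counts.entrySet().stream()
>>>                      .map(e -> new int[] {e.getKey(), e.getValue()});
>>>     }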
>>>
>>> I've ruled out an accessor method as there is no equivalent for boolean
>>> getBit(int index) in BloomFilter, e.g.:
>>>
>>> public int getCount(int index)
>>>
>>> - Why a TreeMap? Is the order of the indices important?
>>>
>>> Or do you have some benchmarks to show that the TreeMap handles lots of
>>> growth and shrinkage better than a HashMap? There are situations where
>>> each one would be the better choice, so perhaps this is a case for having
>>> a CountingBloomFilter with a choice of backing Map. This could be done
>>> using an abstract class with implementations. This is one to leave until
>>> some benchmarks are in place in the main code.
>>>
>>>
>>> On 18/02/2020 09:54, Claude Warren wrote:
>>>> On Tue, Feb 18, 2020 at 9:12 AM Alex Herbert <alex.d.herb...@gmail.com>
>>>> wrote:
>>>>
>>>>
>>>>>> My maths is rusty. If A=0xF000ABCD is interpreted as unsigned and
>>>>>> B=0xF000ABCD is interpreted as signed, does (A mod N) = (B mod N) for
>>>>>> all positive values of N? If so then you are correct and Signedness
>>>>>> does not matter and can be removed. I thought that the statement was
>>>>>> false.
>>>>>>
>>>>> Yes, you are correct. Addition and multiplication use the bits in the
>>>>> same way. Modulus does not.
>>>>>
>>>>> So are the Murmur3 hash functions actually unsigned? I thought the
>>>>> original C code returned unsigned 64-bit values (uint64_t).
>>>>>
>>>>> IIUC the purpose of the Signedness enum is to specify that a negative
>>>>> value returned from the hash function should be used as a numerical
>>>>> magnitude if then used in downstream computations, since sign extension
>>>>> on negatives will set the leading bits to 1s. I.e. using the negative
>>>>> as if it were unsigned (and all bits are random) is not valid. Perhaps
>>>>> you can give an example of how signed or unsigned would be handled
>>>>> differently.
>>>>>
>>>>>
>>>> The murmur hash is not a good example here as it returns the resulting
>>>> buffer as 2 signed longs. MD5 is probably a better example, as
>>>> MessageDigest.digest() returns a byte[16] (128 bits). The code then
>>>> interprets that as 2 signed longs. A C implementation could interpret
>>>> it as 2 unsigned longs. Additionally, there are hashes that return
>>>> smaller buffers, such as murmur2, which returns a 32-bit unsigned int
>>>> in the C code. In Java that could be represented as a long as per
>>>> Integer.toUnsignedLong(), or it could just be cast to a long and thus
>>>> signed.
>>>>
>>>> The values returned are then passed to floorMod( value, numberOfBits )
>>>> to turn on the appropriate bit in the resulting bit vector. As noted
>>>> above, this will yield different values depending upon how the byte[]
>>>> was interpreted. For example:
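>>>>
>>>>     // Illustrative snippet: the same 32 bits give different bit indices
>>>>     // depending on the signed/unsigned interpretation.
>>>>     public class SignednessDemo {
>>>>         public static void main(String[] args) {
>>>>             int bits = 0xF000ABCD;
>>>>             long signedValue = bits;                           // sign extended: -268391475
>>>>             long unsignedValue = Integer.toUnsignedLong(bits); // 4026575821
>>>>             long numberOfBits = 1000;
>>>>             System.out.println(Math.floorMod(signedValue, numberOfBits));   // 525
>>>>             System.out.println(Math.floorMod(unsignedValue, numberOfBits)); // 821
>>>>         }
>>>>     }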
>>>>
>>>> Claude
>>>>
>>>
>>
>> --
>> I like: Like Like - The likeliest place on the web
>> <http://like-like.xenei.com>
>> LinkedIn: http://www.linkedin.com/in/claudewarren
>>
---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@commons.apache.org
For additional commands, e-mail: dev-h...@commons.apache.org