> On 19 Feb 2020, at 21:14, Claude Warren <cla...@xenei.com> wrote:
>
> I think the compromise solution of removing the thrown exception and adding
> an isInvalid() method makes sense and fits within the parameters of how the
> counting Bloom filter should work. However, I think the add() and remove()
> (or add and subtract) methods should only accept CountingBloomFilter
> arguments. The reasoning is that the methods are specific to the counting
> filter and are defined in terms of the internal structure of the filter
> (the count of each bit), which is not present in a "standard" Bloom
> filter. I think that makes it lexically clear.
OK. We can always add a transaction counting Bloom filter later that will
ensure entire filters are added or removed cleanly. For now it can be
user-beware documentation that states if you subtract a filter you did not add
then you can end up with a set of counts that are not a sum of filters.
I do note that you could do this anyway even with a transaction counting Bloom
filter, e.g. A+B+C+D - E. It may or may not throw, and the trust is on the user
to ensure they do not subtract E from something that never had it added.
Thinking about the invalid flag, perhaps the methods merge/add/subtract should
just return 0 if the filter state is valid and -1 if invalid. Thus once a
filter is invalid all calls would return the invalid flag.
The reason for this is efficiency. The counts are allowed to roam freely over
the full range of an int. You keep a valid flag that is the bitwise OR of every
count you have adjusted. If any over/underflow occurs then the valid flag will
have a negative sign bit. The methods then return the signed shift of the valid
flag:
return valid >> 31;
This would be -1 if any count ever over/underflows, zero otherwise.
This eliminates any conditional statements from the filter operations. The user
just has to know this is how it works, and so any downstream consumption of the
counts would have to first check the invalid flag and sanitise them as
necessary. You can even view the counts as unsigned int32 if you never subtract
filters.
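
As a minimal sketch of the scheme (an array-backed toy rather than the real
map-backed structure, so the shape of the loop is mine):

class CountingFilterSketch {
    private final int[] counts = new int[1024]; // one count per bit index
    private int state; // bitwise OR of every count we have adjusted

    int add(CountingFilterSketch other) {
        for (int i = 0; i < counts.length; i++) {
            counts[i] += other.counts[i];
            state |= counts[i]; // overflow turns on the sign bit
        }
        return state >> 31; // -1 once any count over/underflowed, else 0
    }

    int subtract(CountingFilterSketch other) {
        for (int i = 0; i < counts.length; i++) {
            counts[i] -= other.counts[i];
            state |= counts[i]; // underflow turns on the sign bit
        }
        return state >> 31;
    }

    boolean isInvalid() {
        return state < 0;
    }
}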
>
> Do you have a branch someplace where you are doing this work? I would
> like to see the proposed changes you outlined for the AbstractBloomFilter
> implemented as I think you are correct and they would be faster and cleaner.
No but I’ll create one. I’m waiting for a big PR against collections to go
through (an update to force use of checkstyle). Then I’ll revert the changes I
made to the CountingBloomFilterTest to its previous state and put the changes
into a branch. These would have a new internal structure for the counts and the
methods for add/subtract.
For the AbstractBloomFilter do you mean the use of the mod(long, int) method in
the Signedness enum? I think the removal of Math.floorMod can be done in two
places with a bit shift, but should we also use the % operator instead of
Math.floorMod for Signedness.UNSIGNED?
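
For reference, a sketch of the index computations under discussion (the
power-of-two mask is my assumption about where a shift/mask can replace
floorMod; the method names are illustrative):

static int bitIndex(long hash, int m) {
    // current approach: floorMod wraps negative hashes into [0, m)
    return (int) Math.floorMod(hash, (long) m);
}

static int bitIndexPow2(long hash, int m) {
    // if m is a power of two, a mask replaces the division entirely
    return (int) (hash & (m - 1));
}

static int bitIndexUnsigned(long hash, int m) {
    // for Signedness.UNSIGNED the value is treated as non-negative, so
    // the plain % operator matches floorMod and skips its adjustment
    return (int) (hash % m);
}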
>
> Claude
>
> On Wed, Feb 19, 2020 at 12:41 AM Alex Herbert <alex.d.herb...@gmail.com>
> wrote:
>
>>
>>
>>> On 18 Feb 2020, at 22:34, Gary Gregory <garydgreg...@gmail.com> wrote:
>>>
>>> On Tue, Feb 18, 2020 at 5:32 PM Claude Warren <cla...@xenei.com> wrote:
>>>
>>>> Last one first, why a tree map? I think it is a holdover from an earlier
>>>> implementation. It can be any reasonable Map (e.g. HashMap).
>>>>
>>>> On the remove() issue. You are absolutely correct, there is a bug. It
>>>>
>>>
>>> May you please add a test to validate whatever fix you guys implement? I'm
>>> not following the details here but I want to stress the importance of tests
>>> to help future maintainers validate any future changes and protect us from
>>> regression bugs. IOW, while the current code base is in your head, more
>>> tests! ;-)
>>>
>>
>> I added a test that shows the current implementation does not do what I
>> thought it should do based on the documentation in javadoc:
>>
>> "For each bit that is turned on in the other filter; if the other filter is
>> also a CountingBloomFilter the count is added to this filter, otherwise
>> the
>> count is incremented by one."
>>
>> My test checks the counts are added.
>>
>> However, it does what Claude thought it should do. The original test checks
>> that the indices are incremented by 1.
>>
>> So what to do about counting? If overflow or underflow occur on addition
>> or subtraction you can throw an exception to signal error but if a partial
>> operation has been made this leaves the filter in an undetermined state.
>> The filter does not actually correspond to a sum of filters anymore. So it
>> is invalid. Should any further operations on the filter be prevented? In
>> this case throwing an IllegalStateException for all further operations
>> seems extreme. It may be better to have the addition/subtraction return
>> true/false if the operation was clean, i.e. similar to a standard Set
>> collection that returns true if adding/removing an item changes the state,
>> false otherwise. The filter could also maintain a state flag to indicate
>> that an overflow/underflow has occurred and the filter no longer represents
>> a sum of filters.
>>
>> Given this idea we change merge to increment the indices by 1 for any
>> BloomFilter including a counting filter. It will not throw if it overflows
>> but will set an invalid state flag. The logic here is the method is
>> inherited from BloomFilter and so will function in the same way as other
>> filters by accumulating set bits (indices).
>>
>> Then add some methods:
>>
>> boolean add(BloomFilter)
>> boolean remove(BloomFilter)
>> boolean isInvalid()
>>
>> These update the counts using the counts of the other filter if it is a
>> CountingBloomFilter, otherwise add or subtract 1. The methods return true
>> if the operation added or removed the filter entirely, i.e. it is now a
>> part of the sum or was removed from the sum. Otherwise they return false
>> and the filter is marked as an invalid sum of filters. This leaves the
>> counting filter in an undetermined state but it can still be used as a
>> non-counting filter as the indices will be correctly maintained.
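>>
>> As a sketch of those semantics (getCount/setCount and the countsOf
>> iteration are hypothetical helpers, just to show the return contract):
>>
>> private boolean invalid;
>>
>> boolean add(BloomFilter other) {
>>     boolean clean = true;
>>     // countsOf(other) would yield the counts of a CountingBloomFilter,
>>     // or a count of 1 for each set bit of any other filter
>>     for (Map.Entry<Integer, Integer> e : countsOf(other)) {
>>         int updated = getCount(e.getKey()) + e.getValue();
>>         if (updated < 0) { // overflow: no longer a sum of filters
>>             invalid = true;
>>             clean = false;
>>         }
>>         setCount(e.getKey(), updated);
>>     }
>>     return clean;
>> }
>>
>> boolean isInvalid() {
>>     return invalid;
>> }
>>
>> remove(BloomFilter) would be the mirror image, subtracting and flagging
>> underflow.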
>>
>> It is a compromise but there does not seem to be a nice way to do this
>> without a lot of overhead. That would be to check the add/remove can be
>> done entirely before actually performing the operation. If the operation
>> would under/overflow then an exception can be thrown and the filter is left
>> in a usable state. This could be achieved with some efficiency if the
>> filter stored the maximum count. Then any add would be able to check for
>> potential overflow quickly and do a safe add. However the subtraction could
>> not know the indices that may underflow using a max value (since 0 - x
>> underflows) and would always have to do a safe removal by checking all
>> decrements before actually doing the entire removal. One idea here is a
>> reusable list is kept to store all the indices that are to be decremented.
>> Stage 1 finds all the indices and checks for underflow in the subtraction.
>> If underflow then throw. Otherwise traverse the list and actually do the
>> subtraction. The overhead of the second list traversal is likely to be much
>> lower than the initial identification of indices in the backing map. This
>> all becomes easy to do if the backing data structure stores both the index
>> and a mutable field, e.g.
>>
>> Set<MutableIndex> counts;
>>
>> class MutableIndex {
>>     int index;
>>     int count;
>>     int hashCode() { return index; }
>>     // etc
>> }
>>
>> You then find all the MutableIndex objects in the Set and put them in a
>> reusable temp List (with the new count stored in a parallel list). If all
>> can be incremented/decremented without under/overflow then just go over the
>> list and update the count using the precomputed new count. The entire
>> operation then works as a single transaction.
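>>
>> A sketch of that two-stage transaction (indicesOf and findEntry are
>> hypothetical lookups into the other filter and the Set):
>>
>> private final List<MutableIndex> entries = new ArrayList<>();
>> private final List<Integer> newCounts = new ArrayList<>();
>>
>> void subtract(BloomFilter other) {
>>     entries.clear();
>>     newCounts.clear();
>>     // Stage 1: locate each entry and check every decrement is safe
>>     for (int index : indicesOf(other)) {
>>         MutableIndex entry = findEntry(counts, index);
>>         int updated = (entry == null ? 0 : entry.count) - 1;
>>         if (updated < 0) {
>>             throw new IllegalStateException("Underflow on index " + index);
>>         }
>>         entries.add(entry);
>>         newCounts.add(updated);
>>     }
>>     // Stage 2: nothing can underflow now; commit the precomputed counts
>>     for (int i = 0; i < entries.size(); i++) {
>>         entries.get(i).count = newCounts.get(i);
>>     }
>> }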
>>
>> It may be better put into a TransactionCountingBloomFilter and then
>> CountingBloomFilter is left to be faster and not care about over/underflow.
>>
>> Thoughts on this approach?
>>
>>> Gary
>>>
>>>
>>>> looks like a partial edit got restored back into the code. The first block
>>>> should be removed. In fact it should probably read:
>>>>
>>>> if (val == null || val == 0) {
>>>>     throw new IllegalStateException("Underflow on index " + idx);
>>>> } else if (val == 1) {
>>>>     counts.remove(idx);
>>>> } else {
>>>>     counts.put(idx, val - 1);
>>>> }
>>>>
>>>> The question of merging/removing counting Bloom filters is an interesting
>>>> one.
>>>>
>>>> Merging a "normal" Bloom filter A into a counting Bloom filter B simply
>>>> increments all the bits specified by A.getBits().
>>>>
>>>> I took the approach that a counting Bloom filter was just another Bloom
>>>> filter. So merging counting Bloom filter A into counting Bloom filter B
>>>> simply incremented all the bits specified by A.getBits().
>>>>
>>>> However the source has moved on since the original implementation was laid
>>>> down, and it might make sense that merging Bloom filters adds the counts
>>>> from A into B. Then if the developer wanted to perform a simple merge for
>>>> a Bloom filter the developer could use B.merge( A.getHasher() ). However,
>>>> I felt that since a counting Bloom filter is an instance of Bloom filter
>>>> the merge should be driven by getBits().
>>>>
>>>> Perhaps it makes sense for counting Bloom filters to have another method
>>>> add( CountingBloomFilter a) that would add the counts from A into B. This
>>>> would necessitate a subtract( CountingBloomFilter a) as well.
>>>>
>>>> The getCounts() method returns a Map.Entry because it was more semantic in
>>>> nature (e.g. it is obvious that it is a key-value pair) but I am not
>>>> averse to changing it to an int[]. If the question is why have the method
>>>> at all then the answers are 1. it is needed for testing (bad answer), and
>>>> 2. it is the only way to retrieve the counts for each bit, which can be
>>>> required by some applications.
>>>>
>>>> Using a Bag might make sense, but my reading of the AbstractMapBag is that
>>>> it tracks the total of all counts in the bag, so we would overflow that
>>>> counter before we overflowed a single count. This definition of size()
>>>> differs from the definition of size() for a map and so we would have to
>>>> rewrite the cardinality() method. However, it would be nice if we could
>>>> lift the MutableInteger class.
>>>>
>>>> In addition, we do need to check for overflow and underflow. An underflow
>>>> indicates that a filter was removed that was not originally added.
>>>> Overflow is possible when processing streams as in a "stable Bloom filter"
>>>> (https://en.wikipedia.org/wiki/Bloom_filter#Stable_Bloom_filters).
>>>>
>>>> Claude
>>>>
>>>>
>>>> On Tue, Feb 18, 2020 at 5:31 PM Alex Herbert <alex.d.herb...@gmail.com>
>>>> wrote:
>>>>
>>>>> I've updated master with some of the fixes discussed.
>>>>>
>>>>> I've been looking through the rest of the BloomFilter code and the
>>>>> CountingBloomFilter appears to be broken:
>>>>>
>>>>> 1. When another CountingBloomFilter is merged or removed the counts of
>>>>> the other filter are ignored. This is not as documented.
>>>>>
>>>>> 2. When a remove operation is performed and the count for an index
>>>>> decrements past zero it throws an exception. This is not documented but
>>>>> enforced in the unit tests. This is very unfriendly. Digging into this
>>>>> part of the code and there is a duplication of logic for remove. The
>>>>> first part will not throw if the decrement passes zero. Then the same
>>>>> index is decremented again (with the same new value) and it will throw
>>>>> this time for underflow. So maybe at some point it didn't throw and now
>>>>> it does.
>>>>>
>>>>> I commented the code where I think the problems are and added tests to
>>>>> master that fail with the current implementation.
>>>>>
>>>>> I thought this type of logic is ideally suited to a Bag. So I rewrote
>>>>> the class using Bag<Integer>. This passes the same tests but has two
>>>>> differences:
>>>>>
>>>>> a) It will not throw when remove would underflow. The bit index just
>>>>> gets set to zero (and that item is removed from the Bag).
>>>>>
>>>>> b) Bag has no support for overflow in the count of each item. It
>>>>> actually has a weird feature that adding 2 to Integer.MAX_VALUE will
>>>>> overflow to negative and then subtracting 1 would reset the count to
>>>>> zero because it was still negative after subtracting 1. In Bag this is
>>>>> not an issue as it is limited by Collection.size() returning an int for
>>>>> the count of everything in the Bag. So if you add that many items the
>>>>> collection will break elsewhere too.
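>>>>>
>>>>> A sketch of that weird feature with a HashBag (the reset behaviour is
>>>>> as I read AbstractMapBag, so treat this as illustrative, not tested):
>>>>>
>>>>> import org.apache.commons.collections4.Bag;
>>>>> import org.apache.commons.collections4.bag.HashBag;
>>>>>
>>>>> Bag<Integer> bag = new HashBag<>();
>>>>> bag.add(42, Integer.MAX_VALUE);
>>>>> bag.add(42, 2); // count wraps to negative
>>>>> bag.remove(42, 1); // count is not > 1, so the item is removed entirely
>>>>> int count = bag.getCount(42); // 0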
>>>>>
>>>>> I think the change (a) is an improvement. But are there situations where
>>>>> you want to know that you cannot remove a filter exactly from another
>>>>> one? If so perhaps a removeExactly(...) method for this case. I'd prefer
>>>>> to just drop the throwing of exceptions. After all if you do an
>>>>> andCardinality(BloomFilter) you do not get errors thrown when one cannot
>>>>> be "subtracted" from the other.
>>>>>
>>>>> Change (b) is faster but could lead to errors. So how likely is overflow
>>>>> in a counting Bloom filter?
>>>>>
>>>>> Both can be supported but it would have to use a custom Bag
>>>>> implementation. This is pretty trivial and would actually speed up other
>>>>> parts of the filter by having direct access to the Map underlying the
>>>>> Bag for faster merge and remove operations. This would then be a change
>>>>> to the current Map<Integer, Integer> to a Map<Integer, MutableInteger>
>>>>> to store counts.
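>>>>>
>>>>> For instance, with lang3's MutableInt standing in for the Bag's private
>>>>> MutableInteger (illustrative only):
>>>>>
>>>>> import org.apache.commons.lang3.mutable.MutableInt;
>>>>>
>>>>> Map<Integer, MutableInt> counts = new HashMap<>();
>>>>> // merge: increment in place, one map lookup per index
>>>>> counts.computeIfAbsent(index, k -> new MutableInt()).increment();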
>>>>>
>>>>> I can update the code to fix the implementation but will only do so
>>>>> after a decision on overflow is made. We could even have a custom
>>>>> implementation that stores counts as a long for very high counting
>>>>> filters. Is this likely to be used? Only in a situation where the
>>>>> exceptions from overflow are an issue.
>>>>>
>>>>>
>>>>> Other questions about this:
>>>>>
>>>>> - Why public Stream<Map.Entry<Integer, Integer>> getCounts()?
>>>>>
>>>>> Perhaps this is better served raw as arrays of length 2. This has lower
>>>>> memory overhead:
>>>>>
>>>>> public Stream<int[]> getCounts()
>>>>>
>>>>> I've ruled out an accessor method as there is no equivalent for boolean
>>>>> getBit(int index) in BloomFilter, e.g.:
>>>>>
>>>>> public int getCount(int index)
>>>>>
>>>>> - Why a TreeMap? Is the order of the indices important?
>>>>>
>>>>> Or do you have some benchmarks to show that the TreeMap handles lots of
>>>>> growth and shrinkage better than a HashMap? There are situations where
>>>>> each one would be a better choice and so perhaps this is a case for
>>>>> having a CountingBloomFilter with a choice for the backing Map. This
>>>>> could be done using an abstract class with implementations. This is one
>>>>> to leave until some benchmarks are in place in the main code.
>>>>>
>>>>>
>>>>> On 18/02/2020 09:54, Claude Warren wrote:
>>>>>> On Tue, Feb 18, 2020 at 9:12 AM Alex Herbert <alex.d.herb...@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>
>>>>>>>> My maths is rusty. If A=0xF000ABCD is interpreted as unsigned and
>>>>>>>> B=0xF000ABCD is interpreted as signed, does (A mod N) = (B mod N) for
>>>>>>>> all positive values of N? If so then you are correct and Signedness
>>>>>>>> does not matter and can be removed. I thought that the statement was
>>>>>>>> false.
>>>>>>>>
>>>>>>> Yes, you are correct. Addition and multiplication use the bits in the
>>>>>>> same way. Modulus does not.
>>>>>>>
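>>>>>>> A quick worked example, for N = 7:
>>>>>>>
>>>>>>> int bits = 0xF000ABCD;
>>>>>>> long asSigned = bits; // -268388403
>>>>>>> long asUnsigned = Integer.toUnsignedLong(bits); // 4026578893
>>>>>>> Math.floorMod(asSigned, 7L); // 4
>>>>>>> Math.floorMod(asUnsigned, 7L); // 1
>>>>>>>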
>>>>>>> So are the Murmur3 hash functions actually unsigned? I thought the
>>>>>>> original C code returned unsigned 64-bit values (uint64_t).
>>>>>>>
>>>>>>> IIUC the purpose of the Signedness enum is to specify that a negative
>>>>>>> value returned from the hash function should be used as a numerical
>>>>>>> magnitude if then used in downstream computations since sign extension
>>>>>>> on negatives will set leading bits to 1s. I.e. using the negative as if
>>>>>>> it were unsigned (and all bits are random) is not valid. Perhaps you
>>>>>>> can give an example of how signed or unsigned would be handled
>>>>>>> differently.
>>>>>>>
>>>>>>>
>>>>>> The murmur hash is not a good example here as it returns the resulting
>>>>>> buffer as 2 signed longs. The MD5 is probably a better example as the
>>>>>> messageDigest.digest() returns a 128-bit byte[16]. The code then
>>>>>> interprets that as 2 signed longs. A C implementation could interpret
>>>>>> that as 2 unsigned longs. Additionally, there are hashes that return
>>>>>> smaller buffers, such as murmur2 which returns a 32-bit unsigned int in
>>>>>> the C code. In Java that could be represented as a long as per the
>>>>>> Integer.toUnsignedLong() method or it could just be cast to a long and
>>>>>> thus signed.
>>>>>>
>>>>>> The values returned are then passed to floorMod( value, numberOfBits )
>>>>>> to turn on the appropriate bit in the resulting bit vector. As noted
>>>>>> above this will yield different values depending upon how the byte[]
>>>>>> was represented.
>>>>>>
>>>>>> Claude
>>>>>>
>>>>>
>>>>>
>>>>>
>>>>
>>>>
>>
>>
>>
>>
>
> --
> I like: Like Like - The likeliest place on the web
> <http://like-like.xenei.com>
> LinkedIn: http://www.linkedin.com/in/claudewarren
---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@commons.apache.org
For additional commands, e-mail: dev-h...@commons.apache.org