> On 17 Feb 2020, at 20:30, Claude Warren <cla...@xenei.com> wrote:
>
> Alex,
>
> Thank you for your comments.
>
> See comments inline.
>
> On Mon, Feb 17, 2020 at 3:20 PM Alex Herbert <alex.d.herb...@gmail.com> wrote:
>
>> I had a look through all the BloomFilter code. Thanks Claude for the
>> contribution.
>>
>> Some items that need clarifying:
>>
>> 1. HashFunctionIdentity.Signedness
>>
>> This is not fully documented as to what the sign applies to. There is
>> no javadoc on the enum values for SIGNED and UNSIGNED. The current
>> javadoc states "Identifies the signedness of the calculations for this
>> function".
>>
>> I assume this applies to the hash computation in
>> 'long HashFunction.apply(byte[], int)'.
>>
>> Does this mean the hash algorithm has a variant that can treat the
>> bytes/seed as signed or unsigned? If so, which one? Two enum values
>> cannot cover all 4 possibilities. Since there is no javadoc it is
>> unclear exactly what this property is supposed to indicate.
>
> I will attempt to clarify in the javadoc. Basically, when the underlying
> hash function executes it returns an array of bytes. The algorithm can
> treat those bytes as signed or unsigned long values. For the Java
> implementations this is probably signed, but other languages could
> implement it as unsigned. For example, the murmur128 hash returns 128
> bits; these could be interpreted as 2 64-bit signed or unsigned longs.
> The Signedness attribute describes how that is interpreted. As for the
> other values, the seed is a signed value and the byte buffer is
> interpreted as per the underlying hash function.
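If I have understood correctly, this is purely about how the digest bytes
are converted to long values, not about the input. A minimal sketch of the
two interpretations (illustrative only; the class name and the 16-byte
digest are mine, the digest length matching murmur128):

    import java.nio.ByteBuffer;

    public class SignednessExample {
        public static void main(String[] args) {
            // Hypothetical 16-byte digest, the size murmur128 produces.
            byte[] digest = new byte[16];
            digest[0] = (byte) 0x80; // set the top bit so the first long is negative

            ByteBuffer buf = ByteBuffer.wrap(digest);
            long h1 = buf.getLong(); // first 8 bytes
            long h2 = buf.getLong(); // second 8 bytes

            // SIGNED: the natural Java reading of the 64 bits.
            System.out.println(h1); // prints a negative value

            // UNSIGNED: the same bits, but Java has no unsigned long type,
            // so printing and comparing must use the unsigned helpers.
            System.out.println(Long.toUnsignedString(h1));
            System.out.println(Long.compareUnsigned(h1, h2)); // h1 > h2 unsigned
        }
    }

Either way the bit pattern is identical; the attribute only matters when
the values are compared, printed or divided.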
OK. So we add comments to the javadoc to state that the signedness is the
interpretation of the return value of long HashFunction.apply(byte[], int).
So there can be algorithms that return a signed long and those that should
return an unsigned long, though unsigned long is not a supported type in
Java. Either way it doesn't really matter; it seems more for correctness
than for actual practical use, given that signed and unsigned addition and
multiplication on two's-complement integers produce the same bit patterns.

>> 2. HashFunctionIdentity comparators
>>
>> The interface HashFunctionIdentity defines constants which are
>> comparators. Do these need to be here? Typically comparators should be
>> Serializable. Are these comparators ever going to be bundled with
>> BloomFilters and serialized? There are lots of ways to compare
>> HashFunctionIdentities. Comparators are for ordering. So why do you
>> need to order a HashFunction, and why choose the two implementations
>> provided?
>
> There was a discussion on the list about building factories where the
> provider of the HashFunctionIdentity would be important (think in terms
> of the JCA, where multiple providers can provide the same
> hash/encryption functions but an application may wish to choose one over
> the other).

OK. But given that this is not being done we could drop it from the API
for now to make it cleaner.

>> From what I can see the COMMON comparator is used in the main code in
>> Shape, DynamicHasher and StaticHasher with:
>>
>> if (compare(a, b) != 0) {
>>     // not allowed ...
>> }
>>
>> The DEEP comparator is not used in the main code.
>>
>> The use case for checking that the hash function is the same seems to
>> be better satisfied by using a package-private
>> HashFunctionIdentityEquality helper method:
>>
>> static boolean areEqual(HashFunctionIdentity a,
>>                         HashFunctionIdentity b) {
>>     return a.getSignedness() == b.getSignedness() &&
>>            a.getProcessType() == b.getProcessType() &&
>>            a.getName().equalsIgnoreCase(b.getName());
>> }
>>
>> This would drop the comparators that no-one actually requires and
>> clean up the public API.
>
> I can see the value of this solution.

>> I note that not everything uses the comparator.
>> AbstractBloomFilter.verifyHasher uses:
>>
>> if (a.getSignature() != b.getSignature()) {
>>     // not allowed ...
>> }
>>
>> However the signature (a long value) could be the same by chance with
>> p=2^-64. So why is this class allowed to use getSignature() while
>> others do a full hash identity comparison? It seems to me you either do
>> one or the other: a full property equality test, or a quick signature
>> check (which is error prone); but not mix and match in different parts
>> of the code.
>
> The reasoning for the second part is that bloom filter comparisons must
> be very fast. The possibility of signature collision is low but not
> zero. However, trusting the name has issues as well. There are a number
> of murmur128 hash implementations that are not correct. Using the
> signature will detect those because the signature value is built by the
> hashing function executing on the name. Keep in mind that the actual
> function that performed the hash may not be present when the signature
> is verified.

So, given that as you point out the comparison of the name is also open
to flaws, I would suggest dropping the comparators and using
getSignature() to check for hash function equality. It is not perfect,
but then no other solution is either, and at least it is fast.
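The check then reduces to a single long comparison, something like this
(a sketch only; the method name and exception are illustrative, not the
existing code):

    static void verifyHasher(HashFunctionIdentity a, HashFunctionIdentity b) {
        // One long comparison: fast, with a false-match probability of
        // 2^-64 that is accepted in exchange for speed.
        if (a.getSignature() != b.getSignature()) {
            throw new IllegalArgumentException(
                String.format("Hasher %s is not the same as %s",
                    a.getName(), b.getName()));
        }
    }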
I think, at least for the Murmur32x86Iterative class, there is a case for
a getInstance() method that returns a singleton, given the methods are
stateless. Perhaps this pattern should be applied to all the HashFunction
implementations, i.e. a private constructor and a getInstance() method to
obtain an instance.

>> 3. AbstractBloomFilter
>>
>> I considered changing the cardinality methods to avoid using
>> BitSet.cardinality() and computing the cardinality directly. BitSet
>> will duplicate the input long[] in the constructor. For the or/xor/and
>> cardinality this is 2 array creations that can be avoided with a direct
>> implementation. For the cardinality method it avoids 1 array creation
>> via BitSet. Here is the type of implementation for a direct
>> computation:
>>
>> @Override
>> public int cardinality() {
>>     int count = 0;
>>     for (final long bits : getBits()) {
>>         count += Long.bitCount(bits);
>>     }
>>     return count;
>> }
>>
>> @Override
>> public int andCardinality(final BloomFilter other) {
>>     verifyShape(other);
>>     final long[] mine = getBits();
>>     final long[] theirs = other.getBits();
>>     final int limit = Integer.min(mine.length, theirs.length);
>>     int count = 0;
>>     for (int i = 0; i < limit; i++) {
>>         count += Long.bitCount(mine[i] & theirs[i]);
>>     }
>>     return count;
>> }
>>
>> @Override
>> public int orCardinality(final BloomFilter other) {
>>     verifyShape(other);
>>     final long[] mine = getBits();
>>     final long[] theirs = other.getBits();
>>     long[] small;
>>     long[] big;
>>     if (mine.length > theirs.length) {
>>         big = mine;
>>         small = theirs;
>>     } else {
>>         small = mine;
>>         big = theirs;
>>     }
>>     int count = 0;
>>     for (int i = 0; i < small.length; i++) {
>>         // OR
>>         count += Long.bitCount(small[i] | big[i]);
>>     }
>>     for (int i = small.length; i < big.length; i++) {
>>         count += Long.bitCount(big[i]);
>>     }
>>     return count;
>> }
>>
>> The xorCardinality is the same as orCardinality but using the ^
>> operator.
>>
>> I didn't make the change but this should be considered after a
>> performance test with JMH.
>
> Indeed your solutions would work against the longs returned by getBits()
> and make more sense.

I will look at how other projects have incorporated JMH for performance
tests. It would be valuable here. The purpose would be to show that it is
faster to use a BloomFilter.contains(…) method in front of a full check
for object presence (i.e. in a HashMap) rather than just a HashMap. Do
you have any such examples? I remember a while back you posted some
information on speed testing. Say, for example, a filter to store 10000
random URLs and then testing whether random URLs (with some actual bad
hits) are in/not in the set of bad URLs. I could mock this up as a simple
speed test for the Bloom filter, but if you have other examples then that
would be useful.

>> There are a few other places where specialised implementations may be
>> more performant, so a JMH benchmark would be useful for Bloom filter,
>> e.g. the important:
>>
>> BloomFilter.contains(Hasher)
>> BloomFilter.contains(BloomFilter)
>
> If you have better "standard" implementations let's implement them.
> Implementations that have special needs (e.g. CountingBloomFilter,
> MultidimensionalBloomFilter, AttenuatedBloomFilter) will implement as
> they see fit.

I was actually just wondering about a custom implementation of
BitSetBloomFilter.
There are only a few operations from BitSet that are used:

    cardinality()
    set(int)
    or(BitSet)
    and(BitSet)
    xor(BitSet)
    stream()

However, BitSet has some overhead in that it maintains the ability to
expand in size and will do so for some of these operations. If replaced
with a FixedBitSetBloomFilter then it may be faster, at the cost of a
large initial allocation overhead.

At least in the case of the or/and/xor methods, these are used with a
cloned BitSet for a cardinality count. This has the memory allocation
overhead of a potentially large object that will be thrown away. It is
used via the default method for BloomFilter.contains(BloomFilter), which
calls andCardinality. This may be the most commonly used method for a
BitSetBloomFilter other than BloomFilter.contains(Hasher).

All the methods apart from stream() are easy to implement. In that case
it seems what is actually required is a primitive iterator over
nextSetBit(int).

So some real-world use cases are required to determine whether a custom
FixedBitSetBloomFilter with its own internal long[] would be better than
delegating to BitSet; a rough sketch of the idea follows.
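Something along these lines (FixedBitSet is a hypothetical class, not
existing code; only the core bit operations are shown, not the full
BloomFilter contract):

    public class FixedBitSet {
        private final long[] bits;

        public FixedBitSet(int numberOfBits) {
            // Fixed allocation up front; no growth checks on later operations.
            this.bits = new long[(numberOfBits + 63) / 64];
        }

        public void set(int bitIndex) {
            // Java's << only uses the low 6 bits of a long shift count.
            bits[bitIndex >>> 6] |= 1L << bitIndex;
        }

        public boolean get(int bitIndex) {
            return (bits[bitIndex >>> 6] & (1L << bitIndex)) != 0;
        }

        public int cardinality() {
            int count = 0;
            for (final long word : bits) {
                count += Long.bitCount(word);
            }
            return count;
        }

        // andCardinality without cloning a BitSet: no intermediate object.
        public int andCardinality(FixedBitSet other) {
            final int limit = Math.min(bits.length, other.bits.length);
            int count = 0;
            for (int i = 0; i < limit; i++) {
                count += Long.bitCount(bits[i] & other.bits[i]);
            }
            return count;
        }

        // The nextSetBit(int) primitive needed in place of BitSet.stream().
        public int nextSetBit(int fromIndex) {
            int i = fromIndex >>> 6;
            if (i >= bits.length) {
                return -1;
            }
            // Mask off the bits below fromIndex within the first word.
            long word = bits[i] & (-1L << fromIndex);
            while (true) {
                if (word != 0) {
                    return (i << 6) + Long.numberOfTrailingZeros(word);
                }
                if (++i == bits.length) {
                    return -1;
                }
                word = bits[i];
            }
        }
    }

Whether this actually beats delegating to BitSet is exactly the sort of
question the JMH benchmarks above should answer.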