> On 18 Feb 2020, at 08:02, Claude Warren <cla...@xenei.com> wrote:
> 
> On Mon, Feb 17, 2020 at 9:52 PM Alex Herbert <alex.d.herb...@gmail.com>
> wrote:
> 
>> 
>> 
>>> On 17 Feb 2020, at 20:30, Claude Warren <cla...@xenei.com> wrote:
>>> 
>>> Alex,
>>> 
>>> Thank you for your comments.
>>> 
>>> See comments inline.
>>> 
>>> 
>>> 
>>> On Mon, Feb 17, 2020 at 3:20 PM Alex Herbert <alex.d.herb...@gmail.com>
>>> wrote:
>>> 
>>>> I had a look through all the BloomFilter code. Thanks Claude for the
>>>> contribution.
>>>> 
>>>> Some items that need clarifying:
>>>> 
>>>> 
>>>> 1. HashFunctionIdentity.Signedness
>>>> 
>>>> This is not fully documented as to what the sign applies to. There is no
>>>> javadoc on the enum values for SIGNED and UNSIGNED. The current javadoc
>>>> states "Identifies the signedness of the calculations for this
>> function".
>>>> 
>>>> I assume this applies to the Hash computation in 'long
>>>> HashFunction.apply(byte[], int)'
>>>> 
>>>> Does this mean the hash algorithm has a variant that can treat the
>>>> bytes/seed as signed or unsigned. If so which one because 2 enum values
>>>> cannot cover all 4 possibilities. Since there is no javadoc it is
>>>> unclear exactly what this property is supposed to indicate.
>>>> 
>>> I will attempt to clarify in the javadoc.  Basically, when the underlying
>>> hash function executes it returns an array of bytes.  The algorithm can
>>> treat those bytes as signed or unsigned long values.  For the Java
>>> implementations this is probably signed, but other languages could
>>> implement it as unsigned.  For example, the murmur128 hash returns 128
>>> bits; these could be interpreted as 2 64-bit signed or unsigned longs.
>>> The Signedness attribute describes how that is interpreted.  As for the
>>> other values, the seed is a signed value and the byte buffer is
>>> interpreted as per the underlying hash function.
>> 
>> OK. So we add comments to the javadoc to state that the signedness is the
>> interpretation of the return value of
>> 
>> long HashFunction.apply(byte[], int)
>> 
>> So there can be algorithms that return a signed long and those that should
>> return an unsigned long but this is not a supported type in Java.
>> 
>> Either way it doesn’t really matter, it seems more for correctness than
>> for actual practical use given that unsigned and signed arithmetic on
>> twos-complement integers is the same.
>> 
> 
> My maths is rusty.  If A=0xF000ABCD is interpreted as unsigned and
> B=0xF000ABCD is interpreted as signed, does (A mod N) = (B mod N) for all
> positive values of N?  If so then you are correct and Signedness does not
> matter and can be removed.  I thought that the statement was false.
> 
Yes, you are correct. Addition and multiplication use the bits in the same way. 
Modulus does not.

So are the Murmur3 hash functions actually unsigned? I thought the original C 
code returned unsigned 64-bit values (uint64_t).

IIUC, the purpose of the Signedness enum is to specify that a negative value 
returned from the hash function should be used as a numerical magnitude in 
downstream computations, since sign extension on a negative sets the leading 
bits to 1s; i.e. using the negative as if it were unsigned (where all bits 
are random) is not valid. Perhaps you can give an example of how signed and 
unsigned would be handled differently.
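To illustrate the difference as I understand it (a standalone sketch, not library code): the same 64-bit pattern selects different residues modulo N depending on the interpretation.

```java
public class SignednessDemo {
    public static void main(String[] args) {
        // A bit pattern that is negative when read as a signed long.
        long h = 0xF000ABCD00000000L;
        int n = 7;
        // Signed interpretation: Java's % carries the sign of the dividend.
        long signedMod = h % n;
        // Unsigned interpretation of the identical bits.
        long unsignedMod = Long.remainderUnsigned(h, n);
        // The two interpretations select different bit indices.
        System.out.println(signedMod + " vs " + unsignedMod); // prints "-1 vs 1"
    }
}
```

Masking with (n - 1) for a power-of-2 n reads the low bits directly and sidesteps the signedness question entirely.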

> 
>>> 
>>> 
>>>> 2. HashFunctionIdentity comparators
>>>> 
>>>> The interface HashFunctionIdentity defines constants which are
>>>> comparators. Do these need to be here? Typically comparators should be
>>>> Serializable. Are these comparators ever going to be bundled with
>>>> BloomFilters and serialized? There are lots of ways to compare
>>>> HashFunctionIdentities. Comparators are for ordering. So why do you need
>>>> to order a HashFunction and why choose the two implementations provided?
>>>> 
>>> 
>>> There was a discussion on the list about building factories where the
>>> provider of the HashFunctionIdentity would be important (think in terms of
>>> the JCA, where multiple providers can provide the same hash/encryption
>>> functions but an application may wish to choose one over the other).
>> 
>> OK. But given that this is not being done we could drop it from the API
>> for now to make it cleaner.
>> 
> OK
> 
>> 
>>> 
>>>> 
>>>> From what I can see the COMMON comparator is used in the main code in
>>>> Shape, DynamicHasher and StaticHasher with:
>>>> 
>>>> if (compare(a, b) != 0) {
>>>>    // not allowed ...
>>>> }
>>>> 
>>>> The DEEP comparator is not used in the main code.
>>> 
>>> 
>>>> The use case for checking that the hash function is the same seems to be
>>>> better satisfied by using a package private HashFunctionIdentityEquality
>>>> helper method:
>>>> 
>>>>    static boolean areEqual(HashFunctionIdentity a,
>>>> HashFunctionIdentity b) {
>>>>        return (a.getSignedness() == b.getSignedness() &&
>>>>            a.getProcessType() == b.getProcessType() &&
>>>>            a.getName().equalsIgnoreCase(b.getName()));
>>>>    }
>>>> 
>>>> This would drop the comparators that no-one actually requires and
>>>> clean-up the public API.
>>>> 
>>> I can see the value of this solution.
>>> 
>>>> I note that not everything uses the comparator.
>>>> AbstractBloomFilter.verifyHasher uses:
>>>> 
>>>> if (a.getSignature() != b.getSignature()) {
>>>>    // not allowed ...
>>>> }
>>>> 
>>>> However the signature (a long value) could be the same by chance with
>>>> p=2^-64. So why is this class allowed to use getSignature() but others do
>>>> a full hash identity comparison? It seems to me you either do one or the
>>>> other: a full property equality test; or a quick signature check (which
>>>> is error prone); but not mix and match in different parts of the code.
>>>> 
>>> The reasoning for the second part is that Bloom filter comparisons must be
>>> very fast.  The possibility of signature collision is low but not zero.
>>> However, trusting the name has issues as well.  There are a number of
>>> murmur128 hash implementations that are not correct.  Using the signature
>>> will detect those because the signature value is built by the hashing
>>> function executing on the name.  Keep in mind that the actual function that
>>> performed the hash may not be present when the signature is verified.
>> 
>> So given as you point out that the comparison of the name is also open to
>> flaws I would suggest dropping the comparators and using getSignature() to
>> check for hash function equality. It is not perfect but then no other
>> solution is either and at least it is fast.
>> 
> 
> Q: The code used earlier in the process (while building the filters) uses
> the comparators to do a deeper check at a point where the impact of such a
> check is minimal.  The later signature check is used where speed is
> required and we can reasonably assume the deeper check has been performed.
> Does it not make sense to retain both?

In this case we can use an internal package private equality function and drop 
the comparators from the public API. I’ll make these changes (with your 
comments on why we are using a full deep equals or quick signature equals) and 
you can review.
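For reference, the quick signature check described above (the hash function run over its own name) can be sketched as standalone code; the interface shape, seed 0, and UTF-8 encoding are assumptions of this sketch, not the library's definition.

```java
import java.nio.charset.StandardCharsets;

public class HashIdentityDemo {
    // Minimal stand-in for the HashFunction contract (assumed shape).
    interface HashFunctionLike {
        String getName();
        long apply(byte[] buffer, int seed);
    }

    // The cheap hot-path check: a signature built by executing the hash
    // function over the bytes of its own name, so a broken implementation
    // produces a different signature even if it reports the same name.
    static long signature(HashFunctionLike f) {
        return f.apply(f.getName().getBytes(StandardCharsets.UTF_8), 0);
    }
}
```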

> 
> 
>> I think at least for the Murmur32x86Iterative class which is stateless
>> there is a case for a getInstance() method which could return a singleton
>> given the methods are stateless. Perhaps this pattern should be applied to
>> all the HashFunction implementations, i.e. private constructor and a
>> getInstance() method to obtain an instance.
>> 
>> 
> I am not certain this can be done in all cases.  Certainly  the javadoc
> would need to specify that the instance returned must be thread safe.  For
> example a simple static instance of MessageDigest (currently used in MD5
> implementation) could not be used. Additionally, in genomics I believe
> there are hasher constructors that require properties of the object being
> hashed before the hashing begins.

I did state that a singleton can be used for Murmur32x86Iterative; it may not 
have been clear that I was not suggesting a singleton for the others. I was 
discussing a change to the API to hide constructors.

The advantage of using the getInstance() method over a public INSTANCE field is 
that if in the future the class cannot be a singleton then the API does not 
need to change. So to allow support for the singleton I suggested adding the 
getInstance() method. In this case it would be strange to have a getInstance() 
method in Murmur32x86Iterative but not in any of the others. Given that the 
classes are final a private constructor is fine and so a factory constructor 
approach could be used.

Anyway since the other classes would require a newInstance rather than the 
singleton version getInstance it is a rearrangement that does not provide much 
benefit. Perhaps I’ll just add a note to the javadoc for Murmur32x86Iterative 
that the class is thread-safe. If this changes in the future then the javadoc 
can be updated.
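As a sketch of the hidden-constructor / getInstance() pattern under discussion (class name and hash body are illustrative only, not the Commons implementation):

```java
// Illustrative sketch: a stateless hash function exposed through a
// getInstance() factory method rather than a public constructor.
public final class StatelessHash {
    private static final StatelessHash INSTANCE = new StatelessHash();

    // Private constructor: instances are only available via getInstance().
    private StatelessHash() {
    }

    // Returns the shared instance today; the signature leaves room to
    // return a new object per call if the class later gains state.
    public static StatelessHash getInstance() {
        return INSTANCE;
    }

    // A trivial stateless hash (FNV-1a style, seed folded in) standing in
    // for the real algorithm; thread-safe as it touches no instance state.
    public long apply(byte[] buffer, int seed) {
        long hash = 0xcbf29ce484222325L ^ seed;
        for (byte b : buffer) {
            hash ^= b & 0xff;
            hash *= 0x100000001b3L;
        }
        return hash;
    }
}
```

Callers would use StatelessHash.getInstance().apply(bytes, seed); the factory method can later return distinct instances without an API change.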

> 
> 
>>> 
>>>> 
>>>> 3. AbstractBloomFilter
>>>> 
>>>> I considered changing the cardinality methods to avoid using BitSet
>>>> cardinality() and computing the cardinality direct. BitSet will
>>>> duplicate the input long[] in the constructor. For the or/xor/and
>>>> cardinality this is 2 array creations that can be avoided with a direct
>>>> implementation. For the cardinality method it avoids 1 array creation
>>>> via BitSet. Here is the type of implementation for a direct computation:
>>>> 
>>>>    @Override
>>>>    public int cardinality() {
>>>>        int count = 0;
>>>>        for (final long bits : getBits()) {
>>>>            count += Long.bitCount(bits);
>>>>        }
>>>>        return count;
>>>>    }
>>>> 
>>>>    @Override
>>>>    public int andCardinality(final BloomFilter other) {
>>>>        verifyShape(other);
>>>>        final long[] mine = getBits();
>>>>        final long[] theirs = other.getBits();
>>>>        final int limit = Integer.min(mine.length, theirs.length);
>>>>        int count = 0;
>>>>        for (int i = 0; i < limit; i++) {
>>>>            count += Long.bitCount(mine[i] & theirs[i]);
>>>>        }
>>>>        return count;
>>>>    }
>>>> 
>>>>    @Override
>>>>    public int orCardinality(final BloomFilter other) {
>>>>        verifyShape(other);
>>>>        final long[] mine = getBits();
>>>>        final long[] theirs = other.getBits();
>>>>        long[] small;
>>>>        long[] big;
>>>>        if (mine.length > theirs.length) {
>>>>            big = mine;
>>>>            small = theirs;
>>>>        } else {
>>>>            small = mine;
>>>>            big = theirs;
>>>>        }
>>>>        int count = 0;
>>>>        for (int i = 0; i < small.length; i++) {
>>>>            // OR
>>>>            count += Long.bitCount(small[i] | big[i]);
>>>>        }
>>>>        for (int i = small.length; i < big.length; i++) {
>>>>            count += Long.bitCount(big[i]);
>>>>        }
>>>>        return count;
>>>>    }
>>>> 
>>>> The xorCardinality is the same as orCardinality but using the ^ operator.
>>>> 
>>>> I didn't make the change but this should be considered after a
>>>> performance test with JMH.
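For completeness, the xor variant mentioned above can be sketched against raw long[] words (illustrative only, written outside the BloomFilter interface):

```java
public class XorCardinalityDemo {
    // Mirrors orCardinality but with the ^ operator; XOR against the
    // implicit zero words of the shorter array leaves the longer array's
    // bits unchanged, so they are counted directly in the second loop.
    static int xorCardinality(long[] mine, long[] theirs) {
        long[] small = mine.length <= theirs.length ? mine : theirs;
        long[] big = mine.length <= theirs.length ? theirs : mine;
        int count = 0;
        for (int i = 0; i < small.length; i++) {
            count += Long.bitCount(small[i] ^ big[i]);
        }
        for (int i = small.length; i < big.length; i++) {
            count += Long.bitCount(big[i]);
        }
        return count;
    }
}
```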
>>>> 
>>> 
>>> Indeed your solutions would work against the longs returned by getBits()
>>> and make more sense.
>> 
>> I will look at how other projects have incorporated JMH for performance
>> tests. It would be valuable here.
>> 
>> The purpose would be to show that it is faster to use a
>> BloomFilter.contains(…) method in front of a full check for object presence
>> (i.e. in a HashMap) rather than just a HashMap. Do you have any such
>> examples? I remember a while back you posted some information on speed
>> testing. Say for example a filter to store 10000 random URLs and then
>> testing random URL hits (with some actual bad hits) are in/not in the set
>> of bad URLs. I could mock this up as a simple speed test for the Bloom
>> filter but if you have other examples then it would be useful.
>> 
>> 
> The classes that perform the testing I reported earlier are in
> https://github.com/Claudenw/BloomTest.

Thanks. I’ll take a look.
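In the meantime, a minimal mock of the URL test might look like this toy sketch (a BitSet-backed filter with a hypothetical double-hashing scheme, not the Commons implementation):

```java
import java.util.BitSet;
import java.util.HashSet;
import java.util.Set;

public class UrlFilterDemo {
    // Toy Bloom filter: m bits, k indices per item via double hashing.
    static final class ToyBloomFilter {
        private final BitSet bits;
        private final int m;
        private final int k;

        ToyBloomFilter(int m, int k) {
            this.bits = new BitSet(m);
            this.m = m;
            this.k = k;
        }

        // Hypothetical index scheme: two hashes derived from hashCode().
        private int index(String s, int i) {
            int h1 = s.hashCode();
            int h2 = Integer.rotateLeft(h1, 16) | 1; // odd second hash
            return Math.floorMod(h1 + i * h2, m);
        }

        void add(String s) {
            for (int i = 0; i < k; i++) {
                bits.set(index(s, i));
            }
        }

        boolean mightContain(String s) {
            for (int i = 0; i < k; i++) {
                if (!bits.get(index(s, i))) {
                    return false;
                }
            }
            return true;
        }
    }

    public static void main(String[] args) {
        Set<String> badUrls = new HashSet<>();
        ToyBloomFilter filter = new ToyBloomFilter(1 << 17, 5);
        badUrls.add("http://bad.example/1");
        filter.add("http://bad.example/1");

        // The filter screens out most misses before the exact set lookup.
        String candidate = "http://bad.example/1";
        boolean isBad = filter.mightContain(candidate) && badUrls.contains(candidate);
        System.out.println(isBad); // prints "true"
    }
}
```

A benchmark would time mightContain-then-contains against contains alone over a large stream of mostly-missing URLs.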

> 
> 
>>> 
>>> 
>>>> 
>>>> There are a few other places where specialised implementations may be
>>>> more performant so a JMH benchmark would be useful for Bloom filter,
>>>> e.g. the important
>>>> 
>>>> BloomFilter.contains(Hasher)
>>>> 
>>>> BloomFilter.contains(BloomFilter)
>>>> 
>>> If you have better "standard" implementations, let's implement them.
>>> Implementations that have special needs (e.g. CountingBloomFilter,
>>> MultidimensionalBloomFilter, AttenuatedBloomFilter) will implement as they
>>> see fit.
>> 
>> I was actually just wondering about a custom implementation of
>> BitSetBloomFilter. There are only a few operations from BitSet that are
>> used:
>> 
>> cardinality()
>> setBit(int)
>> or(BitSet)
>> and(BitSet)
>> xor(BitSet)
>> stream()
>> 
>> However BitSet has some overhead in that it maintains the ability to
>> expand in size and will do so for some of these operations. If replaced
>> with a FixedBitSetBloomFilter then it may be faster at the cost of having a
>> large initial allocation overhead.
>> 
>> At least in the case of the or/and/xor methods these are used with a
>> cloned BitSet for a cardinality count. This has the memory allocation
>> overhead for a potentially large object that will be thrown away. It is
>> used via the default method for BloomFilter.contains(BloomFilter) which
>> calls andCardinality. This may be the most commonly used method for a
>> BitSetBloomFilter other than BloomFilter.contains(Hasher).
>> 
>> All the methods apart from stream() are easy to implement. In that case it
>> seems what is actually required is a primitive iterator of nextSetBit(int).
>> So they require some real world use cases to determine if a custom
>> FixedBitSetBloomFilter with its own internal long[] would be better than
>> delegating to BitSet.
>> 
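A fixed-size bit set of the kind described might be sketched as follows (names hypothetical; plain long[] storage, no growth logic):

```java
// Sketch of a fixed-size bit set backed by a bare long[]: no resizing,
// so set/get/cardinality avoid BitSet's growth checks and defensive copies.
final class FixedBitSet {
    private final long[] words;

    FixedBitSet(int numberOfBits) {
        // One 64-bit word per 64 bits, rounded up.
        words = new long[(numberOfBits + 63) >>> 6];
    }

    void set(int bit) {
        words[bit >>> 6] |= 1L << bit; // Java masks shift counts to 0-63
    }

    boolean get(int bit) {
        return (words[bit >>> 6] & (1L << bit)) != 0;
    }

    int cardinality() {
        int count = 0;
        for (long word : words) {
            count += Long.bitCount(word);
        }
        return count;
    }

    // The primitive nextSetBit-style iteration mentioned above; a real
    // implementation would scan word by word rather than bit by bit.
    int nextSetBit(int fromIndex) {
        for (int i = fromIndex; i < words.length << 6; i++) {
            if (get(i)) {
                return i;
            }
        }
        return -1;
    }
}
```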
> I have an example of a different bitset implementation using an EWAH
> compressed bitmap.  It can be found in
> https://github.com/Claudenw/MultidimentionalBloom/blob/master/src/main/java/org/xenei/bloom/filter/EWAHBloomFilter.java
> as part of a multidimensional Bloom filter library:
> https://github.com/Claudenw/MultidimentionalBloom
> 
> You can also find some interesting reading concerning bitmaps and Bloom
> filter implementations from Daniel Lemire on his GitHub:
> https://github.com/lemire
> 
>> 
>>> Alex
>>>> 
>>>> 
>>>> 
>>>> ---------------------------------------------------------------------
>>>> To unsubscribe, e-mail: dev-unsubscr...@commons.apache.org
>>>> For additional commands, e-mail: dev-h...@commons.apache.org
>>>> 
>>>> 
>>> 
>>> --
>>> I like: Like Like - The likeliest place on the web
>>> <http://like-like.xenei.com>
>>> LinkedIn: http://www.linkedin.com/in/claudewarren
>> 
> 
> 
> -- 
> I like: Like Like - The likeliest place on the web
> <http://like-like.xenei.com>
> LinkedIn: http://www.linkedin.com/in/claudewarren
