I had a look through all the BloomFilter code. Thanks Claude for the contribution.

Some items that need clarifying:


1. HashFunctionIdentity.Signedness

This is not fully documented as to what the sign applies to. There is no javadoc on the enum values for SIGNED and UNSIGNED. The current javadoc states "Identifies the signedness of the calculations for this function".

I assume this applies to the Hash computation in 'long HashFunction.apply(byte[], int)'

Does this mean the hash algorithm has a variant that can treat the bytes/seed as signed or unsigned. If so which one because 2 enum values cannot cover all 4 possibilities. Since there is no javadoc it is unclear exactly what this property is supposed to indicate.


2. HashFunctionIdentity comparators

The interface HashFunctionIdentity defines constants which are comparators. Do these need to be here? Typically comparators should be Serializable. Are these comparators ever going to be bundled with BloomFilters and serialized? There are lots of ways to compare HashFunctionIdentities. Comparators are for ordering. So why do you need to order a HashFunction and why choose the two implementations provided?

From what I can see the COMMON comparator is used in the main code in Shape, DynamicHasher and StaticHasher with:

if (compare(a, b) != 0) {
    // not allowed ...
}

The DEEP comparator is not used in the main code.

The use case for checking that the hash function is the same seems to be better satisfied by using a package private HashFunctionIdentityEquality helper method:

    static boolean areEqual(HashFunctionIdentity a, HashFunctionIdentity b) {
        return (a.getSignedness() == b.getSignedness() &&
            a.getProcessType() == b.getProcessType() &&
            a.getName().equalsIgnoreCase(b.getName()));
    }

This would drop the comparators that no-one actually requires and clean-up the public API.

I note that not everything uses the comparator. AbstractBloomFilter.verfiyHasher uses:

if (a.getSignature() != b.getSignature()) {
    // not allowed ...
}

However the signature (a long value) could be the same by chance with p=2^-64. So why is this class allow to use getSignature() but others do a full hash identity comparison? It seems to me you either do one or the other: a full property equality test; or a quick signature check (which is error prone); but not mix and match in different parts of the code.


3. AbtractBloomFilter

I considered changing the cardinality methods to avoid using BitSet cardinality() and computing the cardinality direct. BitSet will duplicate the input long[] in the constructor. For the or/xor/and cardinality this is 2 array creations that can be avoided with a direct implementation. For the cardinality method it avoids 1 array creation via BitSet. Here is the type of implementation for a direct computation:

    @Override
    public int cardinality() {
        int count = 0;
        for (final long bits : getBits()) {
            count += Long.bitCount(bits);
        }
        return count;
    }

    @Override
    public int andCardinality(final BloomFilter other) {
        verifyShape(other);
        final long[] mine = getBits();
        final long[] theirs = other.getBits();
        final int limit = Integer.min(mine.length, theirs.length);
        int count = 0;
        for (int i = 0; i < limit; i++) {
            count += Long.bitCount(mine[i] & theirs[i]);
        }
        return count;
    }

    @Override
    public int orCardinality(final BloomFilter other) {
        verifyShape(other);
        final long[] mine = getBits();
        final long[] theirs = other.getBits();
        long[] small;
        long[] big;
        if (mine.length > theirs.length) {
            big = mine;
            small = theirs;
        } else {
            small = mine;
            big = theirs;
        }
        int count = 0;
        for (int i = 0; i < small.length; i++) {
            // OR
            count += Long.bitCount(small[i] | big[i]);
        }
        for (int i = small.length; i < big.length; i++) {
            count += Long.bitCount(big[i]);
        }
        return count;
    }

The xorCardinality is the same as orCardinality but using the ^ operator.

I didn't make the change but this should be considered after a performance test with JMH.

There are a few other places where specialised implementations may be more performant so a JMH benchmark would be useful for Bloom filter, e.g. the important

BloomFilter.contains(Hasher)

BloomFilter.contains(BloomFilter)


Alex



---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@commons.apache.org
For additional commands, e-mail: dev-h...@commons.apache.org

Reply via email to