I had a look through all the BloomFilter code. Thanks Claude for the
contribution.
Some items that need clarifying:
1. HashFunctionIdentity.Signedness
This is not fully documented as to what the sign applies to. There is no
javadoc on the enum values for SIGNED and UNSIGNED. The current javadoc
states "Identifies the signedness of the calculations for this function".
I assume this applies to the Hash computation in 'long
HashFunction.apply(byte[], int)'
Does this mean the hash algorithm has a variant that can treat the
bytes/seed as signed or unsigned. If so which one because 2 enum values
cannot cover all 4 possibilities. Since there is no javadoc it is
unclear exactly what this property is supposed to indicate.
2. HashFunctionIdentity comparators
The interface HashFunctionIdentity defines constants which are
comparators. Do these need to be here? Typically comparators should be
Serializable. Are these comparators ever going to be bundled with
BloomFilters and serialized? There are lots of ways to compare
HashFunctionIdentities. Comparators are for ordering. So why do you need
to order a HashFunction and why choose the two implementations provided?
From what I can see the COMMON comparator is used in the main code in
Shape, DynamicHasher and StaticHasher with:
if (compare(a, b) != 0) {
// not allowed ...
}
The DEEP comparator is not used in the main code.
The use case for checking that the hash function is the same seems to be
better satisfied by using a package private HashFunctionIdentityEquality
helper method:
static boolean areEqual(HashFunctionIdentity a,
HashFunctionIdentity b) {
return (a.getSignedness() == b.getSignedness() &&
a.getProcessType() == b.getProcessType() &&
a.getName().equalsIgnoreCase(b.getName()));
}
This would drop the comparators that no-one actually requires and
clean-up the public API.
I note that not everything uses the comparator.
AbstractBloomFilter.verfiyHasher uses:
if (a.getSignature() != b.getSignature()) {
// not allowed ...
}
However the signature (a long value) could be the same by chance with
p=2^-64. So why is this class allow to use getSignature() but others do
a full hash identity comparison? It seems to me you either do one or the
other: a full property equality test; or a quick signature check (which
is error prone); but not mix and match in different parts of the code.
3. AbtractBloomFilter
I considered changing the cardinality methods to avoid using BitSet
cardinality() and computing the cardinality direct. BitSet will
duplicate the input long[] in the constructor. For the or/xor/and
cardinality this is 2 array creations that can be avoided with a direct
implementation. For the cardinality method it avoids 1 array creation
via BitSet. Here is the type of implementation for a direct computation:
@Override
public int cardinality() {
int count = 0;
for (final long bits : getBits()) {
count += Long.bitCount(bits);
}
return count;
}
@Override
public int andCardinality(final BloomFilter other) {
verifyShape(other);
final long[] mine = getBits();
final long[] theirs = other.getBits();
final int limit = Integer.min(mine.length, theirs.length);
int count = 0;
for (int i = 0; i < limit; i++) {
count += Long.bitCount(mine[i] & theirs[i]);
}
return count;
}
@Override
public int orCardinality(final BloomFilter other) {
verifyShape(other);
final long[] mine = getBits();
final long[] theirs = other.getBits();
long[] small;
long[] big;
if (mine.length > theirs.length) {
big = mine;
small = theirs;
} else {
small = mine;
big = theirs;
}
int count = 0;
for (int i = 0; i < small.length; i++) {
// OR
count += Long.bitCount(small[i] | big[i]);
}
for (int i = small.length; i < big.length; i++) {
count += Long.bitCount(big[i]);
}
return count;
}
The xorCardinality is the same as orCardinality but using the ^ operator.
I didn't make the change but this should be considered after a
performance test with JMH.
There are a few other places where specialised implementations may be
more performant so a JMH benchmark would be useful for Bloom filter,
e.g. the important
BloomFilter.contains(Hasher)
BloomFilter.contains(BloomFilter)
Alex
---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@commons.apache.org
For additional commands, e-mail: dev-h...@commons.apache.org