Re: [collections] Bloom filters

Claude Warren Mon, 17 Feb 2020 12:31:14 -0800

Alex,

Thank you for your comments.


See comments inline.



On Mon, Feb 17, 2020 at 3:20 PM Alex Herbert <alex.d.herb...@gmail.com>
wrote:

> I had a look through all the BloomFilter code. Thanks Claude for the
> contribution.
>
> Some items that need clarifying:
>
>
> 1. HashFunctionIdentity.Signedness
>
> This is not fully documented as to what the sign applies to. There is no
> javadoc on the enum values for SIGNED and UNSIGNED. The current javadoc
> states "Identifies the signedness of the calculations for this function".
>
> I assume this applies to the Hash computation in 'long
> HashFunction.apply(byte[], int)'
>
> Does this mean the hash algorithm has a variant that can treat the
> bytes/seed as signed or unsigned. If so which one because 2 enum values
> cannot cover all 4 possibilities. Since there is no javadoc it is
> unclear exactly what this property is supposed to indicate.
>
> I will attempt to clarify in the javadoc.  Basically when the underlying
hash function executes it returns an array of bytes.  The algorithm can
treat those bytes as signed or unsigned long values.  For the java
implementations this is probably signed but other languages could implement
it as unsigned.   For example the murmur128 has returns 128 bytes this
could be interpreted as 2 64-bit signed or unsigned longs.  The Signedness
attribute describes how that is interpreted.  As for the other values the
seed is a signed value and the byte buffer is a byte buffer interpreted as
per the underlying hash function.


> 2. HashFunctionIdentity comparators
>
> The interface HashFunctionIdentity defines constants which are
> comparators. Do these need to be here? Typically comparators should be
> Serializable. Are these comparators ever going to be bundled with
> BloomFilters and serialized? There are lots of ways to compare
> HashFunctionIdentities. Comparators are for ordering. So why do you need
> to order a HashFunction and why choose the two implementations provided?
>

There was a discussion on the list about building factories where the
provider of the HashFunctionIdentity would be important (Think in terms of
the JCA where multiple providers can provide the same hash/encryption
functions but an application may wish to choose one over the other).

>
>  From what I can see the COMMON comparator is used in the main code in
> Shape, DynamicHasher and StaticHasher with:
>
> if (compare(a, b) != 0) {
>      // not allowed ...
> }
>
> The DEEP comparator is not used in the main code.


> The use case for checking that the hash function is the same seems to be
> better satisfied by using a package private HashFunctionIdentityEquality
> helper method:
>
>      static boolean areEqual(HashFunctionIdentity a,
> HashFunctionIdentity b) {
>          return (a.getSignedness() == b.getSignedness() &&
>              a.getProcessType() == b.getProcessType() &&
>              a.getName().equalsIgnoreCase(b.getName()));
>      }
>
> This would drop the comparators that no-one actually requires and
> clean-up the public API.
>
> I can see the value of this solution.

I note that not everything uses the comparator.
> AbstractBloomFilter.verfiyHasher uses:
>
> if (a.getSignature() != b.getSignature()) {
>      // not allowed ...
> }
>
> However the signature (a long value) could be the same by chance with
> p=2^-64. So why is this class allow to use getSignature() but others do
> a full hash identity comparison? It seems to me you either do one or the
> other: a full property equality test; or a quick signature check (which
> is error prone); but not mix and match in different parts of the code.
>
> The reasoning for the second part is that bloom filter comparisons must be
very fast.  The possibility of signature collision is low but not zero.
However, trusting the name has issues as well.  There are a number of
murmur128 hash implementations that are not correct.  Using the signature
will detect those because the signature value is built by the hashing
function executing on the name.  Keep in mind that the actual function that
performed the hash may not be present when the signature is verified.

>
> 3. AbtractBloomFilter
>
> I considered changing the cardinality methods to avoid using BitSet
> cardinality() and computing the cardinality direct. BitSet will
> duplicate the input long[] in the constructor. For the or/xor/and
> cardinality this is 2 array creations that can be avoided with a direct
> implementation. For the cardinality method it avoids 1 array creation
> via BitSet. Here is the type of implementation for a direct computation:
>
>      @Override
>      public int cardinality() {
>          int count = 0;
>          for (final long bits : getBits()) {
>              count += Long.bitCount(bits);
>          }
>          return count;
>      }
>
>      @Override
>      public int andCardinality(final BloomFilter other) {
>          verifyShape(other);
>          final long[] mine = getBits();
>          final long[] theirs = other.getBits();
>          final int limit = Integer.min(mine.length, theirs.length);
>          int count = 0;
>          for (int i = 0; i < limit; i++) {
>              count += Long.bitCount(mine[i] & theirs[i]);
>          }
>          return count;
>      }
>
>      @Override
>      public int orCardinality(final BloomFilter other) {
>          verifyShape(other);
>          final long[] mine = getBits();
>          final long[] theirs = other.getBits();
>          long[] small;
>          long[] big;
>          if (mine.length > theirs.length) {
>              big = mine;
>              small = theirs;
>          } else {
>              small = mine;
>              big = theirs;
>          }
>          int count = 0;
>          for (int i = 0; i < small.length; i++) {
>              // OR
>              count += Long.bitCount(small[i] | big[i]);
>          }
>          for (int i = small.length; i < big.length; i++) {
>              count += Long.bitCount(big[i]);
>          }
>          return count;
>      }
>
> The xorCardinality is the same as orCardinality but using the ^ operator.
>
> I didn't make the change but this should be considered after a
> performance test with JMH.
>

Indeed your solutions would work against the longs returned by getBits()
and makes more sense.


>
> There are a few other places where specialised implementations may be
> more performant so a JMH benchmark would be useful for Bloom filter,
> e.g. the important
>
> BloomFilter.contains(Hasher)
>
> BloomFilter.contains(BloomFilter)
>
> If you have better "standard"  implementations let's implement them.
Implementations that have special needs (e.g. CountingBloomFilter,
MultidimensionalBloomFilter, AttenuatedBloomFilter) will implement as they
see fit.

Alex
>
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: dev-unsubscr...@commons.apache.org
> For additional commands, e-mail: dev-h...@commons.apache.org
>
>

-- 
I like: Like Like - The likeliest place on the web
<http://like-like.xenei.com>
LinkedIn: http://www.linkedin.com/in/claudewarren

Re: [collections] Bloom filters

Reply via email to