On 18/02/2020 09:54, Claude Warren wrote:
On Tue, Feb 18, 2020 at 9:12 AM Alex Herbert <alex.d.herb...@gmail.com>
wrote:


My maths is rusty.  If A=0xF000ABCD as interpreted as an unsigned  and
B=0xF000ABCD but interpreted as a signed does (A mod N) = (B mod N) for
all
positive values of N?  If so then you are correct and Signedness does not
matter and can be removed.  I thought that the statement was false.

Yes, you are correct. Addition and multiplication use the bits in the same
way. Modulus does not.

So are the Murmur3 hash functions actually unsigned? I thought the
original c code returned unsigned 64-bit values (uint64_t).

IIUC the purpose of the Signedness enum is to specify that a negative
value returned from the hash function should be used as a numerical
magnitude if then used in downstream computations since sign extension on
negatives will set leading bits to 1s. I.e. using the negative as if it
were unsigned (and all bits are random) is not valid. Perhaps you can give
an example of how signed or unsigned would be handled differently.


The murmur hash is not a good example here as it returns the resulting
buffer as 2 signed longs.  The MD5 is probably a better example as the
messageDigest.digest() returns a byte[128].  The code then interprets that
as 2 signed longs.  A C implementation could interpret that as 2 unsigned
longs.  Additionally, there are hashes that return smaller buffers such as
with murmur2 which returns a 32bit unsigned int in the C code.  In java
that could be represented as a long as per the Integer.asUnsignedlong()
method or it could just be cast to a long and thus signed.

The values returned are then passed to floorMod( value, numberOfBits ) to
turn on the appropriate bit in the resulting bit vector.  As noted above
this will yield different values depending upon how the byte[] was
represented.

Claude

Thanks for the explanation. So the signedness is only appropriate when the returned long from the hash function is smaller than 64-bits. Should it be documented to indicate that the upper sign bit of the 64-bit long will never be set by the Hasher (see example below)?

Reading about floorMod(x, y) it will return a positive value for any x if y is positive. If the input x and y have the same sign you will get the same result as x % y. However floorMod has to have a condition to check if the signs are the same and so will be slower.

So floorMod could be replaced with x % y when the Signedness is UNSIGNED and y is positive, or when both x & y are known to be positive. Looking at the current code that seems to apply to AbstractBloomFilter.contains(Hasher) which is using int values from the Hasher and assumes they are positive (it uses idx / Long.SIZE as a positive index). So this could use idx % Long.SIZE instead of floorMod. This makes sense as the Hasher should have already sanitised the raw output from the hash function into a positive int index for the appropriate bit in the filter.

The same applies to HasherBloomFilter.getBits() which is using floorMod(idx, Long.SIZE) when a negative idx would throw an index out of bounds on the previous line.

So elimination of floorMod in two places can be done to increase speed. However on browsing at BitSet the divide and mod operations can be removed altogether using shifts:

    final int buffIdx = idx / Long.SIZE;
    final int pwr = Math.floorMod(idx, Long.SIZE);
    final long buffOffset = 1L << pwr;

    // Equivalent assuming idx is positive since the << left shift works using the first 6 bits.

    final int buffIdx = idx >> 6;
    final long buffOffset = 1L << idx;


This also opens the possibility of using the HashFunctionIdentity Signedness to do the modulus in DynamicHasher:

    /**
     * Identifies the signedness of the calculations for this function.
     * <p>
     * When the hash function executes it typically returns an array of bytes.      * That array is converted into one or more numerical values which will be provided
     * as a {@code long} primitive type.
     * The signedness identifies if those {@code long} values are signed or unsigned.      * For example a hash function that outputs only 32-bits can be unsigned if converted      * using {@link Integer#toUnsignedLong(int)}. A hash function that outputs more than
     * 64-bits is typically signed.
     * </p>
     */
    enum Signedness {
        /**
         * The result of {@link HashFunction#apply(byte[], int)} is signed,
         * thus the sign bit may be set.
         *
         * <p>The result can be used with {@code Math.floorMod(x, y)} to generate a positive
         * value if y is positive.
         *
         * @see Math#floorMod(int, int)
         */
        SIGNED {
            @Override
            int mod(long x, int y) {
                // Cast to long to workaround a bug in animal-sniffer.
                return (int) Math.floorMod(x, (long) y);
            }
        },
        /**
         * The result of {@link HashFunction#apply(byte[], int)} is unsigned,
         * thus the sign bit is never set.
         *
         * <p>The result can be used with {@code x % y} to generate a positive
         * value if y is positive.
         */
        UNSIGNED {
            @Override
            int mod(long x, int y) {
                return (int) (x % y);
            }
        };

        /**
         * Get the modulus of the arguments. This function assumes the divisor is positive.          * The modulus is to be used to provide a positive index from random input value
         * {@code x}.
         *
         * @param x the dividend
         * @param y the divisor (must be positive)
         * @return the modulus
         */
        abstract int mod(long x, int y);
    }

DynamicHasher.Iterator then uses:

    @Override
    public int nextInt() {
        if (hasNext()) {
            if (funcCount >= shape.getNumberOfHashFunctions()) {
                funcCount = 0;
                buffer++;
            }
            return function.getSignedness().mod(
                      function.apply(buffers.get(buffer), funcCount++),
                      shape.getNumberOfBits());
        }
        throw new NoSuchElementException();
    }

Adding this mod function is a speed optimisation that will doubtless come around with a bug because it is misused. So should I update the Signedness to use the comments as above (based on your PR and this discussion) but without the abstract mod function? For now we can leave Math.floorMod in the DynamicHasher and remove it from AbstractBloomFilter and HasherBloomFilter using bit shifts. A negative x in those cases would throw an index out of bounds exception in the same function.




---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@commons.apache.org
For additional commands, e-mail: dev-h...@commons.apache.org

Reply via email to