On Sun, 21 Jan 2024 at 03:06, Jeff Davis <pg...@j-davis.com> wrote:
> Yes, thank you. I don't think we need to change the algorithm.

Jumping in here at a random point just to share my findings from
poking around this on and off. I am concentrating here on cstring
hashing as that is the most complicated one.

One thing that caught my eye in testing was that the unaligned cstring
code was unexpectedly faster for short strings (3-18 B, uniform
distribution). Looking into it, the cause was fasthash_accum() called
in the final iteration. In the unaligned case the compiler (clang-15)
unrolled the inner loop, which allowed it to jump directly to the
correct place in the switch. In the aligned case clang decided to use
a data-dependent jump, which then mispredicts practically all of the time.

But given that we know the data length and already have it in a
register, it's easy enough to just mask out the data past the end with
a shift. See patch 1. The performance benefit is about 1.5x, measured
on a small test harness that just hashes and finalizes an array of
strings, with a data dependency between consecutive hashes (the next
address depends on the previous hash output).
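
To illustrate the gist of patch 1 (this is only a sketch with made-up
names, little-endian only; the patch handles big-endian with a left
shift):

#include <stdint.h>

/*
 * Keep the first "len" bytes of a little-endian loaded word and zero the
 * rest with a single shift, instead of dispatching on the length in a
 * switch.  len must be 1..7; a shift by 64 would be undefined.
 */
static inline uint64_t
mask_first_bytes_le(uint64_t chunk, int len)
{
    return chunk & (~UINT64_C(0) >> (64 - 8 * len));
}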

The unaligned case can actually take advantage of the same trick as the
aligned one; it just has to shuffle the data from two consecutive words
before applying the combine function. Patch 2 implements this. It makes
the unaligned case almost as fast as the aligned one on both short and
long strings: about a 10% benefit on short strings and 50% on long
ones.
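
The core of that shuffle is just a pair of shifts, essentially what
align_n64() in the attached patch does. A standalone sketch
(illustrative name, little-endian shown; big-endian swaps the shift
directions):

#include <stdint.h>

/*
 * Combine the tail of word a with the head of word b, as if an unaligned
 * 8-byte load had been done "offset" bytes into a.  offset must be 1..7.
 */
static inline uint64_t
combine_words_le(uint64_t a, uint64_t b, int offset)
{
    return (a >> (8 * offset)) | (b << (64 - 8 * offset));
}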

I am not sure the second patch is worth the extra code. A different
approach would be to use the simple word-at-a-time hashing for the
unaligned case too and handle word accesses that straddle a page
boundary as a special case. Obviously this only makes sense on
platforms that support unaligned access. On x86, unaligned access
within a cache line is basically free, and access across cache lines is
only slightly more expensive. In benchmarks, calling the aligned code
on unaligned strings has only a 5% penalty on long strings, and short
ones are indistinguishable.
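
That alternative is not in the attached patches, but the special case
could look roughly like this (a sketch only, assuming 4 kB pages and a
platform that permits unaligned 8-byte loads; names are made up):

#include <stdint.h>
#include <string.h>

/*
 * True when an 8-byte load starting at p would touch the following page,
 * in which case the caller should fall back to a byte-at-a-time tail.
 */
static inline int
load_crosses_page(const char *p)
{
    return ((uintptr_t) p & 4095) > 4096 - 8;
}

/* Unaligned load through memcpy; compiles to a plain mov on x86. */
static inline uint64_t
load_u64(const char *p)
{
    uint64_t v;

    memcpy(&v, p, sizeof(v));
    return v;
}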

I also took a look at using SIMD to implement the hash with the same
aligned access + shuffle trick. The good news is that the shuffling
works well enough that neither it nor the check for the string end is
on the longest dependency chain. The bad news is that the data load,
alignment, zero finding and masking form a big dependency chain on the
first iteration. Mixing and finalization are even worse: fasthash uses
a 64-bit imul instruction with a 3-cycle latency, so the
iteration-to-iteration chain is imul + xor, or 4 cycles per 8 bytes,
i.e. 2 B/cycle (in practice a bit less due to ALU port contention). In
SIMD registers there is no 64-bit multiply, and the 32-bit multiply has
a terrible 10-cycle latency on Intel. AES instructions are an
interesting option, but it seems that 2 are needed for good enough
mixing, at 4 cycles each, so we again end up at 2 B/cycle. Finalization
needs another 3 AES instructions, a shuffle and a xor fold to pass
SMHasher, for 17 cycles. The mix latency issue could be worked around
by doing more mixing in parallel, potentially up to 8x faster, but that
does not help short strings at all and would make the code considerably
bigger. The SIMD code does use fewer instructions, so it interleaves
better with nearby code that is not dependent on it; I am not sure
whether that matters anywhere.

The short version is that for very long (4k+) strings the attached
SIMD code is 35% faster, for short strings it is 35% slower, and it is
very much x86-64-v3 only and would need a fallback when AVX and
AES-NI are not available. Basically a dead end for the use cases this
hash function is meant for.

Regards,
Ants Aasma
From 912f46be12536985dda7bcfb669d4ec13e79d073 Mon Sep 17 00:00:00 2001
From: Ants Aasma <a...@cybertec.at>
Date: Mon, 29 Jan 2024 21:07:44 +0200
Subject: [PATCH 2/2] Unaligned fasthash word at a time hashing

About 10% performance benefit on short strings, 50% on long ones,
making the performance almost identical to the aligned case.
---
 src/include/common/hashfn_unstable.h | 156 +++++++++++++++++++++++----
 1 file changed, 138 insertions(+), 18 deletions(-)

diff --git a/src/include/common/hashfn_unstable.h b/src/include/common/hashfn_unstable.h
index 8ee1b99a204..1e44814d84a 100644
--- a/src/include/common/hashfn_unstable.h
+++ b/src/include/common/hashfn_unstable.h
@@ -189,6 +189,38 @@ first_byte_nonzero(uint64 v)
 #endif
 }
 
+/*
+ * Selects the first n bits in memory order and zeroes the rest.
+ * Using value 0 for n results in undefined behavior.
+ */
+static inline uint64
+first_n64(uint64 v, uint64 n)
+{
+	Assert(0 < n && n <= 64);
+#ifdef WORDS_BIGENDIAN
+		return v & ((~0ULL) << (64 - n));
+#else
+		return v & ((~0ULL) >> (64 - n));
+#endif
+}
+
+/*
+ * Does the equivalent of an unaligned word access into two consecutive
+ * words, taking the last 8 - offset bytes from the first word and the
+ * first offset bytes from the second word. offset must be in range [1..7].
+ */
+static inline uint64
+align_n64(uint64 a, uint64 b, int offset)
+{
+	Assert(offset > 0 && offset < 8);
+#ifdef WORDS_BIGENDIAN
+		return (a << (offset * 8)) | (b >> (64 - offset * 8));
+#else
+		return (a >> (offset * 8)) | (b << (64 - offset * 8));
+#endif
+}
+
+
 /*
  * all-purpose workhorse for fasthash_accum_cstring
  */
@@ -212,17 +244,15 @@ fasthash_accum_cstring_unaligned(fasthash_state *hs, const char *str)
 }
 
 /*
- * specialized workhorse for fasthash_accum_cstring
+ * specialized workhorse for fasthash_accum_cstring_autoaligned
  *
- * With an aligned pointer, we consume the string a word at a time.
- * Loading the word containing the NUL terminator cannot segfault since
- * allocation boundaries are suitably aligned.
+ * If the string is aligned we don't need to correct for alignment.
  */
 static inline int
 fasthash_accum_cstring_aligned(fasthash_state *hs, const char *str)
 {
 	const char *const start = str;
-	int			remainder;
+	int			remainder_bits;
 	uint64		zero_bytes_le;
 	uint64		chunk;
 
@@ -259,18 +289,111 @@ fasthash_accum_cstring_aligned(fasthash_state *hs, const char *str)
 	 */
 	if (first_byte_nonzero(chunk))
 	{
-		remainder = pg_rightmost_one_pos64(zero_bytes_le) / BITS_PER_BYTE;
+		remainder_bits = pg_rightmost_one_pos64(zero_bytes_le);
+		hs->accum = first_n64(chunk, remainder_bits);
+		fasthash_combine(hs);
+
+		str += remainder_bits / BITS_PER_BYTE;
+	}
+
+	return str - start;
+}
+
+/*
+ * specialized implementation for fasthash_accum_cstring
+ *
+ * With an aligned pointer, we consume the string a word at a time.
+ * Loading the word containing the NUL terminator cannot segfault since
+ * allocation boundaries are suitably aligned.
+ *
+ * If the pointer is not aligned we can still load a word at a time, but
+ * have to combine the hashed word from two words using the alignment
+ * offset.
+ */
+static inline int
+fasthash_accum_cstring_autoaligned(fasthash_state *hs, const char *str)
+{
+	const char *const start = str;
+	int			remainder_bits;
+	uint64		zero_bytes_le;
+	int			offset;
+	uint64		chunk, prev;
+
+	offset = (uintptr_t) str & 7;
+
+	/*
+	 * Special case aligned string. Needed to avoid shift by 64 in
+	 * alignment code. It's also faster.
+	 */
+	if (!offset)
+		return fasthash_accum_cstring_aligned(hs, str);
+
+	/* Not using MAXALIGN here because we need to round down */
+	str = (const char *) ((uintptr_t) str & ~7);
+	chunk = *(uint64 *) str;
+	/* Mask out the preceding bytes with -1 */
+	chunk |= (-1ULL >> (64 - 8*offset));
+	zero_bytes_le = haszero64(chunk);
+	/* String ends in first loaded word. */
+	if (zero_bytes_le) {
+		remainder_bits = pg_rightmost_one_pos64(zero_bytes_le);
+
+		/*
+		 * Special case empty string to avoid shift by 64 below and have
+		 * the number of combine operations match the unaligned version.
+		 **/
+		if (remainder_bits < BITS_PER_BYTE)
+		{
+			return 0;
+		}
+
+		/* Mask out everything past the last byte and align */
+		hs->accum = align_n64(first_n64(chunk , remainder_bits), 0, offset);
+		fasthash_combine(hs);
+		return remainder_bits / BITS_PER_BYTE - offset;
+	}
+
+	prev = chunk;
+	str += FH_SIZEOF_ACCUM;
+	for (;;)
+	{
+		chunk = *(uint64 *) str;
+
+		/*
+		 * With little-endian representation, we can use this calculation,
+		 * which sets bits in the first byte in the result word that
+		 * corresponds to a zero byte in the original word. The rest of the
+		 * bytes are indeterminate, so cannot be used on big-endian machines
+		 * without either swapping or a bytewise check.
+		 */
 #ifdef WORDS_BIGENDIAN
-		hs->accum = chunk & ((~0ULL) << (64 - BITS_PER_BYTE*remainder));
+		zero_bytes_le = haszero64(pg_bswap64(chunk));
 #else
-		hs->accum = chunk & ((~0ULL) >> (64 - BITS_PER_BYTE*remainder));
+		zero_bytes_le = haszero64(chunk);
 #endif
+		if (zero_bytes_le)
+			break;
+
+		hs->accum = align_n64(prev, chunk, offset);
 		fasthash_combine(hs);
+		str += FH_SIZEOF_ACCUM;
+		prev = chunk;
+	}
+
+	remainder_bits = pg_rightmost_one_pos64(zero_bytes_le);
+	/* Relying on the terminating NUL byte contributing only zero bits */
+	chunk = first_n64(chunk, remainder_bits);
+
+	hs->accum = align_n64(prev, chunk, offset);
+	fasthash_combine(hs);
 
-		str += remainder;
+	if (remainder_bits / BITS_PER_BYTE > offset)
+	{
+		hs->accum = align_n64(chunk, 0, offset);
+		fasthash_combine(hs);
 	}
 
-	return str - start;
+	return (str - start) + remainder_bits / BITS_PER_BYTE;
 }
 
 /*
@@ -289,19 +412,16 @@ fasthash_accum_cstring(fasthash_state *hs, const char *str)
 	memcpy(&hs_check, hs, sizeof(fasthash_state));
 	len_check = fasthash_accum_cstring_unaligned(&hs_check, str);
 #endif
-	if (PointerIsAligned(str, uint64))
-	{
-		len = fasthash_accum_cstring_aligned(hs, str);
-		Assert(hs_check.hash == hs->hash && len_check == len);
-		return len;
-	}
-#endif							/* SIZEOF_VOID_P */
-
+	len = fasthash_accum_cstring_autoaligned(hs, str);
+	Assert(hs_check.hash == hs->hash && len_check == len);
+	return len;
+#else
 	/*
 	 * It's not worth it to try to make the word-at-a-time optimization work
 	 * on 32-bit platforms.
 	 */
 	return fasthash_accum_cstring_unaligned(hs, str);
+#endif							/* SIZEOF_VOID_P */
 }
 
 /*
-- 
2.34.1

From b226898acc15c329cf73308ff9d77f0a15f08322 Mon Sep 17 00:00:00 2001
From: Ants Aasma <a...@cybertec.at>
Date: Mon, 29 Jan 2024 15:16:02 +0200
Subject: [PATCH 1/2] Speed up last iteration of aligned fasthash

We know the length of the string, so we can mask out the end of the
string with a shift. Without this the aligned version was slower
than the unaligned one on small strings.
---
 src/include/common/hashfn_unstable.h | 31 ++++++++++++++++++++++++----
 1 file changed, 27 insertions(+), 4 deletions(-)

diff --git a/src/include/common/hashfn_unstable.h b/src/include/common/hashfn_unstable.h
index 3d927e1fb18..8ee1b99a204 100644
--- a/src/include/common/hashfn_unstable.h
+++ b/src/include/common/hashfn_unstable.h
@@ -176,6 +176,19 @@ fasthash_accum(fasthash_state *hs, const char *k, int len)
 #define haszero64(v) \
 	(((v) - 0x0101010101010101) & ~(v) & 0x8080808080808080)
 
+/*
+ * Returns non-zero when first byte in memory order is not NUL
+ */
+static inline int
+first_byte_nonzero(uint64 v)
+{
+#ifdef WORDS_BIGENDIAN
+		return v >> 56;
+#else
+		return v & 0xFF;
+#endif
+}
+
 /*
  * all-purpose workhorse for fasthash_accum_cstring
  */
@@ -211,11 +224,12 @@ fasthash_accum_cstring_aligned(fasthash_state *hs, const char *str)
 	const char *const start = str;
 	int			remainder;
 	uint64		zero_bytes_le;
+	uint64		chunk;
 
 	Assert(PointerIsAligned(start, uint64));
 	for (;;)
 	{
-		uint64		chunk = *(uint64 *) str;
+		chunk = *(uint64 *) str;
 
 		/*
 		 * With little-endian representation, we can use this calculation,
@@ -243,9 +257,18 @@ fasthash_accum_cstring_aligned(fasthash_state *hs, const char *str)
 	 * byte within the input word by counting the number of trailing (because
 	 * little-endian) zeros and dividing the result by 8.
 	 */
-	remainder = pg_rightmost_one_pos64(zero_bytes_le) / BITS_PER_BYTE;
-	fasthash_accum(hs, str, remainder);
-	str += remainder;
+	if (first_byte_nonzero(chunk))
+	{
+		remainder = pg_rightmost_one_pos64(zero_bytes_le) / BITS_PER_BYTE;
+#ifdef WORDS_BIGENDIAN
+		hs->accum = chunk & ((~0ULL) << (64 - BITS_PER_BYTE*remainder));
+#else
+		hs->accum = chunk & ((~0ULL) >> (64 - BITS_PER_BYTE*remainder));
+#endif
+		fasthash_combine(hs);
+
+		str += remainder;
+	}
 
 	return str - start;
 }
-- 
2.34.1

#include <inttypes.h>
#include <immintrin.h>

typedef uint64_t uint64;

#define true 1

#define FNV_PRIME 16777619

#define seq _mm_setr_epi8(0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15)

// short aliases for intrinsics for readability

// AVX2  Latency 1, TP: 1 on Intel, 2-3 on Zen
#define broadcastb _mm_set1_epi8
#define broadcastd _mm_set1_epi32
// AVX Latency 1, TP: 2-4
#define cmpeqb _mm_cmpeq_epi8
#define cmpgtb _mm_cmpgt_epi8
// AVX Latency: 3 on Intel, 5 on AMD, TP: 1
#define movmskb _mm_movemask_epi8
// AVX Latency: 1 (2 for Zen4), TP: 2 (1 pre Ice Lake)
#define pshufb _mm_shuffle_epi8
// AVX Latency: 1, TP: 3 (4 for Zen4)
#define paddb _mm_add_epi8
// AVX Latency: 10 on Intel, 3-4 on AMD TP: 1 on Intel, 2 on >=Zen3
#define pmulld _mm_mullo_epi32
// AVX Latency: 1 TP: 2 on >=Ice Lake
#define psrldq _mm_srli_si128
#define psrld _mm_srli_epi32
// BMI1 Latency: 3, 2 on AMD (1 on Zen4), TP: 1 on Intel 2 on AMD
#define tzcnt _tzcnt_u32
#define aesenc _mm_aesenc_si128

static inline __m128i
align_vec_single(__m128i a, int offset)
{
    /*
     * Shift bytes to the start of the vector and replace the shifted-in
     * bytes with 0 by letting 0x70 + seq + offset overflow into the high
     * bit, which makes pshufb zero that byte.
     *
     * Relying on the compiler to hoist the loop-invariant paddb out of loops.
     **/ 
    return pshufb(a, paddb(seq, broadcastb(0x70+offset)));
}

static inline __m128i
align_vec(__m128i a, __m128i b, int offset)
{
    /*
     * OR together a shifted down by offset bytes with the first offset
     * bytes of b placed at the top. Shuffle indices that overflow or
     * underflow get the high bit set, which makes pshufb zero those bytes.
     **/
    return pshufb(a, paddb(seq, broadcastb(0x70+offset)))
         | pshufb(b, paddb(seq, broadcastb(-16 + offset)));
}

static inline int
find_zeroes(__m128i a)
{
    return movmskb(cmpeqb(a, broadcastb(0)));
}

#ifdef HAS_AES
// lifted from https://github.com/tildeleb/aeshash/blob/master/aeshash.go
#define key1 _mm_setr_epi8(0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07, 0x08,\
                           0x09, 0x0A, 0x0B, 0x0C, 0x0D, 0x0E, 0x0F, 0x10)
#define key2 _mm_setr_epi8(0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18,\
                           0x19, 0x1A, 0x1B, 0x1C, 0x1D, 0x1E, 0x1F, 0xFF)

static inline __m128i
mix(__m128i hash, __m128i data)
{
    hash = aesenc(hash, key1);
    return aesenc(hash, data);
}

static inline uint64
finalize(__m128i hash)
{
    hash = pshufb(hash, _mm_setr_epi8(0, 5, 10, 15,
                                      4, 9, 14, 3,
                                      8, 13, 2, 7,
                                      12, 1, 6, 11));
    hash = aesenc(hash, key1);
    hash = aesenc(hash, key2);
    hash = aesenc(hash, key1);
        
    return (hash ^ psrldq(hash, 8))[0];
}
    

#else
static inline __m128i
mix(__m128i hash, __m128i data)
{
    __m128i tmp = hash ^ data;
    return pmulld(tmp, broadcastd(FNV_PRIME)) ^ psrld(tmp, 17);
}

static inline uint64
finalize(__m128i hash)
{
    // Mix vector rotated by a single word so high word gets mixed in with
    // low word in 64bit result
    hash = mix(hash, pshufb(hash, paddb(seq, broadcastb(4))));
    return (hash ^ psrldq(hash, 8))[0];
}
#endif


/*
 * Fast vector hash calculates 4 parallel 32bit hashes across the data. The
 * finalizer is responsible for mixing the parallel hashes to a single value.
 * For strings that are not a multiple of 16 bytes the string is padded with
 * zeroes.
 *
 * It would be possible to pad to 4-byte alignment and skip the final hash
 * iteration for the parallel hashes with missing data, but hashing in
 * zeroes is faster and only improves mixing.
 */
uint64
fast_vec_hash_cstring_aligned(__m128i hash, char *buf)
{
    char *cur = buf;
    while (true)
    {
        __m128i chunk = *(__m128i*) cur;
        int mask = find_zeroes(chunk);
        if (mask) {
            int end = tzcnt(mask);
            // Mask out everything past the end
            chunk &= cmpgtb(broadcastb(end), seq);
            return finalize(mix(hash, chunk));
        }
        hash = mix(hash, chunk);
        cur += sizeof(chunk);
    }
}

/*
 * Unaligned version of the vectorized hash performs alignment in SIMD
 * registers. x86 supports unaligned loads, but we don't want to use them,
 * for two reasons. Loads that straddle cache line boundaries are
 * significantly slower than loads within a cache line, and even more so
 * for loads across page boundaries. More importantly, loads crossing page
 * boundaries might segfault if the next page happens to be unmapped.
 */
uint64
fast_vec_hash_cstring(__m128i hash, char *buf)
{
    int offset = ((uintptr_t) buf) & (sizeof(__m128i) - 1);

#ifdef SPECIAL_CASE_ALIGNED 
    /*
     * Instruction analysis shows that the inner loop is bound by the mixing
     * latency, so the alignment overhead should not matter.
     **/ 
    if (SPECIAL_CASE_ALIGNED && !offset)
        return fast_vec_hash_cstring_aligned(hash, buf);
#endif

    char *cur = buf - offset;
    __m128i chunk = *(__m128i*) cur;

    // Mask out the first offset bytes so a string end is not matched there
    int mask = find_zeroes(chunk) & (~0U << offset);
    // If the string ends in the first chunk we can return immediately
    if (mask)
    {
        int end = tzcnt(mask);
        // Mask out everything past the end
        chunk &= cmpgtb(broadcastb(end), seq);        
        return finalize(mix(hash, align_vec_single(chunk, offset)));
    }

    /*
     * Need to keep track of 2 vectors to perform alignment.
     *
     * _ <- already hashed, or before string
     * # <- data for next iteration if string did not end
     * prev                     chunk
     * [____|_012|3456|789A]    [BCDE|F###|####|####]
     *        ^offset
     */
    cur += sizeof(chunk);
    
    while (true)
    {
        __m128i prev = chunk;
        chunk = *(__m128i*) cur;
        mask = find_zeroes(chunk);
        // Found end of string
        if (mask) {
            int end = tzcnt(mask);
            // Mask everything past end of string with 0
            chunk &= cmpgtb(broadcastb(end), seq);
            hash = mix(hash, align_vec(prev, chunk, offset));
            if (end > offset) {
                hash = mix(hash, align_vec_single(chunk, offset));
            }
            return finalize(hash);
        }
        hash = mix(hash, align_vec(prev, chunk, offset));
        cur += sizeof(chunk);
    }
}
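
/*
 * Hypothetical driver, not part of the attached code: the seed and the
 * test string are arbitrary, and the build flags are only a guess
 * (something like gcc -O2 -march=x86-64-v3, plus -maes -DHAS_AES for the
 * AES variant).
 */
#include <stdio.h>

int
main(void)
{
    /* Broadcast an arbitrary seed into all four 32-bit lanes. */
    __m128i seed = broadcastd(0x2545F491);
    /* Padded and aligned so the 16-byte reads stay inside the buffer. */
    static char buf[32] __attribute__((aligned(16))) = "hello, hash";

    printf("%016" PRIx64 "\n", fast_vec_hash_cstring(seed, buf));
    return 0;
}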
