On Sun, 21 Jan 2024 at 03:06, Jeff Davis <pg...@j-davis.com> wrote:
> Yes, thank you. I don't think we need to change the algorithm.
Jumping in here at a random point just to share my findings from poking
around this on and off. I am concentrating here on cstring hashing, as
that is the most complicated case.

One thing that caught my eye in testing was that the unaligned cstring
code was unexpectedly faster for short strings (3-18B uniform
distribution). Looking into it, the cause was fasthash_accum() called in
the final iteration. In the unaligned case the compiler (clang-15)
unrolled the inner loop, which allowed it to jump directly into the
correct place in the switch. In the aligned case clang decided to use a
data-dependent jump, which then mispredicts all of the time. But given
that we know the data length, and we have it in a register already, it's
easy enough to just mask out the data past the end with a shift. See
patch 1. The performance benefit is about 1.5x, measured on a small test
harness that just hashes and finalizes an array of strings, with a data
dependency between consecutive hashes (the next address depends on the
previous hash output).

The unaligned case can actually take advantage of the same trick as the
aligned case, it just has to shuffle the data from two consecutive words
before applying the combine function. Patch 2 implements this. It makes
the unaligned case almost as fast as the aligned one, both on short and
long strings: a 10% benefit on short strings, 50% on long ones. Not sure
if the second one is worth the extra code.

A different approach would be to use the simple word-at-a-time hashing
for the unaligned case too, and handle word accesses that straddle a
page boundary as a special case (a rough sketch of such a special case
is included below). Obviously this only makes sense for platforms that
support unaligned access. On x86 unaligned access within a cache line is
basically free, and access across cache lines is only slightly more
expensive. In benchmarks, calling the aligned code on unaligned strings
only has a 5% penalty on long strings; short ones are indistinguishable.

I also took a look at using SIMD to implement the hash, using the same
aligned access + shuffle trick. The good news is that the shuffling
works well enough that neither it nor checking for the string end is the
longest dependency chain. The bad news is that the data load, alignment,
zero finding and masking form a big dependency chain on the first
iteration. Mixing and finalization are even worse: fasthash uses the
64-bit imul instruction, which has a 3-cycle latency, so the
iteration-to-iteration chain is imul + xor, for 4 cycles or 2 B/cycle
(in practice a bit less due to ALU port contention). In SIMD registers
there is no 64-bit multiply, and 32-bit multiply has a terrible 10-cycle
latency on Intel. AES instructions are an interesting option, but it
seems that two are needed for good enough mixing; at 4 cycles each, we
again end up at 2 B/cycle. Finalization needs another 3 AES
instructions, a shuffle and an xor fold to pass SMHasher, for 17 cycles.
The mix latency issue could be worked around by doing more mixing in
parallel, potentially up to 8x faster, but this does not help short
strings at all and would make the code way bigger. SIMD code does use
fewer instructions, so it interleaves better with nearby code that is
not dependent on it; not sure if that matters anywhere.

The short version is that for very long (4k+) strings the attached SIMD
code is 35% faster, for short strings it is 35% slower, and it is very
much x86-64-v3 only and would need a fallback when AVX and AES-NI are
not available. Basically a dead end for the use cases this hash function
is used for.

Regards,
Ants Aasma
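For illustration, a minimal sketch of the page-boundary special case
mentioned above (not part of the attached patches): it assumes a 4 kB
page size, and the PAGE_SIZE constant and helper name are made up for
the example.

#include <stdint.h>
#include <string.h>

#define PAGE_SIZE 4096          /* assumed page size, for illustration only */

/*
 * Hypothetical helper: load 8 bytes starting at an arbitrary address,
 * but fall back to a byte-wise copy (stopping at the NUL) when the load
 * would cross into the next, possibly unmapped, page.
 */
static inline uint64_t
load_word_check_page(const char *p)
{
    uint64_t    chunk;

    if (((uintptr_t) p & (PAGE_SIZE - 1)) > PAGE_SIZE - sizeof(chunk))
    {
        /* Load would straddle a page boundary: copy only up to the NUL. */
        char        buf[sizeof(chunk)] = {0};

        for (size_t i = 0; i < sizeof(buf) && p[i] != '\0'; i++)
            buf[i] = p[i];
        memcpy(&chunk, buf, sizeof(chunk));
    }
    else
        memcpy(&chunk, p, sizeof(chunk));   /* compiles to one unaligned load */

    return chunk;
}

The slow path can only trigger for the load nearest a page boundary, so
at most once per 4 kB of input, which should keep the branch well
predicted in the hot loop.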
From 912f46be12536985dda7bcfb669d4ec13e79d073 Mon Sep 17 00:00:00 2001
From: Ants Aasma <a...@cybertec.at>
Date: Mon, 29 Jan 2024 21:07:44 +0200
Subject: [PATCH 2/2] Unaligned fasthash word at a time hashing

About 10% performance benefit on short strings, 50% on long ones,
making the performance almost identical to the aligned case.
---
 src/include/common/hashfn_unstable.h | 156 +++++++++++++++++++++++----
 1 file changed, 138 insertions(+), 18 deletions(-)

diff --git a/src/include/common/hashfn_unstable.h b/src/include/common/hashfn_unstable.h
index 8ee1b99a204..1e44814d84a 100644
--- a/src/include/common/hashfn_unstable.h
+++ b/src/include/common/hashfn_unstable.h
@@ -189,6 +189,38 @@ first_byte_nonzero(uint64 v)
 #endif
 }
 
+/*
+ * Selects first n bits in memory order and masks the rest with NUL.
+ * Using value 0 for n results in undefined behavior.
+ */
+static inline uint64
+first_n64(uint64 v, uint64 n)
+{
+    Assert(0 < n && n <= 64);
+#ifdef WORDS_BIGENDIAN
+    return v & ((~0ULL) << (64 - n));
+#else
+    return v & ((~0ULL) >> (64 - n));
+#endif
+}
+
+/*
+ * Does the equivalent of an unaligned word access into two consecutive
+ * words, taking the last 8 - offset bytes from first and adding first
+ * offset bytes from second word. offset must be in range [1..7]
+ */
+static inline uint64
+align_n64(uint64 a, uint64 b, int offset)
+{
+    Assert(offset > 0 && offset < 8);
+#ifdef WORDS_BIGENDIAN
+    return (a << (offset * 8)) | (b >> (64 - offset * 8));
+#else
+    return (a >> (offset * 8)) | (b << (64 - offset * 8));
+#endif
+}
+
+
 /*
  * all-purpose workhorse for fasthash_accum_cstring
  */
@@ -212,17 +244,15 @@ fasthash_accum_cstring_unaligned(fasthash_state *hs, const char *str)
 }
 
 /*
- * specialized workhorse for fasthash_accum_cstring
+ * specialized workhorse for fasthash_accum_cstring_autoaligned
  *
- * With an aligned pointer, we consume the string a word at a time.
- * Loading the word containing the NUL terminator cannot segfault since
- * allocation boundaries are suitably aligned.
+ * If the string is aligned we don't need to correct for alignment.
  */
 static inline int
 fasthash_accum_cstring_aligned(fasthash_state *hs, const char *str)
 {
     const char *const start = str;
-    int         remainder;
+    int         remainder_bits;
     uint64      zero_bytes_le;
     uint64      chunk;
 
@@ -259,18 +289,111 @@ fasthash_accum_cstring_aligned(fasthash_state *hs, const char *str)
      */
     if (first_byte_nonzero(chunk))
     {
-        remainder = pg_rightmost_one_pos64(zero_bytes_le) / BITS_PER_BYTE;
+        remainder_bits = pg_rightmost_one_pos64(zero_bytes_le);
+        hs->accum = first_n64(chunk, remainder_bits);
+        fasthash_combine(hs);
+
+        str += remainder_bits / BITS_PER_BYTE;
+    }
+
+    return str - start;
+}
+
+/*
+ * specialized implementation for fasthash_accum_cstring
+ *
+ * With an aligned pointer, we consume the string a word at a time.
+ * Loading the word containing the NUL terminator cannot segfault since
+ * allocation boundaries are suitably aligned.
+ *
+ * If the pointer is not aligned we can still load a word at a time, but
+ * have to combine the hashed word from two words using the alignment
+ * offset.
+ */
+static inline int
+fasthash_accum_cstring_autoaligned(fasthash_state *hs, const char *str)
+{
+    const char *const start = str;
+    int         remainder_bits;
+    uint64      zero_bytes_le;
+    int         offset;
+    uint64      chunk, prev;
+
+    offset = (uintptr_t) str & 7;
+
+    /*
+     * Special case aligned string. Needed to avoid shift by 64 in
+     * alignment code. It's also faster.
+     */
+    if (!offset)
+        return fasthash_accum_cstring_aligned(hs, str);
+
+    /* Not using MAXALIGN here because we need to round down */
+    str = (const char *) ((uintptr_t) str & ~7);
+    chunk = *(uint64 *) str;
+    /* Mask out the preceding bytes with -1 */
+    chunk |= (-1ULL >> (64 - 8*offset));
+    zero_bytes_le = haszero64(chunk);
+    /* String ends in first loaded word. */
+    if (zero_bytes_le) {
+        remainder_bits = pg_rightmost_one_pos64(zero_bytes_le);
+
+        /*
+         * Special case empty string to avoid shift by 64 below and have
+         * the number of combine operations match the unaligned version.
+         **/
+        if (remainder_bits < BITS_PER_BYTE)
+        {
+            return 0;
+        }
+
+        /* Mask out everything past the last byte and align */
+        hs->accum = align_n64(first_n64(chunk, remainder_bits), 0, offset);
+        fasthash_combine(hs);
+        return remainder_bits / BITS_PER_BYTE - offset;
+    }
+
+    prev = chunk;
+    str += FH_SIZEOF_ACCUM;
+    for (;;)
+    {
+        chunk = *(uint64 *) str;
+
+        /*
+         * With little-endian representation, we can use this calculation,
+         * which sets bits in the first byte in the result word that
+         * corresponds to a zero byte in the original word. The rest of the
+         * bytes are indeterminate, so cannot be used on big-endian machines
+         * without either swapping or a bytewise check.
+         */
 #ifdef WORDS_BIGENDIAN
-        hs->accum = chunk & ((~0ULL) << (64 - BITS_PER_BYTE*remainder));
+        zero_bytes_le = haszero64(pg_bswap64(chunk));
 #else
-        hs->accum = chunk & ((~0ULL) >> (64 - BITS_PER_BYTE*remainder));
+        zero_bytes_le = haszero64(chunk);
 #endif
+        if (zero_bytes_le)
+            break;
+
+        hs->accum = align_n64(prev, chunk, offset);
         fasthash_combine(hs);
+        str += FH_SIZEOF_ACCUM;
+        prev = chunk;
+    }
+
+    remainder_bits = pg_rightmost_one_pos64(zero_bytes_le);
+    /* Relying on 0 remaining character having 7 zero bits */
+    chunk = first_n64(chunk, remainder_bits);
+
+    hs->accum = align_n64(prev, chunk, offset);
+    fasthash_combine(hs);
 
-        str += remainder;
+    if (remainder_bits / BITS_PER_BYTE > offset)
+    {
+        hs->accum = align_n64(chunk, 0, offset);
+        fasthash_combine(hs);
     }
 
-    return str - start;
+    return (str - start) + remainder_bits / BITS_PER_BYTE;
 }
 
 /*
@@ -289,19 +412,16 @@ fasthash_accum_cstring(fasthash_state *hs, const char *str)
     memcpy(&hs_check, hs, sizeof(fasthash_state));
     len_check = fasthash_accum_cstring_unaligned(&hs_check, str);
 #endif
-    if (PointerIsAligned(str, uint64))
-    {
-        len = fasthash_accum_cstring_aligned(hs, str);
-        Assert(hs_check.hash == hs->hash && len_check == len);
-        return len;
-    }
-#endif                          /* SIZEOF_VOID_P */
-
+    len = fasthash_accum_cstring_autoaligned(hs, str);
+    Assert(hs_check.hash == hs->hash && len_check == len);
+    return len;
+#else
     /*
      * It's not worth it to try to make the word-at-a-time optimization work
      * on 32-bit platforms.
      */
     return fasthash_accum_cstring_unaligned(hs, str);
+#endif                          /* SIZEOF_VOID_P */
 }
 
 /*
-- 
2.34.1
From b226898acc15c329cf73308ff9d77f0a15f08322 Mon Sep 17 00:00:00 2001
From: Ants Aasma <a...@cybertec.at>
Date: Mon, 29 Jan 2024 15:16:02 +0200
Subject: [PATCH 1/2] Speed up last iteration of aligned fasthash

We know the length of the string so we can mask out end of the string
with a shift. Without this the aligned version was slower than
unaligned on small strings.
---
 src/include/common/hashfn_unstable.h | 31 ++++++++++++++++++++++++----
 1 file changed, 27 insertions(+), 4 deletions(-)

diff --git a/src/include/common/hashfn_unstable.h b/src/include/common/hashfn_unstable.h
index 3d927e1fb18..8ee1b99a204 100644
--- a/src/include/common/hashfn_unstable.h
+++ b/src/include/common/hashfn_unstable.h
@@ -176,6 +176,19 @@ fasthash_accum(fasthash_state *hs, const char *k, int len)
 #define haszero64(v) \
     (((v) - 0x0101010101010101) & ~(v) & 0x8080808080808080)
 
+/*
+ * Returns non-zero when first byte in memory order is not NUL
+ */
+static inline int
+first_byte_nonzero(uint64 v)
+{
+#ifdef WORDS_BIGENDIAN
+    return v >> 56;
+#else
+    return v & 0xFF;
+#endif
+}
+
 /*
  * all-purpose workhorse for fasthash_accum_cstring
  */
@@ -211,11 +224,12 @@ fasthash_accum_cstring_aligned(fasthash_state *hs, const char *str)
     const char *const start = str;
     int         remainder;
     uint64      zero_bytes_le;
+    uint64      chunk;
 
     Assert(PointerIsAligned(start, uint64));
     for (;;)
     {
-        uint64      chunk = *(uint64 *) str;
+        chunk = *(uint64 *) str;
 
         /*
          * With little-endian representation, we can use this calculation,
@@ -243,9 +257,18 @@ fasthash_accum_cstring_aligned(fasthash_state *hs, const char *str)
      * byte within the input word by counting the number of trailing (because
      * little-endian) zeros and dividing the result by 8.
      */
-    remainder = pg_rightmost_one_pos64(zero_bytes_le) / BITS_PER_BYTE;
-    fasthash_accum(hs, str, remainder);
-    str += remainder;
+    if (first_byte_nonzero(chunk))
+    {
+        remainder = pg_rightmost_one_pos64(zero_bytes_le) / BITS_PER_BYTE;
+#ifdef WORDS_BIGENDIAN
+        hs->accum = chunk & ((~0ULL) << (64 - BITS_PER_BYTE*remainder));
+#else
+        hs->accum = chunk & ((~0ULL) >> (64 - BITS_PER_BYTE*remainder));
+#endif
+        fasthash_combine(hs);
+
+        str += remainder;
+    }
 
     return str - start;
 }
-- 
2.34.1
#include <inttypes.h>
#include <immintrin.h>

typedef uint64_t uint64;
#define true 1

#define FNV_PRIME 16777619
#define seq _mm_setr_epi8(0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15)

// short aliases for intrinsics for readability
// AVX2 Latency 1, TP: 1 on Intel, 2-3 on Zen
#define broadcastb _mm_set1_epi8
#define broadcastd _mm_set1_epi32
// AVX Latency 1, TP: 2-4
#define cmpeqb _mm_cmpeq_epi8
#define cmpgtb _mm_cmpgt_epi8
// AVX Latency: 3 on Intel, 5 on AMD, TP: 1
#define movmskb _mm_movemask_epi8
// AVX Latency: 1 (2 for Zen4), TP: 2 (1 pre Ice Lake)
#define pshufb _mm_shuffle_epi8
// AVX Latency: 1, TP: 3 (4 for Zen4)
#define paddb _mm_add_epi8
// AVX Latency: 10 on Intel, 3-4 on AMD, TP: 1 on Intel, 2 on >=Zen3
#define pmulld _mm_mullo_epi32
// AVX Latency: 1, TP: 2 on >=Ice Lake
#define psrldq _mm_srli_si128
#define psrld _mm_srli_epi32
// BMI1 Latency: 3, 2 on AMD (1 on Zen4), TP: 1 on Intel, 2 on AMD
#define tzcnt _tzcnt_u32
#define aesenc _mm_aesenc_si128

static inline __m128i align_vec_single(__m128i a, int offset)
{
    /*
     * Shift bytes to start of vector, replace shifted-in bytes with 0
     * by having 0x70+seq+offset overflow into the high bit, which makes
     * pshufb replace that byte with 0.
     *
     * Relying on the compiler to lift the loop-invariant paddb out of loops.
     **/
    return pshufb(a, paddb(seq, broadcastb(0x70 + offset)));
}

static inline __m128i align_vec(__m128i a, __m128i b, int offset)
{
    /*
     * Or together the last offset bytes shifted right by offset with the
     * first offset bytes. Overflow and underflow into negative replace
     * bytes with zero.
     **/
    return pshufb(a, paddb(seq, broadcastb(0x70 + offset)))
         | pshufb(b, paddb(seq, broadcastb(-16 + offset)));
}

static inline int find_zeroes(__m128i a)
{
    return movmskb(cmpeqb(a, broadcastb(0)));
}

#ifdef HAS_AES
// lifted from https://github.com/tildeleb/aeshash/blob/master/aeshash.go
#define key1 _mm_setr_epi8(0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07, 0x08,\
                           0x09, 0x0A, 0x0B, 0x0C, 0x0D, 0x0E, 0x0F, 0x10)
#define key2 _mm_setr_epi8(0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18,\
                           0x19, 0x1A, 0x1B, 0x1C, 0x1D, 0x1E, 0x1F, 0xFF)

static inline __m128i mix(__m128i hash, __m128i data)
{
    hash = aesenc(hash, key1);
    return aesenc(hash, data);
}

static inline uint64 finalize(__m128i hash)
{
    hash = pshufb(hash, _mm_setr_epi8(0, 5, 10, 15, 4, 9, 14, 3,
                                      8, 13, 2, 7, 12, 1, 6, 11));
    hash = aesenc(hash, key1);
    hash = aesenc(hash, key2);
    hash = aesenc(hash, key1);
    return (hash ^ psrldq(hash, 8))[0];
}
#else
static inline __m128i mix(__m128i hash, __m128i data)
{
    __m128i tmp = hash ^ data;
    return pmulld(tmp, broadcastd(FNV_PRIME)) ^ psrld(tmp, 17);
}

static inline uint64 finalize(__m128i hash)
{
    // Mix vector rotated by a single word so the high word gets mixed in
    // with the low word in the 64bit result
    hash = mix(hash, pshufb(hash, paddb(seq, broadcastb(4))));
    return (hash ^ psrldq(hash, 8))[0];
}
#endif

/*
 * Fast vector hash calculates 4 parallel 32bit hashes across the data. The
 * finalizer is responsible for mixing the parallel hashes to a single value.
 * For strings that are not a multiple of 16 bytes the string is padded with
 * zeroes.
 *
 * It would be possible to pad to 4 byte alignment and skip the final hash
 * iteration for the hashes with a missing value, but hashing in zeroes is
 * faster and only improves mixing.
 */
uint64 fast_vec_hash_cstring_aligned(__m128i hash, char *buf)
{
    char *cur = buf;

    while (true)
    {
        __m128i chunk = *(__m128i*) cur;
        int mask = find_zeroes(chunk);

        if (mask)
        {
            int end = tzcnt(mask);
            // Mask out everything past the end
            chunk &= cmpgtb(broadcastb(end), seq);
            return finalize(mix(hash, chunk));
        }
        hash = mix(hash, chunk);
        cur += sizeof(chunk);
    }
}

/*
 * Unaligned version of the vectorized hash performs alignment in SIMD
 * vectors. x86 supports unaligned loads, but we don't want to use them for
 * two reasons. Loads that straddle cache line boundaries are significantly
 * slower than loads within a cache line, even more so for loads across page
 * boundaries. More importantly, loads crossing page boundaries might
 * segfault if the next page happens to be unallocated.
 */
uint64 fast_vec_hash_cstring(__m128i hash, char *buf)
{
    int offset = ((uintptr_t) buf) & (sizeof(__m128i) - 1);

#ifdef SPECIAL_CASE_ALIGNED
    /*
     * Instruction analysis shows that the inner loop is mixing latency
     * bound, so alignment overhead should not matter.
     **/
    if (SPECIAL_CASE_ALIGNED && !offset)
        return fast_vec_hash_cstring_aligned(hash, buf);
#endif

    char *cur = buf - offset;
    __m128i chunk = *(__m128i*) cur;
    // Mask out the first offset bytes to not match string end there
    int mask = find_zeroes(chunk) & (~0L << offset);

    // If the string ends in the first chunk we can return immediately
    if (mask)
    {
        int end = tzcnt(mask);
        // Mask out everything past the end
        chunk &= cmpgtb(broadcastb(end), seq);
        return finalize(mix(hash, align_vec_single(chunk, offset)));
    }

    /*
     * Need to keep track of 2 vectors to perform alignment.
     *
     *  _ <- already hashed, or before string
     *  # <- data for next iteration if string did not end
     *
     *         prev                   chunk
     *  [____|_012|3456|789A]  [BCDE|F###|####|####]
     *        ^offset
     */
    cur += sizeof(chunk);
    while (true)
    {
        __m128i prev = chunk;
        chunk = *(__m128i*) cur;

        mask = find_zeroes(chunk);

        // Found end of string
        if (mask)
        {
            int end = tzcnt(mask);
            // Mask everything past end of string with 0
            chunk &= cmpgtb(broadcastb(end), seq);
            hash = mix(hash, align_vec(prev, chunk, offset));
            if (end > offset)
            {
                hash = mix(hash, align_vec_single(chunk, offset));
            }
            return finalize(hash);
        }

        hash = mix(hash, align_vec(prev, chunk, offset));
        cur += sizeof(chunk);
    }
}