Instead of looping over every byte of the tail, unroll the loop manually
using a switch statement. Compilers (at least GCC and Clang) then
generate a jump table [1], which is faster in a microbenchmark [2].
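
For reference, the equivalence of the two forms can be checked with a
small standalone sketch like the one below. It is not part of the patch:
the names load_bytes_loop, load_bytes_switch, byte_at and tails_agree are
made up for illustration, std::uint64_t stands in for std::size_t so the
shifts stay well defined on any target, and the standard [[fallthrough]]
attribute (C++17) is assumed. Each byte is converted through unsigned
char, as the loop form does, so that bytes >= 0x80 are not sign-extended
where plain char is signed.

```cpp
#include <cassert>
#include <cstdint>

// Loop form, modelled on the code being replaced. Requires 1 <= n <= 7.
inline std::uint64_t load_bytes_loop(const char* p, int n)
{
  std::uint64_t result = 0;
  --n;
  do
    result = (result << 8) + static_cast<unsigned char>(p[n]);
  while (--n >= 0);
  return result;
}

// Hypothetical helper: byte i of p, zero-extended and moved into place.
inline std::uint64_t byte_at(const char* p, int i)
{
  return static_cast<std::uint64_t>(static_cast<unsigned char>(p[i]))
         << (8 * i);
}

// Switch form: every case falls through to the cases below it, so
// entering at case n loads bytes n-1 down to 0.
inline std::uint64_t load_bytes_switch(const char* p, int n)
{
  std::uint64_t result = 0;
  switch (n & 7)
    {
    case 7: result |= byte_at(p, 6); [[fallthrough]];
    case 6: result |= byte_at(p, 5); [[fallthrough]];
    case 5: result |= byte_at(p, 4); [[fallthrough]];
    case 4: result |= byte_at(p, 3); [[fallthrough]];
    case 3: result |= byte_at(p, 2); [[fallthrough]];
    case 2: result |= byte_at(p, 1); [[fallthrough]];
    case 1: result |= byte_at(p, 0);
    }
  return result;
}

// Returns true if both forms agree for every tail length from 1 to 7.
// The buffer deliberately contains bytes with the top bit set.
inline bool tails_agree()
{
  const char buf[7] = { '\x01', '\x80', '\x7f', '\xff',
                        '\x00', '\xa5', '\x10' };
  for (int n = 1; n <= 7; ++n)
    if (load_bytes_loop(buf, n) != load_bytes_switch(buf, n))
      return false;
  return true;
}
```

Calling tails_agree() from a test driver, for example as
assert(tails_agree()), is enough to exercise all seven entry points of
the switch.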

[1]: https://godbolt.org/z/aE8Mq3j5G
[2]: https://quick-bench.com/q/ylYLW2R22AZKRvameYYtbYxag24

libstdc++-v3/ChangeLog:

	* libstdc++-v3/libsupc++/hash_bytes.cc (load_bytes): unroll loop
	using switch statement.

Signed-off-by: Dmitry Ilvokhin <d...@ilvokhin.com>
---
 libstdc++-v3/libsupc++/hash_bytes.cc | 27 +++++++++++++++++++++++----
 1 file changed, 23 insertions(+), 4 deletions(-)

diff --git a/libstdc++-v3/libsupc++/hash_bytes.cc b/libstdc++-v3/libsupc++/hash_bytes.cc
index 3665375096a..294a7323dd0 100644
--- a/libstdc++-v3/libsupc++/hash_bytes.cc
+++ b/libstdc++-v3/libsupc++/hash_bytes.cc
@@ -50,10 +50,29 @@ namespace
   load_bytes(const char* p, int n)
   {
     std::size_t result = 0;
-    --n;
-    do
-      result = (result << 8) + static_cast<unsigned char>(p[n]);
-    while (--n >= 0);
+    switch(n & 7)
+      {
+      case 7:
+        result |= std::size_t(static_cast<unsigned char>(p[6])) << 48;
+        [[gnu::fallthrough]];
+      case 6:
+        result |= std::size_t(static_cast<unsigned char>(p[5])) << 40;
+        [[gnu::fallthrough]];
+      case 5:
+        result |= std::size_t(static_cast<unsigned char>(p[4])) << 32;
+        [[gnu::fallthrough]];
+      case 4:
+        result |= std::size_t(static_cast<unsigned char>(p[3])) << 24;
+        [[gnu::fallthrough]];
+      case 3:
+        result |= std::size_t(static_cast<unsigned char>(p[2])) << 16;
+        [[gnu::fallthrough]];
+      case 2:
+        result |= std::size_t(static_cast<unsigned char>(p[1])) << 8;
+        [[gnu::fallthrough]];
+      case 1:
+        result |= std::size_t(static_cast<unsigned char>(p[0]));
+      }
     return result;
   }
 
-- 
2.43.5