Instead of looping over every byte of the tail, unroll the loop manually
using a switch statement. Compilers (at least GCC and Clang) then
generate a jump table [1], which is faster on a microbenchmark [2].

[1]: https://godbolt.org/z/aE8Mq3j5G
[2]: https://quick-bench.com/q/ylYLW2R22AZKRvameYYtbYxag24

libstdc++-v3/ChangeLog:

        * libsupc++/hash_bytes.cc (load_bytes): Unroll loop using
          switch statement.

Signed-off-by: Dmitry Ilvokhin <d...@ilvokhin.com>
---
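Not part of the patch: a minimal, standalone sketch for checking that a
switch-based tail load returns the same value as the original loop for
every tail length from 1 to 7. The names load_bytes_loop,
load_bytes_switch and tails_agree are hypothetical and exist only in
this sketch. It uses std::uint64_t rather than std::size_t so the shifts
stay valid on 32-bit targets, and it casts each byte to unsigned char,
as the original loop does, so that bytes above 0x7f are not
sign-extended where char is signed.

```cpp
#include <cassert>
#include <cstdint>

// Reference version: the loop the patch replaces (requires n >= 1).
std::uint64_t
load_bytes_loop(const char* p, int n)
{
  std::uint64_t result = 0;
  --n;
  do
    result = (result << 8) + static_cast<unsigned char>(p[n]);
  while (--n >= 0);
  return result;
}

// Unrolled version: each case loads one byte and falls through to the
// next lower index, so case k loads bytes k-1 down to 0.
std::uint64_t
load_bytes_switch(const char* p, int n)
{
  std::uint64_t result = 0;
  switch (n & 7)
    {
    case 7:
      result |= std::uint64_t(static_cast<unsigned char>(p[6])) << 48;
      [[gnu::fallthrough]];
    case 6:
      result |= std::uint64_t(static_cast<unsigned char>(p[5])) << 40;
      [[gnu::fallthrough]];
    case 5:
      result |= std::uint64_t(static_cast<unsigned char>(p[4])) << 32;
      [[gnu::fallthrough]];
    case 4:
      result |= std::uint64_t(static_cast<unsigned char>(p[3])) << 24;
      [[gnu::fallthrough]];
    case 3:
      result |= std::uint64_t(static_cast<unsigned char>(p[2])) << 16;
      [[gnu::fallthrough]];
    case 2:
      result |= std::uint64_t(static_cast<unsigned char>(p[1])) << 8;
      [[gnu::fallthrough]];
    case 1:
      result |= std::uint64_t(static_cast<unsigned char>(p[0]));
    }
  return result;
}

// Compare both versions on a buffer that mixes bytes below and above
// 0x7f, for every tail length the hash can pass in.
bool
tails_agree()
{
  const char buf[] = {'\x01', '\x80', '\xff', '\x7f', '\x00', '\xfe', '\x10'};
  for (int n = 1; n <= 7; ++n)
    if (load_bytes_loop(buf, n) != load_bytes_switch(buf, n))
      return false;
  return true;
}
```

Calling assert(tails_agree()) from a small test driver is enough to
exercise every case label, including bytes with the high bit set.
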
 libstdc++-v3/libsupc++/hash_bytes.cc | 27 +++++++++++++++++++++++----
 1 file changed, 23 insertions(+), 4 deletions(-)

diff --git a/libstdc++-v3/libsupc++/hash_bytes.cc b/libstdc++-v3/libsupc++/hash_bytes.cc
index 3665375096a..294a7323dd0 100644
--- a/libstdc++-v3/libsupc++/hash_bytes.cc
+++ b/libstdc++-v3/libsupc++/hash_bytes.cc
@@ -50,10 +50,29 @@ namespace
   load_bytes(const char* p, int n)
   {
     std::size_t result = 0;
-    --n;
-    do
-      result = (result << 8) + static_cast<unsigned char>(p[n]);
-    while (--n >= 0);
+    switch (n & 7)
+      {
+      case 7:
+       result |= std::size_t(static_cast<unsigned char>(p[6])) << 48;
+       [[gnu::fallthrough]];
+      case 6:
+       result |= std::size_t(static_cast<unsigned char>(p[5])) << 40;
+       [[gnu::fallthrough]];
+      case 5:
+       result |= std::size_t(static_cast<unsigned char>(p[4])) << 32;
+       [[gnu::fallthrough]];
+      case 4:
+       result |= std::size_t(static_cast<unsigned char>(p[3])) << 24;
+       [[gnu::fallthrough]];
+      case 3:
+       result |= std::size_t(static_cast<unsigned char>(p[2])) << 16;
+       [[gnu::fallthrough]];
+      case 2:
+       result |= std::size_t(static_cast<unsigned char>(p[1])) << 8;
+       [[gnu::fallthrough]];
+      case 1:
+       result |= std::size_t(static_cast<unsigned char>(p[0]));
+      }
     return result;
   }
 
-- 
2.43.5
