On Wed, Oct 02, 2024 at 08:15:38PM +0100, Jonathan Wakely wrote: > On Wed, 2 Oct 2024 at 19:25, Jonathan Wakely <jwak...@redhat.com> wrote: > > > > On Wed, 2 Oct 2024 at 19:16, Jonathan Wakely <jwak...@redhat.com> wrote: > > > > > > On Wed, 2 Oct 2024 at 19:15, Dmitry Ilvokhin <d...@ilvokhin.com> wrote: > > > > > > > > Instead of looping over every byte of the tail, unroll loop manually > > > > using switch statement, then compilers (at least GCC and Clang) will > > > > generate a jump table [1], which is faster on a microbenchmark [2]. > > > > > > > > [1]: https://godbolt.org/z/aE8Mq3j5G > > > > [2]: https://quick-bench.com/q/ylYLW2R22AZKRvameYYtbYxag24 > > > > > > > > libstdc++-v3/ChangeLog: > > > > > > > > * libstdc++-v3/libsupc++/hash_bytes.cc (load_bytes): unroll > > > > loop using switch statement. > > > > > > > > Signed-off-by: Dmitry Ilvokhin <d...@ilvokhin.com> > > > > --- > > > > libstdc++-v3/libsupc++/hash_bytes.cc | 27 +++++++++++++++++++++++---- > > > > 1 file changed, 23 insertions(+), 4 deletions(-) > > > > > > > > diff --git a/libstdc++-v3/libsupc++/hash_bytes.cc > > > > b/libstdc++-v3/libsupc++/hash_bytes.cc > > > > index 3665375096a..294a7323dd0 100644 > > > > --- a/libstdc++-v3/libsupc++/hash_bytes.cc > > > > +++ b/libstdc++-v3/libsupc++/hash_bytes.cc > > > > @@ -50,10 +50,29 @@ namespace > > > > load_bytes(const char* p, int n) > > > > { > > > > std::size_t result = 0; > > > > - --n; > > > > - do > > > > - result = (result << 8) + static_cast<unsigned char>(p[n]); > > > > - while (--n >= 0); > > > > > > Don't we still need to loop, for the case where n >= 8? Otherwise we > > > only hash the first 8 bytes. > > > > Ah, but it's only ever called with load_bytes(end, len & 0x7) > > It seems to be slower for short strings, but a win overall: > https://quick-bench.com/q/xhh5m1akZzwUAXRiYJ17z9FASc8 > This measures different lengths, and tries to ensure that the string > contents aren't treated as constant.
Nice find, thanks for spotting it. In retrospect, I think case n = 7 is best for unrolled switch version and worst for loop version, because it maximize overhead from loop control flow instructions. In contrast, case n = 1 is completely opposite in that sense, we have minimal overhead from loop control flow. If explanation above is correct, then not sure we can do something better for small cases. Seems like this lunch is not completely free. > > > > > > > > > > > > + switch(n & 7) > > > > + { > > > > + case 7: > > > > + result |= std::size_t(p[6]) << 48; > > > > + [[gnu::fallthrough]]; > > > > + case 6: > > > > + result |= std::size_t(p[5]) << 40; > > > > + [[gnu::fallthrough]]; > > > > + case 5: > > > > + result |= std::size_t(p[4]) << 32; > > > > + [[gnu::fallthrough]]; > > > > + case 4: > > > > + result |= std::size_t(p[3]) << 24; > > > > + [[gnu::fallthrough]]; > > > > + case 3: > > > > + result |= std::size_t(p[2]) << 16; > > > > + [[gnu::fallthrough]]; > > > > + case 2: > > > > + result |= std::size_t(p[1]) << 8; > > > > + [[gnu::fallthrough]]; > > > > + case 1: > > > > + result |= std::size_t(p[0]); > > > > + }; > > > > return result; > > > > } > > > > > > > > -- > > > > 2.43.5 > > > > > Forgot to CC mailing lists. Sorry about that.