Re: [PATCH] libstdc++: Unroll loop in load_bytes function

Dmitry Ilvokhin Thu, 03 Oct 2024 05:26:15 -0700

On Wed, Oct 02, 2024 at 08:15:38PM +0100, Jonathan Wakely wrote:
> On Wed, 2 Oct 2024 at 19:25, Jonathan Wakely <jwak...@redhat.com> wrote:
> >
> > On Wed, 2 Oct 2024 at 19:16, Jonathan Wakely <jwak...@redhat.com> wrote:
> > >
> > > On Wed, 2 Oct 2024 at 19:15, Dmitry Ilvokhin <d...@ilvokhin.com> wrote:
> > > >
> > > > Instead of looping over every byte of the tail, unroll loop manually
> > > > using switch statement, then compilers (at least GCC and Clang) will
> > > > generate a jump table [1], which is faster on a microbenchmark [2].
> > > >
> > > > [1]: https://godbolt.org/z/aE8Mq3j5G
> > > > [2]: https://quick-bench.com/q/ylYLW2R22AZKRvameYYtbYxag24
> > > >
> > > > libstdc++-v3/ChangeLog:
> > > >
> > > >         * libstdc++-v3/libsupc++/hash_bytes.cc (load_bytes): unroll
> > > >           loop using switch statement.
> > > >
> > > > Signed-off-by: Dmitry Ilvokhin <d...@ilvokhin.com>
> > > > ---
> > > >  libstdc++-v3/libsupc++/hash_bytes.cc | 27 +++++++++++++++++++++++----
> > > >  1 file changed, 23 insertions(+), 4 deletions(-)
> > > >
> > > > diff --git a/libstdc++-v3/libsupc++/hash_bytes.cc 
> > > > b/libstdc++-v3/libsupc++/hash_bytes.cc
> > > > index 3665375096a..294a7323dd0 100644
> > > > --- a/libstdc++-v3/libsupc++/hash_bytes.cc
> > > > +++ b/libstdc++-v3/libsupc++/hash_bytes.cc
> > > > @@ -50,10 +50,29 @@ namespace
> > > >    load_bytes(const char* p, int n)
> > > >    {
> > > >      std::size_t result = 0;
> > > > -    --n;
> > > > -    do
> > > > -      result = (result << 8) + static_cast<unsigned char>(p[n]);
> > > > -    while (--n >= 0);
> > >
> > > Don't we still need to loop, for the case where n >= 8? Otherwise we
> > > only hash the first 8 bytes.
> >
> > Ah, but it's only ever called with load_bytes(end, len & 0x7)
> 
> It seems to be slower for short strings, but a win overall:
> https://quick-bench.com/q/xhh5m1akZzwUAXRiYJ17z9FASc8
> This measures different lengths, and tries to ensure that the string
> contents aren't treated as constant.


Nice find, thanks for spotting it.

In retrospect, I think case n = 7 is best for unrolled switch version
and worst for loop version, because it maximize overhead from loop
control flow instructions. In contrast, case n = 1 is completely
opposite in that sense, we have minimal overhead from loop control flow.

If explanation above is correct, then not sure we can do something
better for small cases. Seems like this lunch is not completely free.

> 
> >
> >
> > >
> > > > +    switch(n & 7)
> > > > +      {
> > > > +      case 7:
> > > > +       result |= std::size_t(p[6]) << 48;
> > > > +       [[gnu::fallthrough]];
> > > > +      case 6:
> > > > +       result |= std::size_t(p[5]) << 40;
> > > > +       [[gnu::fallthrough]];
> > > > +      case 5:
> > > > +       result |= std::size_t(p[4]) << 32;
> > > > +       [[gnu::fallthrough]];
> > > > +      case 4:
> > > > +       result |= std::size_t(p[3]) << 24;
> > > > +       [[gnu::fallthrough]];
> > > > +      case 3:
> > > > +       result |= std::size_t(p[2]) << 16;
> > > > +       [[gnu::fallthrough]];
> > > > +      case 2:
> > > > +       result |= std::size_t(p[1]) << 8;
> > > > +       [[gnu::fallthrough]];
> > > > +      case 1:
> > > > +       result |= std::size_t(p[0]);
> > > > +      };
> > > >      return result;
> > > >    }
> > > >
> > > > --
> > > > 2.43.5
> > > >
>

Forgot to CC mailing lists. Sorry about that.

Re: [PATCH] libstdc++: Unroll loop in load_bytes function

Reply via email to