Re: [PATCH] libstdc++: Unroll loop in load_bytes function

Jonathan Wakely Fri, 04 Oct 2024 06:07:46 -0700

On Fri, 4 Oct 2024 at 13:53, Dmitry Ilvokhin <d...@ilvokhin.com> wrote:
>
> On Fri, Oct 04, 2024 at 10:20:27AM +0100, Jonathan Wakely wrote:
> > On Fri, 4 Oct 2024 at 10:19, Jonathan Wakely <jwak...@redhat.com> wrote:
> > >
> > > On Fri, 4 Oct 2024 at 07:53, Richard Biener <richard.guent...@gmail.com> 
> > > wrote:
> > > >
> > > > On Wed, Oct 2, 2024 at 8:26 PM Jonathan Wakely <jwak...@redhat.com> 
> > > > wrote:
> > > > >
> > > > > On Wed, 2 Oct 2024 at 19:16, Jonathan Wakely <jwak...@redhat.com> 
> > > > > wrote:
> > > > > >
> > > > > > On Wed, 2 Oct 2024 at 19:15, Dmitry Ilvokhin <d...@ilvokhin.com> 
> > > > > > wrote:
> > > > > > >
> > > > > > > Instead of looping over every byte of the tail, unroll loop 
> > > > > > > manually
> > > > > > > using switch statement, then compilers (at least GCC and Clang) 
> > > > > > > will
> > > > > > > generate a jump table [1], which is faster on a microbenchmark 
> > > > > > > [2].
> > > > > > >
> > > > > > > [1]: https://godbolt.org/z/aE8Mq3j5G
> > > > > > > [2]: https://quick-bench.com/q/ylYLW2R22AZKRvameYYtbYxag24
> > > > > > >
> > > > > > > libstdc++-v3/ChangeLog:
> > > > > > >
> > > > > > >         * libstdc++-v3/libsupc++/hash_bytes.cc (load_bytes): 
> > > > > > > unroll
> > > > > > >           loop using switch statement.
> > > > > > >
> > > > > > > Signed-off-by: Dmitry Ilvokhin <d...@ilvokhin.com>
> > > > > > > ---
> > > > > > >  libstdc++-v3/libsupc++/hash_bytes.cc | 27 
> > > > > > > +++++++++++++++++++++++----
> > > > > > >  1 file changed, 23 insertions(+), 4 deletions(-)
> > > > > > >
> > > > > > > diff --git a/libstdc++-v3/libsupc++/hash_bytes.cc 
> > > > > > > b/libstdc++-v3/libsupc++/hash_bytes.cc
> > > > > > > index 3665375096a..294a7323dd0 100644
> > > > > > > --- a/libstdc++-v3/libsupc++/hash_bytes.cc
> > > > > > > +++ b/libstdc++-v3/libsupc++/hash_bytes.cc
> > > > > > > @@ -50,10 +50,29 @@ namespace
> > > > > > >    load_bytes(const char* p, int n)
> > > > > > >    {
> > > > > > >      std::size_t result = 0;
> > > > > > > -    --n;
> > > > > > > -    do
> > > > > > > -      result = (result << 8) + static_cast<unsigned char>(p[n]);
> > > > > > > -    while (--n >= 0);
> > > > > >
> > > > > > Don't we still need to loop, for the case where n >= 8? Otherwise we
> > > > > > only hash the first 8 bytes.
> > > > >
> > > > > Ah, but it's only ever called with load_bytes(end, len & 0x7)
> > > >
> > > > The compiler should do such transforms - you probably want to tell
> > > > it that n < 8 though, it likely doesn't (always) know.
> > >
> > > e.g. like this?
> > >
> > >     if ((n & 7) != n)
> > >       __builtin_unreachable();
> > >
> > > For the microbenchmark that seems to make things consistently worse:
> > > https://quick-bench.com/q/2yCEqzFS8R8ueJ0-Gs-sZ6uWWEw
> >
> > Oh actually in the benchmark I used (!(1 <= n && n < 8)) because 1 <=
> > n is always true too.
> >
>
> GCC still wasn't able to unroll the loop, even with a
> __builtin_unreachable, but benchmark link you mentioned above uses -O2
> optimization level (not sure if it was intentional).


That was intentional, because that's how libsupc++/hash_bytes.cc gets compiled.

>
> If we'll use -O3 [1], then GCC was able to unroll the loop for
> load_bytes_loop_assume version, but at the same time I am not sure all
> loop control instructions were elided, I still can see them on Godbolt
> version of generated code [2]. Benchmark charts partially confirm that,
> because performance of load_bytes_loop and load_bytes_loop_assume are
> now quite close (same actually, except case n = 1). I guess it would
> make sense, as we execute same amount of instructions.
>
> In addition, chart for load_bytes_switch look quite jumpy for [1] and
> became better for cases n = 1 and n = 2. At this point I am not sure it
> is not a code alignment issue and we are not measuring noise.
>
> [1]: https://quick-bench.com/q/LlcgMVhL61CasZVjCWbHd3uid8w
> [2]: https://godbolt.org/z/qPf1n7xWs
>

Re: [PATCH] libstdc++: Unroll loop in load_bytes function

Reply via email to