On Fri, 4 Oct 2024 at 13:53, Dmitry Ilvokhin <d...@ilvokhin.com> wrote: > > On Fri, Oct 04, 2024 at 10:20:27AM +0100, Jonathan Wakely wrote: > > On Fri, 4 Oct 2024 at 10:19, Jonathan Wakely <jwak...@redhat.com> wrote: > > > > > > On Fri, 4 Oct 2024 at 07:53, Richard Biener <richard.guent...@gmail.com> > > > wrote: > > > > > > > > On Wed, Oct 2, 2024 at 8:26 PM Jonathan Wakely <jwak...@redhat.com> > > > > wrote: > > > > > > > > > > On Wed, 2 Oct 2024 at 19:16, Jonathan Wakely <jwak...@redhat.com> > > > > > wrote: > > > > > > > > > > > > On Wed, 2 Oct 2024 at 19:15, Dmitry Ilvokhin <d...@ilvokhin.com> > > > > > > wrote: > > > > > > > > > > > > > > Instead of looping over every byte of the tail, unroll loop > > > > > > > manually > > > > > > > using switch statement, then compilers (at least GCC and Clang) > > > > > > > will > > > > > > > generate a jump table [1], which is faster on a microbenchmark > > > > > > > [2]. > > > > > > > > > > > > > > [1]: https://godbolt.org/z/aE8Mq3j5G > > > > > > > [2]: https://quick-bench.com/q/ylYLW2R22AZKRvameYYtbYxag24 > > > > > > > > > > > > > > libstdc++-v3/ChangeLog: > > > > > > > > > > > > > > * libstdc++-v3/libsupc++/hash_bytes.cc (load_bytes): > > > > > > > unroll > > > > > > > loop using switch statement. > > > > > > > > > > > > > > Signed-off-by: Dmitry Ilvokhin <d...@ilvokhin.com> > > > > > > > --- > > > > > > > libstdc++-v3/libsupc++/hash_bytes.cc | 27 > > > > > > > +++++++++++++++++++++++---- > > > > > > > 1 file changed, 23 insertions(+), 4 deletions(-) > > > > > > > > > > > > > > diff --git a/libstdc++-v3/libsupc++/hash_bytes.cc > > > > > > > b/libstdc++-v3/libsupc++/hash_bytes.cc > > > > > > > index 3665375096a..294a7323dd0 100644 > > > > > > > --- a/libstdc++-v3/libsupc++/hash_bytes.cc > > > > > > > +++ b/libstdc++-v3/libsupc++/hash_bytes.cc > > > > > > > @@ -50,10 +50,29 @@ namespace > > > > > > > load_bytes(const char* p, int n) > > > > > > > { > > > > > > > std::size_t result = 0; > > > > > > > - --n; > > > > > > > - do > > > > > > > - result = (result << 8) + static_cast<unsigned char>(p[n]); > > > > > > > - while (--n >= 0); > > > > > > > > > > > > Don't we still need to loop, for the case where n >= 8? Otherwise we > > > > > > only hash the first 8 bytes. > > > > > > > > > > Ah, but it's only ever called with load_bytes(end, len & 0x7) > > > > > > > > The compiler should do such transforms - you probably want to tell > > > > it that n < 8 though, it likely doesn't (always) know. > > > > > > e.g. like this? > > > > > > if ((n & 7) != n) > > > __builtin_unreachable(); > > > > > > For the microbenchmark that seems to make things consistently worse: > > > https://quick-bench.com/q/2yCEqzFS8R8ueJ0-Gs-sZ6uWWEw > > > > Oh actually in the benchmark I used (!(1 <= n && n < 8)) because 1 <= > > n is always true too. > > > > GCC still wasn't able to unroll the loop, even with a > __builtin_unreachable, but benchmark link you mentioned above uses -O2 > optimization level (not sure if it was intentional).
That was intentional, because that's how libsupc++/hash_bytes.cc gets compiled. > > If we'll use -O3 [1], then GCC was able to unroll the loop for > load_bytes_loop_assume version, but at the same time I am not sure all > loop control instructions were elided, I still can see them on Godbolt > version of generated code [2]. Benchmark charts partially confirm that, > because performance of load_bytes_loop and load_bytes_loop_assume are > now quite close (same actually, except case n = 1). I guess it would > make sense, as we execute same amount of instructions. > > In addition, chart for load_bytes_switch look quite jumpy for [1] and > became better for cases n = 1 and n = 2. At this point I am not sure it > is not a code alignment issue and we are not measuring noise. > > [1]: https://quick-bench.com/q/LlcgMVhL61CasZVjCWbHd3uid8w > [2]: https://godbolt.org/z/qPf1n7xWs >