On Thu, Aug 8, 2024 at 7:44 PM Alexander Monakov <amona...@ispras.ru> wrote:
>
>
> > On Wed, 7 Aug 2024, Richard Biener wrote:
> >
> > > OK with that change.
> > >
> > > Did you think about an AVX512 version (possibly with 32-byte vectors)?
> > > In case there's a more efficient variant of pshufb/pmovmskb available
> > > there, the load on the branch unit could possibly be lessened by using
> > > masking.
> >
> > Thanks for the idea; unfortunately I don't see any possible improvement.
> > It would trade pmovmskb-(test+jcc,fused) for ktest-jcc, so unless the
> > latencies are shorter it seems to be a wash. The only way to use fewer
> > branches seems to be employing longer vectors.
>
> A better answer is that of course we can reduce branching without AVX2
> by using two SSE vectors in place of one 32-byte AVX vector. In fact,
> with a bit of TLC this idea works out really well, and the result
> closely matches the AVX2 code in performance. To put that in numbers,
> on the testcase from my microbenchmark archive, unmodified GCC shows
>
>  Performance counter stats for 'gcc/cc1plus.orig -fsyntax-only -quiet t-rawstr.cc' (9 runs):
>
>              39.13 msec task-clock:u                     #    1.101 CPUs utilized               ( +-  0.30% )
>                  0      context-switches:u               #    0.000 /sec
>                  0      cpu-migrations:u                 #    0.000 /sec
>              2,206      page-faults:u                    #   56.374 K/sec                      ( +-  2.58% )
>         61,502,159      cycles:u                         #    1.572 GHz                        ( +-  0.08% )
>            749,841      stalled-cycles-frontend:u        #    1.22% frontend cycles idle       ( +-  0.20% )
>          6,831,862      stalled-cycles-backend:u         #   11.11% backend cycles idle        ( +-  0.63% )
>        141,972,604      instructions:u                   #    2.31  insn per cycle
>                                                          #    0.05  stalled cycles per insn    ( +-  0.00% )
>         46,054,279      branches:u                       #    1.177 G/sec                      ( +-  0.00% )
>            325,134      branch-misses:u                  #    0.71% of all branches            ( +-  0.11% )
>
>           0.035550 +- 0.000373 seconds time elapsed  ( +-  1.05% )
>
>
> then with the AVX2 helper from my patchset we have
>
>  Performance counter stats for 'gcc/cc1plus.avx -fsyntax-only -quiet t-rawstr.cc' (9 runs):
>
>              36.39 msec task-clock:u                     #    1.112 CPUs utilized               ( +-  0.27% )
>                  0      context-switches:u               #    0.000 /sec
>                  0      cpu-migrations:u                 #    0.000 /sec
>              2,208      page-faults:u                    #   60.677 K/sec                      ( +-  2.55% )
>         56,527,349      cycles:u                         #    1.553 GHz                        ( +-  0.09% )
>            728,417      stalled-cycles-frontend:u        #    1.29% frontend cycles idle       ( +-  0.38% )
>          6,221,761      stalled-cycles-backend:u         #   11.01% backend cycles idle        ( +-  1.58% )
>        141,296,340      instructions:u                   #    2.50  insn per cycle
>                                                          #    0.04  stalled cycles per insn    ( +-  0.00% )
>         45,758,162      branches:u                       #    1.257 G/sec                      ( +-  0.00% )
>            295,042      branch-misses:u                  #    0.64% of all branches            ( +-  0.12% )
>
>           0.032736 +- 0.000460 seconds time elapsed  ( +-  1.41% )
>
>
> and with the revised patch that uses SSSE3 more cleverly
>
>  Performance counter stats for 'gcc/cc1plus -fsyntax-only -quiet t-rawstr.cc' (9 runs):
>
>              36.89 msec task-clock:u                     #    1.110 CPUs utilized               ( +-  0.29% )
>                  0      context-switches:u               #    0.000 /sec
>                  0      cpu-migrations:u                 #    0.000 /sec
>              2,374      page-faults:u                    #   64.349 K/sec                      ( +-  3.77% )
>         56,556,237      cycles:u                         #    1.533 GHz                        ( +-  0.11% )
>            733,192      stalled-cycles-frontend:u        #    1.30% frontend cycles idle       ( +-  1.08% )
>          6,271,987      stalled-cycles-backend:u         #   11.09% backend cycles idle        ( +-  1.89% )
>        142,743,102      instructions:u                   #    2.52  insn per cycle
>                                                          #    0.04  stalled cycles per insn    ( +-  0.00% )
>         45,646,829      branches:u                       #    1.237 G/sec                      ( +-  0.00% )
>            295,155      branch-misses:u                  #    0.65% of all branches            ( +-  0.11% )
>
>           0.033242 +- 0.000418 seconds time elapsed  ( +-  1.26% )
>
>
> Is the revised patch below still ok? I've rolled the configury changes into it,
> and dropped the (now unnecessary) AVX2 helper (and cpuid/xgetbv detection).

Yes - thanks.  Less code to maintain is always good.

Richard.

> ===8<===
>
> From: Alexander Monakov <amona...@ispras.ru>
> Subject: [PATCH v2] libcpp: replace SSE4.2 helper with an SSSE3 one
>
> Since the characters we are searching for (CR, LF, '\', '?') all have
> distinct ASCII codes mod 16, PSHUFB can help match them all at once.
>
> Directly use the new helper if __SSSE3__ is defined. It makes the other
> helpers unused, so mark them inline to prevent warnings.
>
> Rewrite and simplify init_vectorized_lexer.
>
> libcpp/ChangeLog:
>
>         * config.in: Regenerate.
>         * configure: Regenerate.
>         * configure.ac: Check for SSSE3 instead of SSE4.2.
>         * files.cc (read_file_guts): Bump padding to 64 if HAVE_SSSE3.
>         * lex.cc (search_line_acc_char): Mark inline, not "unused".
>         (search_line_sse2): Mark inline.
>         (search_line_sse42): Replace with...
>         (search_line_ssse3): ... this new function.  Adjust the use...
>         (init_vectorized_lexer): ... here.  Simplify.
> ---
>  libcpp/config.in    |   4 +-
>  libcpp/configure    |   4 +-
>  libcpp/configure.ac |   6 +-
>  libcpp/files.cc     |  19 +++---
>  libcpp/lex.cc       | 150 ++++++++++++++++----------------------------
>  5 files changed, 73 insertions(+), 110 deletions(-)
>
> diff --git a/libcpp/config.in b/libcpp/config.in
> index 253ef03a3d..b2e2f4e842 100644
> --- a/libcpp/config.in
> +++ b/libcpp/config.in
> @@ -210,8 +210,8 @@
>  /* Define to 1 if you have the `putc_unlocked' function. */
>  #undef HAVE_PUTC_UNLOCKED
>
> -/* Define to 1 if you can assemble SSE4 insns. */
> -#undef HAVE_SSE4
> +/* Define to 1 if you can assemble SSSE3 insns. */
> +#undef HAVE_SSSE3
>
>  /* Define to 1 if you have the <stddef.h> header file. */
>  #undef HAVE_STDDEF_H
> diff --git a/libcpp/configure b/libcpp/configure
> index 32d6aaa306..1391081ba0 100755
> --- a/libcpp/configure
> +++ b/libcpp/configure
> @@ -9140,14 +9140,14 @@ case $target in
>  int
>  main ()
>  {
> -asm ("pcmpestri %0, %%xmm0, %%xmm1" : : "i"(0))
> +asm ("pshufb %xmm0, %xmm1")
>    ;
>    return 0;
>  }
>  _ACEOF
>  if ac_fn_c_try_compile "$LINENO"; then :
>
> -$as_echo "#define HAVE_SSE4 1" >>confdefs.h
> +$as_echo "#define HAVE_SSSE3 1" >>confdefs.h
>
>  fi
>  rm -f core conftest.err conftest.$ac_objext conftest.$ac_ext
> diff --git a/libcpp/configure.ac b/libcpp/configure.ac
> index b883fec776..981f97c4ab 100644
> --- a/libcpp/configure.ac
> +++ b/libcpp/configure.ac
> @@ -197,9 +197,9 @@ fi
>
>  case $target in
>    i?86-* | x86_64-*)
> -    AC_TRY_COMPILE([], [asm ("pcmpestri %0, %%xmm0, %%xmm1" : : "i"(0))],
> -      [AC_DEFINE([HAVE_SSE4], [1],
> -                [Define to 1 if you can assemble SSE4 insns.])])
> +    AC_TRY_COMPILE([], [asm ("pshufb %xmm0, %xmm1")],
> +      [AC_DEFINE([HAVE_SSSE3], [1],
> +                [Define to 1 if you can assemble SSSE3 insns.])])
>  esac
>
>  # Enable --enable-host-shared.
> diff --git a/libcpp/files.cc b/libcpp/files.cc
> index 78f56e30bd..3775091d25 100644
> --- a/libcpp/files.cc
> +++ b/libcpp/files.cc
> @@ -693,7 +693,7 @@ static bool
>  read_file_guts (cpp_reader *pfile, _cpp_file *file, location_t loc,
>                 const char *input_charset)
>  {
> -  ssize_t size, total, count;
> +  ssize_t size, pad, total, count;
>    uchar *buf;
>    bool regular;
>
> @@ -732,11 +732,14 @@ read_file_guts (cpp_reader *pfile, _cpp_file *file, location_t loc,
>         the majority of C source files.  */
>      size = 8 * 1024;
>
> -  /* The + 16 here is space for the final '\n' and 15 bytes of padding,
> -     used to quiet warnings from valgrind or Address Sanitizer, when the
> -     optimized lexer accesses aligned 16-byte memory chunks, including
> -     the bytes after the malloced, area, and stops lexing on '\n'.  */
> -  buf = XNEWVEC (uchar, size + 16);
> +#ifdef HAVE_SSSE3
> +  pad = 64;
> +#else
> +  pad = 16;
> +#endif
> +  /* The '+ PAD' here is space for the final '\n' and PAD-1 bytes of padding,
> +     allowing search_line_fast to use (possibly misaligned) vector loads.  */
> +  buf = XNEWVEC (uchar, size + pad);
>    total = 0;
>    while ((count = read (file->fd, buf + total, size - total)) > 0)
>      {
> @@ -747,7 +750,7 @@ read_file_guts (cpp_reader *pfile, _cpp_file *file, location_t loc,
>           if (regular)
>             break;
>           size *= 2;
> -         buf = XRESIZEVEC (uchar, buf, size + 16);
> +         buf = XRESIZEVEC (uchar, buf, size + pad);
>         }
>      }
>
> @@ -765,7 +768,7 @@ read_file_guts (cpp_reader *pfile, _cpp_file *file, location_t loc,
>
>    file->buffer = _cpp_convert_input (pfile,
>                                      input_charset,
> -                                    buf, size + 16, total,
> +                                    buf, size + pad, total,
>                                      &file->buffer_start,
>                                      &file->st.st_size);
>    file->buffer_valid = file->buffer;
> diff --git a/libcpp/lex.cc b/libcpp/lex.cc
> index 1591dcdf15..daf2c770bc 100644
> --- a/libcpp/lex.cc
> +++ b/libcpp/lex.cc
> @@ -225,10 +225,7 @@ acc_char_index (word_type cmp ATTRIBUTE_UNUSED,
>     and branches without increasing the number of arithmetic operations.
>     It's almost certainly going to be a win with 64-bit word size.  */
>
> -static const uchar * search_line_acc_char (const uchar *, const uchar *)
> -  ATTRIBUTE_UNUSED;
> -
> -static const uchar *
> +static inline const uchar *
>  search_line_acc_char (const uchar *s, const uchar *end ATTRIBUTE_UNUSED)
>  {
>    const word_type repl_nl = acc_char_replicate ('\n');
> @@ -293,7 +290,7 @@ static const char repl_chars[4][16] __attribute__((aligned(16))) = {
>
>  /* A version of the fast scanner using SSE2 vectorized byte compare insns.  */
>
> -static const uchar *
> +static inline const uchar *
>  #ifndef __SSE2__
>  __attribute__((__target__("sse2")))
>  #endif
> @@ -344,120 +341,83 @@ search_line_sse2 (const uchar *s, const uchar *end ATTRIBUTE_UNUSED)
>    return (const uchar *)p + found;
>  }
>
> -#ifdef HAVE_SSE4
> -/* A version of the fast scanner using SSE 4.2 vectorized string insns.  */
> +#ifdef HAVE_SSSE3
> +/* A version of the fast scanner using SSSE3 shuffle (PSHUFB) insns.  */
>
> -static const uchar *
> -#ifndef __SSE4_2__
> -__attribute__((__target__("sse4.2")))
> +static inline const uchar *
> +#ifndef __SSSE3__
> +__attribute__((__target__("ssse3")))
>  #endif
> -search_line_sse42 (const uchar *s, const uchar *end)
> +search_line_ssse3 (const uchar *s, const uchar *end ATTRIBUTE_UNUSED)
>  {
>    typedef char v16qi __attribute__ ((__vector_size__ (16)));
> -  static const v16qi search = { '\n', '\r', '?', '\\' };
> -
> -  uintptr_t si = (uintptr_t)s;
> -  uintptr_t index;
> -
> -  /* Check for unaligned input.  */
> -  if (si & 15)
> -    {
> -      v16qi sv;
> -
> -      if (__builtin_expect (end - s < 16, 0)
> -         && __builtin_expect ((si & 0xfff) > 0xff0, 0))
> -       {
> -         /* There are less than 16 bytes left in the buffer, and less
> -            than 16 bytes left on the page.  Reading 16 bytes at this
> -            point might generate a spurious page fault.  Defer to the
> -            SSE2 implementation, which already handles alignment.  */
> -         return search_line_sse2 (s, end);
> -       }
> -
> -      /* ??? The builtin doesn't understand that the PCMPESTRI read from
> -        memory need not be aligned.  */
> -      sv = __builtin_ia32_loaddqu ((const char *) s);
> -      index = __builtin_ia32_pcmpestri128 (search, 4, sv, 16, 0);
> -
> -      if (__builtin_expect (index < 16, 0))
> -       goto found;
> -
> -      /* Advance the pointer to an aligned address.  We will re-scan a
> -        few bytes, but we no longer need care for reading past the
> -        end of a page, since we're guaranteed a match.  */
> -      s = (const uchar *)((si + 15) & -16);
> -    }
> -
> -  /* Main loop, processing 16 bytes at a time.  */
> -#ifdef __GCC_ASM_FLAG_OUTPUTS__
> -  while (1)
> +  typedef v16qi v16qi_u __attribute__ ((__aligned__ (1)));
> +  /* Helper vector for pshufb-based matching:
> +     each character C we're searching for is at position (C % 16).  */
> +  v16qi lut = { 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, '\n', 0, '\\', '\r', 0, '?' };
> +  static_assert ('\n' == 10 && '\r' == 13 && '\\' == 92 && '?' == 63);
> +
> +  v16qi d1, d2, t1, t2;
> +  /* Unaligned loads.  Reading beyond the final newline is safe,
> +     since files.cc:read_file_guts pads the allocation.  */
> +  d1 = *(const v16qi_u *)s;
> +  d2 = *(const v16qi_u *)(s + 16);
> +  unsigned m1, m2, found;
> +  /* Process two 16-byte chunks per iteration.  */
> +  do
>      {
> -      char f;
> -
> -      /* By using inline assembly instead of the builtin,
> -        we can use the result, as well as the flags set.  */
> -      __asm ("%vpcmpestri\t$0, %2, %3"
> -            : "=c"(index), "=@ccc"(f)
> -            : "m"(*s), "x"(search), "a"(4), "d"(16));
> -      if (f)
> -       break;
> -
> -      s += 16;
> +      t1 = __builtin_ia32_pshufb128 (lut, d1);
> +      t2 = __builtin_ia32_pshufb128 (lut, d2);
> +      m1 = __builtin_ia32_pmovmskb128 (t1 == d1);
> +      m2 = __builtin_ia32_pmovmskb128 (t2 == d2);
> +      s += 32;
> +      d1 = *(const v16qi_u *)s;
> +      d2 = *(const v16qi_u *)(s + 16);
> +      found = m1 + (m2 << 16);
>      }
> -#else
> -  s -= 16;
> -  /* By doing the whole loop in inline assembly,
> -     we can make proper use of the flags set.  */
> -  __asm (      ".balign 16\n"
> -       "0:     add $16, %1\n"
> -       "       %vpcmpestri\t$0, (%1), %2\n"
> -       "       jnc 0b"
> -       : "=&c"(index), "+r"(s)
> -       : "x"(search), "a"(4), "d"(16));
> -#endif
> -
> - found:
> -  return s + index;
> +  while (!found);
> +  /* Prefer to compute 's - 32' here, not spend an extra instruction
> +     to make a copy of the previous value of 's' in the loop.  */
> +  __asm__ ("" : "+r"(s));
> +  return s - 32 + __builtin_ctz (found);
>  }
>
>  #else
> -/* Work around out-dated assemblers without sse4 support.  */
> -#define search_line_sse42 search_line_sse2
> +/* Work around out-dated assemblers without SSSE3 support.  */
> +#define search_line_ssse3 search_line_sse2
>  #endif
>
> +#ifdef __SSSE3__
> +/* No need for CPU probing, just use the best available variant.  */
> +#define search_line_fast search_line_ssse3
> +#else
>  /* Check the CPU capabilities.  */
>
>  #include "../gcc/config/i386/cpuid.h"
>
>  typedef const uchar * (*search_line_fast_type) (const uchar *, const uchar *);
> -static search_line_fast_type search_line_fast;
> +static search_line_fast_type search_line_fast
> +#if defined(__SSE2__)
> + = search_line_sse2;
> +#else
> + = search_line_acc_char;
> +#endif
>
>  #define HAVE_init_vectorized_lexer 1
>  static inline void
>  init_vectorized_lexer (void)
>  {
> -  unsigned dummy, ecx = 0, edx = 0;
> -  search_line_fast_type impl = search_line_acc_char;
> -  int minimum = 0;
> -
> -#if defined(__SSE4_2__)
> -  minimum = 3;
> -#elif defined(__SSE2__)
> -  minimum = 2;
> -#endif
> +  unsigned ax, bx, cx, dx;
>
> -  if (minimum == 3)
> -    impl = search_line_sse42;
> -  else if (__get_cpuid (1, &dummy, &dummy, &ecx, &edx) || minimum == 2)
> -    {
> -      if (minimum == 3 || (ecx & bit_SSE4_2))
> -        impl = search_line_sse42;
> -      else if (minimum == 2 || (edx & bit_SSE2))
> -       impl = search_line_sse2;
> -    }
> +  if (!__get_cpuid (1, &ax, &bx, &cx, &dx))
> +    return;
>
> -  search_line_fast = impl;
> +  if (cx & bit_SSSE3)
> +    search_line_fast = search_line_ssse3;
> +  else if (dx & bit_SSE2)
> +    search_line_fast = search_line_sse2;
>  }
> +#endif
>
>  #elif (GCC_VERSION >= 4005) && defined(_ARCH_PWR8) && defined(__ALTIVEC__)
>
> --
> 2.44.0
>
