On Tue, Mar 8, 2022 at 4:44 PM H.J. Lu <hjl.to...@gmail.com> wrote:
>
> On Mon, Mar 7, 2022 at 5:45 AM Richard Biener
> <richard.guent...@gmail.com> wrote:
> >
> > On Wed, Mar 2, 2022 at 10:18 PM H.J. Lu <hjl.to...@gmail.com> wrote:
> > >
> > > On Wed, Mar 02, 2022 at 09:51:26AM +0100, Richard Biener wrote:
> > > > On Tue, Mar 1, 2022 at 11:41 PM H.J. Lu via Gcc-patches
> > > > <gcc-patches@gcc.gnu.org> wrote:
> > > > >
> > > > > Add TARGET_FOLD_MEMCPY_MAX for the maximum number of bytes to fold 
> > > > > memcpy.
> > > > > The default is
> > > > >
> > > > > MOVE_MAX * MOVE_RATIO (optimize_function_for_size_p (cfun))
> > > > >
> > > > > For x86, it is MOVE_MAX to restore the old behavior before
> > > >
> > > > I know we've discussed this to death in the PR, I just want to repeat 
> > > > here
> > > > that the GIMPLE folding expects to generate a single load and a single
> > > > store (that is what it does on the GIMPLE level) which is why MOVE_MAX
> > > > was chosen originally (it's documented to what a "single instruction" 
> > > > does).
> > > > In practice MOVE_MAX does not seem to cover vector register sizes
> > > > so Richard pulled MOVE_RATIO which is really intended to cover
> > > > the case of using multiple instructions for moving memory (but then I
> > > > don't remember whether for the ARM case the single load/store GIMPLE
> > > > will be expanded to multiple load/store instructions).
> > > >
> > > > TARGET_FOLD_MEMCPY_MAX sounds like a stop-gap solution,
> > > > being very specific for memcpy folding (we also fold memmove btw).
> > > >
> > > > There is also MOVE_MAX_PIECES which _might_ be more appropriate
> > > > than MOVE_MAX here and still honor the idea of single instructions.
> > > > Now neither arm nor aarch64 define this and it defaults to MOVE_MAX,
> > > > not MOVE_MAX * MOVE_RATIO.
> > > >
> > > > So if we need a new hook then that hook should at least get the
> > > > 'speed' argument of MOVE_RATIO and it should get a better name.
> > > >
> > > > I still think that it should be possible to improve the insn check to
> > > > avoid use of "disabled" modes, maybe that's also a point to add
> > > > a new hook like .move_with_mode_p or so?  To quote, we do
> > >
> > > Here is the v2 patch to add TARGET_MOVE_WITH_MODE_P.
> >
> > Again I'd like to shine light on MOVE_MAX_PIECES which explicitely
> > mentions "a load or store used TO COPY MEMORY" (emphasis mine)
> > and whose x86 implementation would already be fine (doing larger moves
> > and also not doing too large moves).  But appearantly the arm folks
> > decided that that's not fit and instead (mis-?)used MOVE_MAX * MOVE_RATIO.
> >
> > Yes, MOVE_MAX_PIECES is documented to apply to move_by_pieces.
> > Still GIMPLE memcpy/memmove inlining wants to mimic exactly that but
> > restrict itself to a single load and a single store.
> >
> > > >
> > > >               scalar_int_mode mode;
> > > >               if (int_mode_for_size (ilen * 8, 0).exists (&mode)
> > > >                   && GET_MODE_SIZE (mode) * BITS_PER_UNIT == ilen * 8
> > > >                   && have_insn_for (SET, mode)
> > > >                   /* If the destination pointer is not aligned we must 
> > > > be able
> > > >                      to emit an unaligned store.  */
> > > >                   && (dest_align >= GET_MODE_ALIGNMENT (mode)
> > > >                       || !targetm.slow_unaligned_access (mode, 
> > > > dest_align)
> > > >                       || (optab_handler (movmisalign_optab, mode)
> > > >                           != CODE_FOR_nothing)))
> > > >
> > > > where I understand the ISA is enabled and if the user explicitely
> > > > uses it that's OK but -mprefer-avx128 should tell GCC to never
> > > > generate AVX256 code where the user was not explicitely using it
> > > > (still for example glibc might happily use AVX256 code to implement
> > > > the memcpy we are folding!)
> > > >
> > > > Note the BB vectorizer also might end up with using AVX256 because
> > > > in places it also relies on optab queries and the 
> > > > vector_mode_supported_p
> > > > check (but the memcpy folding uses the fake integer modes).  So
> > > > x86 might need to implement the related_mode hook to avoid "auto"-using
> > > > a larger vector mode which the default implementation would happily do.
> > > >
> > > > Richard.
> > >
> > > OK for master?
> >
> > Looking for opinions from others as well.
> >
> > Btw, there's a similar use in expand_DEFERRED_INIT:
> >
> >           && int_mode_for_size (tree_to_uhwi (var_size) * BITS_PER_UNIT,
> >                                 0).exists (&var_mode)
> >           && have_insn_for (SET, var_mode))
> >
> > So it occured to me that maybe targetm.move_with_mode_p should eventually
>
> TARGET_MOVE_WITH_MODE_P shouldn't be used here since we do want to
> use var_mode.
>
> > check have_insn_for (SET, var_mode) or we should abstract checking the two
> > things to a generic API somewhere (in optabs-query.h maybe, or expmed.h,
> > not sure where it would be more appropriate).
> >
> > > +@deftypefn {Target Hook} bool TARGET_MOVE_WITH_MODE_P (machine_mode 
> > > @var{mode})
> > > +This target hook returns true if move with mode @var{mode} can be
> > > +generated implicitly.  The default definition returns true.
> > > +@end deftypefn
> >
> > I know what you mean but I'm not sure "can be generated implicitly" captures
> > that.  The compiler always generated stuff explicitely.  It's also
> > "should", not "can", no?
>
> TARGET_MOVE_WITH_MODE_P is a target hook for compiler optimization.
> It is totally OK for TARGET_MOVE_WITH_MODE_P to return false.   Will
> TARGET_FAST_MOVE_WITH_MODE_P be better?

I thought the PR to address was about -mprefer-avx128 but memcpy using
YMM regs?  So it's not about "fast" it's about fulfilling user expectations, no?

As said elsewhere the vectorizer gets to honor -mprefer-avx128 via the
vector_modes hook to iterate over but also with the related_mode hook
(which x86 chooses to not implement).  It sounds like we need something
similar for integer modes (aka "move" modes), but then does x86 "honor"
-mprefer-avx128 when doing *_by_pieces?  And how does it do that there?

Richard.

> > Thanks,
> > Richard.
> >
> > > Thanks.
> > >
> > > H.J.
> > > ---
> > > Add TARGET_MOVE_WITH_MODE_P to return true if move with mode can be
> > > generated implicitly.  The default definition returns true.  The x86
> > > version returns true if the mode size <= MOVE_MAX, which is the max
> > > number of bytes we can move in one reasonably fast instruction.
> > >
> > > gcc/
> > >
> > >         PR target/103393
> > >         * gimple-fold.cc (gimple_fold_builtin_memory_op): Call
> > >         targetm.move_with_mode_p to check if move with mode can be
> > >         generated implicitly.
> > >         * target.def: Add move_with_mode_p.
> > >         * targhooks.cc (default_move_with_mode_p): New.
> > >         * targhooks.h (default_move_with_mode_p): Likewise.
> > >         * config/i386/i386.cc (ix86_move_with_mode_p): New.
> > >         (TARGET_MOVE_WITH_MODE_P): Likewise.
> > >         * doc/tm.texi.in: Add TARGET_MOVE_WITH_MODE_P.
> > >         * doc/tm.texi: Regenerate.
> > >
> > > gcc/testsuite/
> > >
> > >         PR target/103393
> > >         * gcc.target/i386/pr103393-1.c: New test.
> > >         * gcc.target/i386/pr103393-2.c: Likewise.
> > >         * gcc.target/i386/pr103393-3.c: Likewise.
> > >         * gcc.target/i386/pr103393-4.c: Likewise.
> > >         * gcc.target/i386/pr103393-5.c: Likewise.
> > > ---
> > >  gcc/config/i386/i386.cc                    | 12 ++++++++++++
> > >  gcc/doc/tm.texi                            |  5 +++++
> > >  gcc/doc/tm.texi.in                         |  2 ++
> > >  gcc/gimple-fold.cc                         |  1 +
> > >  gcc/target.def                             |  7 +++++++
> > >  gcc/testsuite/gcc.target/i386/pr103393-1.c | 16 ++++++++++++++++
> > >  gcc/testsuite/gcc.target/i386/pr103393-2.c | 16 ++++++++++++++++
> > >  gcc/testsuite/gcc.target/i386/pr103393-3.c | 16 ++++++++++++++++
> > >  gcc/testsuite/gcc.target/i386/pr103393-4.c | 16 ++++++++++++++++
> > >  gcc/testsuite/gcc.target/i386/pr103393-5.c | 16 ++++++++++++++++
> > >  10 files changed, 107 insertions(+)
> > >  create mode 100644 gcc/testsuite/gcc.target/i386/pr103393-1.c
> > >  create mode 100644 gcc/testsuite/gcc.target/i386/pr103393-2.c
> > >  create mode 100644 gcc/testsuite/gcc.target/i386/pr103393-3.c
> > >  create mode 100644 gcc/testsuite/gcc.target/i386/pr103393-4.c
> > >  create mode 100644 gcc/testsuite/gcc.target/i386/pr103393-5.c
> > >
> > > diff --git a/gcc/config/i386/i386.cc b/gcc/config/i386/i386.cc
> > > index b2bf90576d5..68a2c59220c 100644
> > > --- a/gcc/config/i386/i386.cc
> > > +++ b/gcc/config/i386/i386.cc
> > > @@ -23918,6 +23918,15 @@ ix86_push_rounding (poly_int64 bytes)
> > >    return ROUND_UP (bytes, UNITS_PER_WORD);
> > >  }
> > >
> > > +/* Implement the TARGET_MOVE_WITH_MODE_P hook.  Return true if move
> > > +   with MODE can be generated implicitly.  */
> > > +
> > > +static bool
> > > +ix86_move_with_mode_p (machine_mode mode)
> > > +{
> > > +  return GET_MODE_SIZE (mode) <= MOVE_MAX;
> > > +}
> > > +
> > >  /* Target-specific selftests.  */
> > >
> > >  #if CHECKING_P
> > > @@ -24735,6 +24744,9 @@ static bool ix86_libc_has_fast_function (int 
> > > fcode ATTRIBUTE_UNUSED)
> > >  #define TARGET_RUN_TARGET_SELFTESTS selftest::ix86_run_selftests
> > >  #endif /* #if CHECKING_P */
> > >
> > > +#undef TARGET_MOVE_WITH_MODE_P
> > > +#define TARGET_MOVE_WITH_MODE_P ix86_move_with_mode_p
> > > +
> > >  struct gcc_target targetm = TARGET_INITIALIZER;
> > >
> > >  #include "gt-i386.h"
> > > diff --git a/gcc/doc/tm.texi b/gcc/doc/tm.texi
> > > index 49864dd79f8..9d5058610e1 100644
> > > --- a/gcc/doc/tm.texi
> > > +++ b/gcc/doc/tm.texi
> > > @@ -11924,6 +11924,11 @@ statement holding the function call.  Returns 
> > > true if any change
> > >  was made to the GIMPLE stream.
> > >  @end deftypefn
> > >
> > > +@deftypefn {Target Hook} bool TARGET_MOVE_WITH_MODE_P (machine_mode 
> > > @var{mode})
> > > +This target hook returns true if move with mode @var{mode} can be
> > > +generated implicitly.  The default definition returns true.
> > > +@end deftypefn
> > > +
> > >  @deftypefn {Target Hook} int TARGET_COMPARE_VERSION_PRIORITY (tree 
> > > @var{decl1}, tree @var{decl2})
> > >  This hook is used to compare the target attributes in two functions to
> > >  determine which function's features get higher priority.  This is used
> > > diff --git a/gcc/doc/tm.texi.in b/gcc/doc/tm.texi.in
> > > index 95e5e341f07..e9158ab90c6 100644
> > > --- a/gcc/doc/tm.texi.in
> > > +++ b/gcc/doc/tm.texi.in
> > > @@ -7924,6 +7924,8 @@ to by @var{ce_info}.
> > >
> > >  @hook TARGET_GIMPLE_FOLD_BUILTIN
> > >
> > > +@hook TARGET_MOVE_WITH_MODE_P
> > > +
> > >  @hook TARGET_COMPARE_VERSION_PRIORITY
> > >
> > >  @hook TARGET_GET_FUNCTION_VERSIONS_DISPATCHER
> > > diff --git a/gcc/gimple-fold.cc b/gcc/gimple-fold.cc
> > > index c9179abb27e..93267eeabb2 100644
> > > --- a/gcc/gimple-fold.cc
> > > +++ b/gcc/gimple-fold.cc
> > > @@ -1007,6 +1007,7 @@ gimple_fold_builtin_memory_op (gimple_stmt_iterator 
> > > *gsi,
> > >               if (int_mode_for_size (ilen * 8, 0).exists (&mode)
> > >                   && GET_MODE_SIZE (mode) * BITS_PER_UNIT == ilen * 8
> > >                   && have_insn_for (SET, mode)
> > > +                 && targetm.move_with_mode_p (mode)
> > >                   /* If the destination pointer is not aligned we must be 
> > > able
> > >                      to emit an unaligned store.  */
> > >                   && (dest_align >= GET_MODE_ALIGNMENT (mode)
> > > diff --git a/gcc/target.def b/gcc/target.def
> > > index 72c2e1ef756..041d944cb38 100644
> > > --- a/gcc/target.def
> > > +++ b/gcc/target.def
> > > @@ -2489,6 +2489,13 @@ was made to the GIMPLE stream.",
> > >   bool, (gimple_stmt_iterator *gsi),
> > >   hook_bool_gsiptr_false)
> > >
> > > +DEFHOOK
> > > +(move_with_mode_p,
> > > + "This target hook returns true if move with mode @var{mode} can be\n\
> > > +generated implicitly.  The default definition returns true.",
> > > + bool, (machine_mode mode),
> > > + hook_bool_mode_true)
> > > +
> > >  /* Target hook is used to compare the target attributes in two functions 
> > > to
> > >     determine which function's features get higher priority.  This is used
> > >     during function multi-versioning to figure out the order in which two
> > > diff --git a/gcc/testsuite/gcc.target/i386/pr103393-1.c 
> > > b/gcc/testsuite/gcc.target/i386/pr103393-1.c
> > > new file mode 100644
> > > index 00000000000..2091d33c45f
> > > --- /dev/null
> > > +++ b/gcc/testsuite/gcc.target/i386/pr103393-1.c
> > > @@ -0,0 +1,16 @@
> > > +/* { dg-do compile } */
> > > +/* { dg-options "-O2 -march=x86-64 -mavx -mprefer-avx128 
> > > -mprefer-vector-width=128" } */
> > > +
> > > +struct TestData {
> > > +  float arr[8];
> > > +};
> > > +
> > > +void
> > > +cpy (struct TestData *s1, struct TestData *s2 )
> > > +{
> > > +  for(int i=0; i<8; ++i)
> > > +    s1->arr[i] = s2->arr[i];
> > > +}
> > > +
> > > +/* { dg-final { scan-assembler "jmp\[\\t \]+_?memmove" { target { ! ia32 
> > > } } } } */
> > > +/* { dg-final { scan-assembler "call\[\\t \]+_?memmove" { target ia32 } 
> > > } } */
> > > diff --git a/gcc/testsuite/gcc.target/i386/pr103393-2.c 
> > > b/gcc/testsuite/gcc.target/i386/pr103393-2.c
> > > new file mode 100644
> > > index 00000000000..4ad8c8bf379
> > > --- /dev/null
> > > +++ b/gcc/testsuite/gcc.target/i386/pr103393-2.c
> > > @@ -0,0 +1,16 @@
> > > +/* { dg-do compile } */
> > > +/* { dg-options "-O2 -march=skylake-avx512" } */
> > > +
> > > +struct TestData {
> > > +  float arr[8];
> > > +};
> > > +
> > > +void
> > > +cpy (struct TestData *s1, struct TestData *s2 )
> > > +{
> > > +  for(int i=0; i<16; ++i)
> > > +    s1->arr[i] = s2->arr[i];
> > > +}
> > > +
> > > +/* { dg-final { scan-assembler "jmp\[\\t \]+_?memmove" { target { ! ia32 
> > > } } } } */
> > > +/* { dg-final { scan-assembler "call\[\\t \]+_?memmove" { target ia32 } 
> > > } } */
> > > diff --git a/gcc/testsuite/gcc.target/i386/pr103393-3.c 
> > > b/gcc/testsuite/gcc.target/i386/pr103393-3.c
> > > new file mode 100644
> > > index 00000000000..7a03160e512
> > > --- /dev/null
> > > +++ b/gcc/testsuite/gcc.target/i386/pr103393-3.c
> > > @@ -0,0 +1,16 @@
> > > +/* { dg-do compile } */
> > > +/* { dg-options "-O2 -march=sapphirerapids" } */
> > > +
> > > +struct TestData {
> > > +  float arr[8];
> > > +};
> > > +
> > > +void
> > > +cpy (struct TestData *s1, struct TestData *s2 )
> > > +{
> > > +  for(int i=0; i<16; ++i)
> > > +    s1->arr[i] = s2->arr[i];
> > > +}
> > > +
> > > +/* { dg-final { scan-assembler-times "vmovdqu64\[ \\t\]+\[^\n\]*%zmm" 2 
> > > } } */
> > > +/* { dg-final { scan-assembler-not "\[\\t \]+_?memmove" } } */
> > > diff --git a/gcc/testsuite/gcc.target/i386/pr103393-4.c 
> > > b/gcc/testsuite/gcc.target/i386/pr103393-4.c
> > > new file mode 100644
> > > index 00000000000..ae2599f6411
> > > --- /dev/null
> > > +++ b/gcc/testsuite/gcc.target/i386/pr103393-4.c
> > > @@ -0,0 +1,16 @@
> > > +/* { dg-do compile } */
> > > +/* { dg-options "-O2 -march=skylake-avx512 -mstore-max=512" } */
> > > +
> > > +struct TestData {
> > > +  float arr[8];
> > > +};
> > > +
> > > +void
> > > +cpy (struct TestData *s1, struct TestData *s2 )
> > > +{
> > > +  for(int i=0; i<16; ++i)
> > > +    s1->arr[i] = s2->arr[i];
> > > +}
> > > +
> > > +/* { dg-final { scan-assembler-times "vmovdqu64\[ \\t\]+\[^\n\]*%zmm" 2 
> > > } } */
> > > +/* { dg-final { scan-assembler-not "\[\\t \]+_?memmove" } } */
> > > diff --git a/gcc/testsuite/gcc.target/i386/pr103393-5.c 
> > > b/gcc/testsuite/gcc.target/i386/pr103393-5.c
> > > new file mode 100644
> > > index 00000000000..f9173684212
> > > --- /dev/null
> > > +++ b/gcc/testsuite/gcc.target/i386/pr103393-5.c
> > > @@ -0,0 +1,16 @@
> > > +/* { dg-do compile } */
> > > +/* { dg-options "-O2 -march=skylake-avx512 -mmove-max=512" } */
> > > +
> > > +struct TestData {
> > > +  float arr[8];
> > > +};
> > > +
> > > +void
> > > +cpy (struct TestData *s1, struct TestData *s2 )
> > > +{
> > > +  for(int i=0; i<16; ++i)
> > > +    s1->arr[i] = s2->arr[i];
> > > +}
> > > +
> > > +/* { dg-final { scan-assembler-times "vmovdqu64\[ \\t\]+\[^\n\]*%zmm" 2 
> > > } } */
> > > +/* { dg-final { scan-assembler-not "\[\\t \]+_?memmove" } } */
> > > --
> > > 2.35.1
> > >
>
>
>
> --
> H.J.

Reply via email to