On Wed, Mar 9, 2022 at 7:08 PM H.J. Lu <hjl.to...@gmail.com> wrote:
>
> On Wed, Mar 9, 2022 at 12:25 AM Richard Biener
> <richard.guent...@gmail.com> wrote:
> >
> > On Tue, Mar 8, 2022 at 4:44 PM H.J. Lu <hjl.to...@gmail.com> wrote:
> > >
> > > On Mon, Mar 7, 2022 at 5:45 AM Richard Biener
> > > <richard.guent...@gmail.com> wrote:
> > > >
> > > > On Wed, Mar 2, 2022 at 10:18 PM H.J. Lu <hjl.to...@gmail.com> wrote:
> > > > >
> > > > > On Wed, Mar 02, 2022 at 09:51:26AM +0100, Richard Biener wrote:
> > > > > > On Tue, Mar 1, 2022 at 11:41 PM H.J. Lu via Gcc-patches
> > > > > > <gcc-patches@gcc.gnu.org> wrote:
> > > > > > >
> > > > > > > Add TARGET_FOLD_MEMCPY_MAX for the maximum number of bytes to 
> > > > > > > fold memcpy.
> > > > > > > The default is
> > > > > > >
> > > > > > > MOVE_MAX * MOVE_RATIO (optimize_function_for_size_p (cfun))
> > > > > > >
> > > > > > > For x86, it is MOVE_MAX to restore the old behavior before
> > > > > >
> > > > > > I know we've discussed this to death in the PR, I just want to 
> > > > > > repeat here
> > > > > > that the GIMPLE folding expects to generate a single load and a 
> > > > > > single
> > > > > > store (that is what it does on the GIMPLE level) which is why 
> > > > > > MOVE_MAX
> > > > > > was chosen originally (it's documented to what a "single 
> > > > > > instruction" does).
> > > > > > In practice MOVE_MAX does not seem to cover vector register sizes
> > > > > > so Richard pulled MOVE_RATIO which is really intended to cover
> > > > > > the case of using multiple instructions for moving memory (but then 
> > > > > > I
> > > > > > don't remember whether for the ARM case the single load/store GIMPLE
> > > > > > will be expanded to multiple load/store instructions).
> > > > > >
> > > > > > TARGET_FOLD_MEMCPY_MAX sounds like a stop-gap solution,
> > > > > > being very specific for memcpy folding (we also fold memmove btw).
> > > > > >
> > > > > > There is also MOVE_MAX_PIECES which _might_ be more appropriate
> > > > > > than MOVE_MAX here and still honor the idea of single instructions.
> > > > > > Now neither arm nor aarch64 define this and it defaults to MOVE_MAX,
> > > > > > not MOVE_MAX * MOVE_RATIO.
> > > > > >
> > > > > > So if we need a new hook then that hook should at least get the
> > > > > > 'speed' argument of MOVE_RATIO and it should get a better name.
> > > > > >
> > > > > > I still think that it should be possible to improve the insn check 
> > > > > > to
> > > > > > avoid use of "disabled" modes, maybe that's also a point to add
> > > > > > a new hook like .move_with_mode_p or so?  To quote, we do
> > > > >
> > > > > Here is the v2 patch to add TARGET_MOVE_WITH_MODE_P.
> > > >
> > > > Again I'd like to shine light on MOVE_MAX_PIECES which explicitely
> > > > mentions "a load or store used TO COPY MEMORY" (emphasis mine)
> > > > and whose x86 implementation would already be fine (doing larger moves
> > > > and also not doing too large moves).  But appearantly the arm folks
> > > > decided that that's not fit and instead (mis-?)used MOVE_MAX * 
> > > > MOVE_RATIO.
> > > >
> > > > Yes, MOVE_MAX_PIECES is documented to apply to move_by_pieces.
> > > > Still GIMPLE memcpy/memmove inlining wants to mimic exactly that but
> > > > restrict itself to a single load and a single store.
> > > >
> > > > > >
> > > > > >               scalar_int_mode mode;
> > > > > >               if (int_mode_for_size (ilen * 8, 0).exists (&mode)
> > > > > >                   && GET_MODE_SIZE (mode) * BITS_PER_UNIT == ilen * 
> > > > > > 8
> > > > > >                   && have_insn_for (SET, mode)
> > > > > >                   /* If the destination pointer is not aligned we 
> > > > > > must be able
> > > > > >                      to emit an unaligned store.  */
> > > > > >                   && (dest_align >= GET_MODE_ALIGNMENT (mode)
> > > > > >                       || !targetm.slow_unaligned_access (mode, 
> > > > > > dest_align)
> > > > > >                       || (optab_handler (movmisalign_optab, mode)
> > > > > >                           != CODE_FOR_nothing)))
> > > > > >
> > > > > > where I understand the ISA is enabled and if the user explicitely
> > > > > > uses it that's OK but -mprefer-avx128 should tell GCC to never
> > > > > > generate AVX256 code where the user was not explicitely using it
> > > > > > (still for example glibc might happily use AVX256 code to implement
> > > > > > the memcpy we are folding!)
> > > > > >
> > > > > > Note the BB vectorizer also might end up with using AVX256 because
> > > > > > in places it also relies on optab queries and the 
> > > > > > vector_mode_supported_p
> > > > > > check (but the memcpy folding uses the fake integer modes).  So
> > > > > > x86 might need to implement the related_mode hook to avoid 
> > > > > > "auto"-using
> > > > > > a larger vector mode which the default implementation would happily 
> > > > > > do.
> > > > > >
> > > > > > Richard.
> > > > >
> > > > > OK for master?
> > > >
> > > > Looking for opinions from others as well.
> > > >
> > > > Btw, there's a similar use in expand_DEFERRED_INIT:
> > > >
> > > >           && int_mode_for_size (tree_to_uhwi (var_size) * BITS_PER_UNIT,
> > > >                                 0).exists (&var_mode)
> > > >           && have_insn_for (SET, var_mode))
> > > >
> > > > So it occured to me that maybe targetm.move_with_mode_p should 
> > > > eventually
> > >
> > > TARGET_MOVE_WITH_MODE_P shouldn't be used here since we do want to
> > > use var_mode.
> > >
> > > > check have_insn_for (SET, var_mode) or we should abstract checking the 
> > > > two
> > > > things to a generic API somewhere (in optabs-query.h maybe, or expmed.h,
> > > > not sure where it would be more appropriate).
> > > >
> > > > > +@deftypefn {Target Hook} bool TARGET_MOVE_WITH_MODE_P (machine_mode 
> > > > > @var{mode})
> > > > > +This target hook returns true if move with mode @var{mode} can be
> > > > > +generated implicitly.  The default definition returns true.
> > > > > +@end deftypefn
> > > >
> > > > I know what you mean but I'm not sure "can be generated implicitly" 
> > > > captures
> > > > that.  The compiler always generated stuff explicitely.  It's also
> > > > "should", not "can", no?
> > >
> > > TARGET_MOVE_WITH_MODE_P is a target hook for compiler optimization.
> > > It is totally OK for TARGET_MOVE_WITH_MODE_P to return false.   Will
> > > TARGET_FAST_MOVE_WITH_MODE_P be better?
> >
> > I thought the PR to address was about -mprefer-avx128 but memcpy using
> > YMM regs?  So it's not about "fast" it's about fulfilling user 
> > expectations, no?
> >
> > As said elsewhere the vectorizer gets to honor -mprefer-avx128 via the
> > vector_modes hook to iterate over but also with the related_mode hook
> > (which x86 chooses to not implement).  It sounds like we need something
> > similar for integer modes (aka "move" modes), but then does x86 "honor"
> > -mprefer-avx128 when doing *_by_pieces?  And how does it do that there?
>
> TARGET_PREFERRED_MOVE_WITH_MODE_P?

That's an alternative suggestion for the name?  Looking at what we have
(I'm still trying to avoid yet another hook), there is
TARGET_USE_BY_PIECES_INFRASTRUCTURE_P which gets passed
length, alignment, kind and speed_p.  I think that hook would be the one
to tell the middle-end the largest piece (mode or size) to use?

GIMPLE folding would then disable itself if that largest mode < size.
The default implementation would use MOVE_MAX_PIECES * MOVE_RATIO
(instead of MOVE_MAX * MOVE_RATIO) and arn could implement
this hook to unleash it to not only allow MAX_MOVE_PIECES but
MAX_MOVE_PIECES * MOVE_RATIO here.  That also highlights that
Richards change was somewhat dubious since a corresponding change
in _by_pieces is not done (and following through would then enable
this there).  That is, it looks like Richards change is a (bad) workaround
for GIMPLE folding not considering multiple loads/stores?

The hook is implemented by 5 targets right now so the amount of changes
should be managable, the obvious default max-mode size is then
just {MOVE,STORE,COMPARE}_MAX_PIECES.

Given all of the above I'm inclined to revert the offending change and
instead change MOVE_MAX to MOVE_MAX_PIECES (because we're
doing by-pieces with an additional restriction of 1 piece).

Richard.

> > Richard.
> >
> > > > Thanks,
> > > > Richard.
> > > >
> > > > > Thanks.
> > > > >
> > > > > H.J.
> > > > > ---
> > > > > Add TARGET_MOVE_WITH_MODE_P to return true if move with mode can be
> > > > > generated implicitly.  The default definition returns true.  The x86
> > > > > version returns true if the mode size <= MOVE_MAX, which is the max
> > > > > number of bytes we can move in one reasonably fast instruction.
> > > > >
> > > > > gcc/
> > > > >
> > > > >         PR target/103393
> > > > >         * gimple-fold.cc (gimple_fold_builtin_memory_op): Call
> > > > >         targetm.move_with_mode_p to check if move with mode can be
> > > > >         generated implicitly.
> > > > >         * target.def: Add move_with_mode_p.
> > > > >         * targhooks.cc (default_move_with_mode_p): New.
> > > > >         * targhooks.h (default_move_with_mode_p): Likewise.
> > > > >         * config/i386/i386.cc (ix86_move_with_mode_p): New.
> > > > >         (TARGET_MOVE_WITH_MODE_P): Likewise.
> > > > >         * doc/tm.texi.in: Add TARGET_MOVE_WITH_MODE_P.
> > > > >         * doc/tm.texi: Regenerate.
> > > > >
> > > > > gcc/testsuite/
> > > > >
> > > > >         PR target/103393
> > > > >         * gcc.target/i386/pr103393-1.c: New test.
> > > > >         * gcc.target/i386/pr103393-2.c: Likewise.
> > > > >         * gcc.target/i386/pr103393-3.c: Likewise.
> > > > >         * gcc.target/i386/pr103393-4.c: Likewise.
> > > > >         * gcc.target/i386/pr103393-5.c: Likewise.
> > > > > ---
> > > > >  gcc/config/i386/i386.cc                    | 12 ++++++++++++
> > > > >  gcc/doc/tm.texi                            |  5 +++++
> > > > >  gcc/doc/tm.texi.in                         |  2 ++
> > > > >  gcc/gimple-fold.cc                         |  1 +
> > > > >  gcc/target.def                             |  7 +++++++
> > > > >  gcc/testsuite/gcc.target/i386/pr103393-1.c | 16 ++++++++++++++++
> > > > >  gcc/testsuite/gcc.target/i386/pr103393-2.c | 16 ++++++++++++++++
> > > > >  gcc/testsuite/gcc.target/i386/pr103393-3.c | 16 ++++++++++++++++
> > > > >  gcc/testsuite/gcc.target/i386/pr103393-4.c | 16 ++++++++++++++++
> > > > >  gcc/testsuite/gcc.target/i386/pr103393-5.c | 16 ++++++++++++++++
> > > > >  10 files changed, 107 insertions(+)
> > > > >  create mode 100644 gcc/testsuite/gcc.target/i386/pr103393-1.c
> > > > >  create mode 100644 gcc/testsuite/gcc.target/i386/pr103393-2.c
> > > > >  create mode 100644 gcc/testsuite/gcc.target/i386/pr103393-3.c
> > > > >  create mode 100644 gcc/testsuite/gcc.target/i386/pr103393-4.c
> > > > >  create mode 100644 gcc/testsuite/gcc.target/i386/pr103393-5.c
> > > > >
> > > > > diff --git a/gcc/config/i386/i386.cc b/gcc/config/i386/i386.cc
> > > > > index b2bf90576d5..68a2c59220c 100644
> > > > > --- a/gcc/config/i386/i386.cc
> > > > > +++ b/gcc/config/i386/i386.cc
> > > > > @@ -23918,6 +23918,15 @@ ix86_push_rounding (poly_int64 bytes)
> > > > >    return ROUND_UP (bytes, UNITS_PER_WORD);
> > > > >  }
> > > > >
> > > > > +/* Implement the TARGET_MOVE_WITH_MODE_P hook.  Return true if move
> > > > > +   with MODE can be generated implicitly.  */
> > > > > +
> > > > > +static bool
> > > > > +ix86_move_with_mode_p (machine_mode mode)
> > > > > +{
> > > > > +  return GET_MODE_SIZE (mode) <= MOVE_MAX;
> > > > > +}
> > > > > +
> > > > >  /* Target-specific selftests.  */
> > > > >
> > > > >  #if CHECKING_P
> > > > > @@ -24735,6 +24744,9 @@ static bool ix86_libc_has_fast_function (int 
> > > > > fcode ATTRIBUTE_UNUSED)
> > > > >  #define TARGET_RUN_TARGET_SELFTESTS selftest::ix86_run_selftests
> > > > >  #endif /* #if CHECKING_P */
> > > > >
> > > > > +#undef TARGET_MOVE_WITH_MODE_P
> > > > > +#define TARGET_MOVE_WITH_MODE_P ix86_move_with_mode_p
> > > > > +
> > > > >  struct gcc_target targetm = TARGET_INITIALIZER;
> > > > >
> > > > >  #include "gt-i386.h"
> > > > > diff --git a/gcc/doc/tm.texi b/gcc/doc/tm.texi
> > > > > index 49864dd79f8..9d5058610e1 100644
> > > > > --- a/gcc/doc/tm.texi
> > > > > +++ b/gcc/doc/tm.texi
> > > > > @@ -11924,6 +11924,11 @@ statement holding the function call.  
> > > > > Returns true if any change
> > > > >  was made to the GIMPLE stream.
> > > > >  @end deftypefn
> > > > >
> > > > > +@deftypefn {Target Hook} bool TARGET_MOVE_WITH_MODE_P (machine_mode 
> > > > > @var{mode})
> > > > > +This target hook returns true if move with mode @var{mode} can be
> > > > > +generated implicitly.  The default definition returns true.
> > > > > +@end deftypefn
> > > > > +
> > > > >  @deftypefn {Target Hook} int TARGET_COMPARE_VERSION_PRIORITY (tree 
> > > > > @var{decl1}, tree @var{decl2})
> > > > >  This hook is used to compare the target attributes in two functions 
> > > > > to
> > > > >  determine which function's features get higher priority.  This is 
> > > > > used
> > > > > diff --git a/gcc/doc/tm.texi.in b/gcc/doc/tm.texi.in
> > > > > index 95e5e341f07..e9158ab90c6 100644
> > > > > --- a/gcc/doc/tm.texi.in
> > > > > +++ b/gcc/doc/tm.texi.in
> > > > > @@ -7924,6 +7924,8 @@ to by @var{ce_info}.
> > > > >
> > > > >  @hook TARGET_GIMPLE_FOLD_BUILTIN
> > > > >
> > > > > +@hook TARGET_MOVE_WITH_MODE_P
> > > > > +
> > > > >  @hook TARGET_COMPARE_VERSION_PRIORITY
> > > > >
> > > > >  @hook TARGET_GET_FUNCTION_VERSIONS_DISPATCHER
> > > > > diff --git a/gcc/gimple-fold.cc b/gcc/gimple-fold.cc
> > > > > index c9179abb27e..93267eeabb2 100644
> > > > > --- a/gcc/gimple-fold.cc
> > > > > +++ b/gcc/gimple-fold.cc
> > > > > @@ -1007,6 +1007,7 @@ gimple_fold_builtin_memory_op 
> > > > > (gimple_stmt_iterator *gsi,
> > > > >               if (int_mode_for_size (ilen * 8, 0).exists (&mode)
> > > > >                   && GET_MODE_SIZE (mode) * BITS_PER_UNIT == ilen * 8
> > > > >                   && have_insn_for (SET, mode)
> > > > > +                 && targetm.move_with_mode_p (mode)
> > > > >                   /* If the destination pointer is not aligned we 
> > > > > must be able
> > > > >                      to emit an unaligned store.  */
> > > > >                   && (dest_align >= GET_MODE_ALIGNMENT (mode)
> > > > > diff --git a/gcc/target.def b/gcc/target.def
> > > > > index 72c2e1ef756..041d944cb38 100644
> > > > > --- a/gcc/target.def
> > > > > +++ b/gcc/target.def
> > > > > @@ -2489,6 +2489,13 @@ was made to the GIMPLE stream.",
> > > > >   bool, (gimple_stmt_iterator *gsi),
> > > > >   hook_bool_gsiptr_false)
> > > > >
> > > > > +DEFHOOK
> > > > > +(move_with_mode_p,
> > > > > + "This target hook returns true if move with mode @var{mode} can 
> > > > > be\n\
> > > > > +generated implicitly.  The default definition returns true.",
> > > > > + bool, (machine_mode mode),
> > > > > + hook_bool_mode_true)
> > > > > +
> > > > >  /* Target hook is used to compare the target attributes in two 
> > > > > functions to
> > > > >     determine which function's features get higher priority.  This is 
> > > > > used
> > > > >     during function multi-versioning to figure out the order in which 
> > > > > two
> > > > > diff --git a/gcc/testsuite/gcc.target/i386/pr103393-1.c 
> > > > > b/gcc/testsuite/gcc.target/i386/pr103393-1.c
> > > > > new file mode 100644
> > > > > index 00000000000..2091d33c45f
> > > > > --- /dev/null
> > > > > +++ b/gcc/testsuite/gcc.target/i386/pr103393-1.c
> > > > > @@ -0,0 +1,16 @@
> > > > > +/* { dg-do compile } */
> > > > > +/* { dg-options "-O2 -march=x86-64 -mavx -mprefer-avx128 
> > > > > -mprefer-vector-width=128" } */
> > > > > +
> > > > > +struct TestData {
> > > > > +  float arr[8];
> > > > > +};
> > > > > +
> > > > > +void
> > > > > +cpy (struct TestData *s1, struct TestData *s2 )
> > > > > +{
> > > > > +  for(int i=0; i<8; ++i)
> > > > > +    s1->arr[i] = s2->arr[i];
> > > > > +}
> > > > > +
> > > > > +/* { dg-final { scan-assembler "jmp\[\\t \]+_?memmove" { target { ! 
> > > > > ia32 } } } } */
> > > > > +/* { dg-final { scan-assembler "call\[\\t \]+_?memmove" { target 
> > > > > ia32 } } } */
> > > > > diff --git a/gcc/testsuite/gcc.target/i386/pr103393-2.c 
> > > > > b/gcc/testsuite/gcc.target/i386/pr103393-2.c
> > > > > new file mode 100644
> > > > > index 00000000000..4ad8c8bf379
> > > > > --- /dev/null
> > > > > +++ b/gcc/testsuite/gcc.target/i386/pr103393-2.c
> > > > > @@ -0,0 +1,16 @@
> > > > > +/* { dg-do compile } */
> > > > > +/* { dg-options "-O2 -march=skylake-avx512" } */
> > > > > +
> > > > > +struct TestData {
> > > > > +  float arr[8];
> > > > > +};
> > > > > +
> > > > > +void
> > > > > +cpy (struct TestData *s1, struct TestData *s2 )
> > > > > +{
> > > > > +  for(int i=0; i<16; ++i)
> > > > > +    s1->arr[i] = s2->arr[i];
> > > > > +}
> > > > > +
> > > > > +/* { dg-final { scan-assembler "jmp\[\\t \]+_?memmove" { target { ! 
> > > > > ia32 } } } } */
> > > > > +/* { dg-final { scan-assembler "call\[\\t \]+_?memmove" { target 
> > > > > ia32 } } } */
> > > > > diff --git a/gcc/testsuite/gcc.target/i386/pr103393-3.c 
> > > > > b/gcc/testsuite/gcc.target/i386/pr103393-3.c
> > > > > new file mode 100644
> > > > > index 00000000000..7a03160e512
> > > > > --- /dev/null
> > > > > +++ b/gcc/testsuite/gcc.target/i386/pr103393-3.c
> > > > > @@ -0,0 +1,16 @@
> > > > > +/* { dg-do compile } */
> > > > > +/* { dg-options "-O2 -march=sapphirerapids" } */
> > > > > +
> > > > > +struct TestData {
> > > > > +  float arr[8];
> > > > > +};
> > > > > +
> > > > > +void
> > > > > +cpy (struct TestData *s1, struct TestData *s2 )
> > > > > +{
> > > > > +  for(int i=0; i<16; ++i)
> > > > > +    s1->arr[i] = s2->arr[i];
> > > > > +}
> > > > > +
> > > > > +/* { dg-final { scan-assembler-times "vmovdqu64\[ 
> > > > > \\t\]+\[^\n\]*%zmm" 2 } } */
> > > > > +/* { dg-final { scan-assembler-not "\[\\t \]+_?memmove" } } */
> > > > > diff --git a/gcc/testsuite/gcc.target/i386/pr103393-4.c 
> > > > > b/gcc/testsuite/gcc.target/i386/pr103393-4.c
> > > > > new file mode 100644
> > > > > index 00000000000..ae2599f6411
> > > > > --- /dev/null
> > > > > +++ b/gcc/testsuite/gcc.target/i386/pr103393-4.c
> > > > > @@ -0,0 +1,16 @@
> > > > > +/* { dg-do compile } */
> > > > > +/* { dg-options "-O2 -march=skylake-avx512 -mstore-max=512" } */
> > > > > +
> > > > > +struct TestData {
> > > > > +  float arr[8];
> > > > > +};
> > > > > +
> > > > > +void
> > > > > +cpy (struct TestData *s1, struct TestData *s2 )
> > > > > +{
> > > > > +  for(int i=0; i<16; ++i)
> > > > > +    s1->arr[i] = s2->arr[i];
> > > > > +}
> > > > > +
> > > > > +/* { dg-final { scan-assembler-times "vmovdqu64\[ 
> > > > > \\t\]+\[^\n\]*%zmm" 2 } } */
> > > > > +/* { dg-final { scan-assembler-not "\[\\t \]+_?memmove" } } */
> > > > > diff --git a/gcc/testsuite/gcc.target/i386/pr103393-5.c 
> > > > > b/gcc/testsuite/gcc.target/i386/pr103393-5.c
> > > > > new file mode 100644
> > > > > index 00000000000..f9173684212
> > > > > --- /dev/null
> > > > > +++ b/gcc/testsuite/gcc.target/i386/pr103393-5.c
> > > > > @@ -0,0 +1,16 @@
> > > > > +/* { dg-do compile } */
> > > > > +/* { dg-options "-O2 -march=skylake-avx512 -mmove-max=512" } */
> > > > > +
> > > > > +struct TestData {
> > > > > +  float arr[8];
> > > > > +};
> > > > > +
> > > > > +void
> > > > > +cpy (struct TestData *s1, struct TestData *s2 )
> > > > > +{
> > > > > +  for(int i=0; i<16; ++i)
> > > > > +    s1->arr[i] = s2->arr[i];
> > > > > +}
> > > > > +
> > > > > +/* { dg-final { scan-assembler-times "vmovdqu64\[ 
> > > > > \\t\]+\[^\n\]*%zmm" 2 } } */
> > > > > +/* { dg-final { scan-assembler-not "\[\\t \]+_?memmove" } } */
> > > > > --
> > > > > 2.35.1
> > > > >
> > >
> > >
> > >
> > > --
> > > H.J.
>
>
>
> --
> H.J.

Reply via email to