On Wed, Mar 9, 2022 at 7:08 PM H.J. Lu <hjl.to...@gmail.com> wrote: > > On Wed, Mar 9, 2022 at 12:25 AM Richard Biener > <richard.guent...@gmail.com> wrote: > > > > On Tue, Mar 8, 2022 at 4:44 PM H.J. Lu <hjl.to...@gmail.com> wrote: > > > > > > On Mon, Mar 7, 2022 at 5:45 AM Richard Biener > > > <richard.guent...@gmail.com> wrote: > > > > > > > > On Wed, Mar 2, 2022 at 10:18 PM H.J. Lu <hjl.to...@gmail.com> wrote: > > > > > > > > > > On Wed, Mar 02, 2022 at 09:51:26AM +0100, Richard Biener wrote: > > > > > > On Tue, Mar 1, 2022 at 11:41 PM H.J. Lu via Gcc-patches > > > > > > <gcc-patches@gcc.gnu.org> wrote: > > > > > > > > > > > > > > Add TARGET_FOLD_MEMCPY_MAX for the maximum number of bytes to > > > > > > > fold memcpy. > > > > > > > The default is > > > > > > > > > > > > > > MOVE_MAX * MOVE_RATIO (optimize_function_for_size_p (cfun)) > > > > > > > > > > > > > > For x86, it is MOVE_MAX to restore the old behavior before > > > > > > > > > > > > I know we've discussed this to death in the PR, I just want to > > > > > > repeat here > > > > > > that the GIMPLE folding expects to generate a single load and a > > > > > > single > > > > > > store (that is what it does on the GIMPLE level) which is why > > > > > > MOVE_MAX > > > > > > was chosen originally (it's documented to what a "single > > > > > > instruction" does). > > > > > > In practice MOVE_MAX does not seem to cover vector register sizes > > > > > > so Richard pulled MOVE_RATIO which is really intended to cover > > > > > > the case of using multiple instructions for moving memory (but then > > > > > > I > > > > > > don't remember whether for the ARM case the single load/store GIMPLE > > > > > > will be expanded to multiple load/store instructions). > > > > > > > > > > > > TARGET_FOLD_MEMCPY_MAX sounds like a stop-gap solution, > > > > > > being very specific for memcpy folding (we also fold memmove btw). > > > > > > > > > > > > There is also MOVE_MAX_PIECES which _might_ be more appropriate > > > > > > than MOVE_MAX here and still honor the idea of single instructions. > > > > > > Now neither arm nor aarch64 define this and it defaults to MOVE_MAX, > > > > > > not MOVE_MAX * MOVE_RATIO. > > > > > > > > > > > > So if we need a new hook then that hook should at least get the > > > > > > 'speed' argument of MOVE_RATIO and it should get a better name. > > > > > > > > > > > > I still think that it should be possible to improve the insn check > > > > > > to > > > > > > avoid use of "disabled" modes, maybe that's also a point to add > > > > > > a new hook like .move_with_mode_p or so? To quote, we do > > > > > > > > > > Here is the v2 patch to add TARGET_MOVE_WITH_MODE_P. > > > > > > > > Again I'd like to shine light on MOVE_MAX_PIECES which explicitely > > > > mentions "a load or store used TO COPY MEMORY" (emphasis mine) > > > > and whose x86 implementation would already be fine (doing larger moves > > > > and also not doing too large moves). But appearantly the arm folks > > > > decided that that's not fit and instead (mis-?)used MOVE_MAX * > > > > MOVE_RATIO. > > > > > > > > Yes, MOVE_MAX_PIECES is documented to apply to move_by_pieces. > > > > Still GIMPLE memcpy/memmove inlining wants to mimic exactly that but > > > > restrict itself to a single load and a single store. > > > > > > > > > > > > > > > > scalar_int_mode mode; > > > > > > if (int_mode_for_size (ilen * 8, 0).exists (&mode) > > > > > > && GET_MODE_SIZE (mode) * BITS_PER_UNIT == ilen * > > > > > > 8 > > > > > > && have_insn_for (SET, mode) > > > > > > /* If the destination pointer is not aligned we > > > > > > must be able > > > > > > to emit an unaligned store. */ > > > > > > && (dest_align >= GET_MODE_ALIGNMENT (mode) > > > > > > || !targetm.slow_unaligned_access (mode, > > > > > > dest_align) > > > > > > || (optab_handler (movmisalign_optab, mode) > > > > > > != CODE_FOR_nothing))) > > > > > > > > > > > > where I understand the ISA is enabled and if the user explicitely > > > > > > uses it that's OK but -mprefer-avx128 should tell GCC to never > > > > > > generate AVX256 code where the user was not explicitely using it > > > > > > (still for example glibc might happily use AVX256 code to implement > > > > > > the memcpy we are folding!) > > > > > > > > > > > > Note the BB vectorizer also might end up with using AVX256 because > > > > > > in places it also relies on optab queries and the > > > > > > vector_mode_supported_p > > > > > > check (but the memcpy folding uses the fake integer modes). So > > > > > > x86 might need to implement the related_mode hook to avoid > > > > > > "auto"-using > > > > > > a larger vector mode which the default implementation would happily > > > > > > do. > > > > > > > > > > > > Richard. > > > > > > > > > > OK for master? > > > > > > > > Looking for opinions from others as well. > > > > > > > > Btw, there's a similar use in expand_DEFERRED_INIT: > > > > > > > > && int_mode_for_size (tree_to_uhwi (var_size) * BITS_PER_UNIT, > > > > 0).exists (&var_mode) > > > > && have_insn_for (SET, var_mode)) > > > > > > > > So it occured to me that maybe targetm.move_with_mode_p should > > > > eventually > > > > > > TARGET_MOVE_WITH_MODE_P shouldn't be used here since we do want to > > > use var_mode. > > > > > > > check have_insn_for (SET, var_mode) or we should abstract checking the > > > > two > > > > things to a generic API somewhere (in optabs-query.h maybe, or expmed.h, > > > > not sure where it would be more appropriate). > > > > > > > > > +@deftypefn {Target Hook} bool TARGET_MOVE_WITH_MODE_P (machine_mode > > > > > @var{mode}) > > > > > +This target hook returns true if move with mode @var{mode} can be > > > > > +generated implicitly. The default definition returns true. > > > > > +@end deftypefn > > > > > > > > I know what you mean but I'm not sure "can be generated implicitly" > > > > captures > > > > that. The compiler always generated stuff explicitely. It's also > > > > "should", not "can", no? > > > > > > TARGET_MOVE_WITH_MODE_P is a target hook for compiler optimization. > > > It is totally OK for TARGET_MOVE_WITH_MODE_P to return false. Will > > > TARGET_FAST_MOVE_WITH_MODE_P be better? > > > > I thought the PR to address was about -mprefer-avx128 but memcpy using > > YMM regs? So it's not about "fast" it's about fulfilling user > > expectations, no? > > > > As said elsewhere the vectorizer gets to honor -mprefer-avx128 via the > > vector_modes hook to iterate over but also with the related_mode hook > > (which x86 chooses to not implement). It sounds like we need something > > similar for integer modes (aka "move" modes), but then does x86 "honor" > > -mprefer-avx128 when doing *_by_pieces? And how does it do that there? > > TARGET_PREFERRED_MOVE_WITH_MODE_P?
That's an alternative suggestion for the name? Looking at what we have (I'm still trying to avoid yet another hook), there is TARGET_USE_BY_PIECES_INFRASTRUCTURE_P which gets passed length, alignment, kind and speed_p. I think that hook would be the one to tell the middle-end the largest piece (mode or size) to use? GIMPLE folding would then disable itself if that largest mode < size. The default implementation would use MOVE_MAX_PIECES * MOVE_RATIO (instead of MOVE_MAX * MOVE_RATIO) and arn could implement this hook to unleash it to not only allow MAX_MOVE_PIECES but MAX_MOVE_PIECES * MOVE_RATIO here. That also highlights that Richards change was somewhat dubious since a corresponding change in _by_pieces is not done (and following through would then enable this there). That is, it looks like Richards change is a (bad) workaround for GIMPLE folding not considering multiple loads/stores? The hook is implemented by 5 targets right now so the amount of changes should be managable, the obvious default max-mode size is then just {MOVE,STORE,COMPARE}_MAX_PIECES. Given all of the above I'm inclined to revert the offending change and instead change MOVE_MAX to MOVE_MAX_PIECES (because we're doing by-pieces with an additional restriction of 1 piece). Richard. > > Richard. > > > > > > Thanks, > > > > Richard. > > > > > > > > > Thanks. > > > > > > > > > > H.J. > > > > > --- > > > > > Add TARGET_MOVE_WITH_MODE_P to return true if move with mode can be > > > > > generated implicitly. The default definition returns true. The x86 > > > > > version returns true if the mode size <= MOVE_MAX, which is the max > > > > > number of bytes we can move in one reasonably fast instruction. > > > > > > > > > > gcc/ > > > > > > > > > > PR target/103393 > > > > > * gimple-fold.cc (gimple_fold_builtin_memory_op): Call > > > > > targetm.move_with_mode_p to check if move with mode can be > > > > > generated implicitly. > > > > > * target.def: Add move_with_mode_p. > > > > > * targhooks.cc (default_move_with_mode_p): New. > > > > > * targhooks.h (default_move_with_mode_p): Likewise. > > > > > * config/i386/i386.cc (ix86_move_with_mode_p): New. > > > > > (TARGET_MOVE_WITH_MODE_P): Likewise. > > > > > * doc/tm.texi.in: Add TARGET_MOVE_WITH_MODE_P. > > > > > * doc/tm.texi: Regenerate. > > > > > > > > > > gcc/testsuite/ > > > > > > > > > > PR target/103393 > > > > > * gcc.target/i386/pr103393-1.c: New test. > > > > > * gcc.target/i386/pr103393-2.c: Likewise. > > > > > * gcc.target/i386/pr103393-3.c: Likewise. > > > > > * gcc.target/i386/pr103393-4.c: Likewise. > > > > > * gcc.target/i386/pr103393-5.c: Likewise. > > > > > --- > > > > > gcc/config/i386/i386.cc | 12 ++++++++++++ > > > > > gcc/doc/tm.texi | 5 +++++ > > > > > gcc/doc/tm.texi.in | 2 ++ > > > > > gcc/gimple-fold.cc | 1 + > > > > > gcc/target.def | 7 +++++++ > > > > > gcc/testsuite/gcc.target/i386/pr103393-1.c | 16 ++++++++++++++++ > > > > > gcc/testsuite/gcc.target/i386/pr103393-2.c | 16 ++++++++++++++++ > > > > > gcc/testsuite/gcc.target/i386/pr103393-3.c | 16 ++++++++++++++++ > > > > > gcc/testsuite/gcc.target/i386/pr103393-4.c | 16 ++++++++++++++++ > > > > > gcc/testsuite/gcc.target/i386/pr103393-5.c | 16 ++++++++++++++++ > > > > > 10 files changed, 107 insertions(+) > > > > > create mode 100644 gcc/testsuite/gcc.target/i386/pr103393-1.c > > > > > create mode 100644 gcc/testsuite/gcc.target/i386/pr103393-2.c > > > > > create mode 100644 gcc/testsuite/gcc.target/i386/pr103393-3.c > > > > > create mode 100644 gcc/testsuite/gcc.target/i386/pr103393-4.c > > > > > create mode 100644 gcc/testsuite/gcc.target/i386/pr103393-5.c > > > > > > > > > > diff --git a/gcc/config/i386/i386.cc b/gcc/config/i386/i386.cc > > > > > index b2bf90576d5..68a2c59220c 100644 > > > > > --- a/gcc/config/i386/i386.cc > > > > > +++ b/gcc/config/i386/i386.cc > > > > > @@ -23918,6 +23918,15 @@ ix86_push_rounding (poly_int64 bytes) > > > > > return ROUND_UP (bytes, UNITS_PER_WORD); > > > > > } > > > > > > > > > > +/* Implement the TARGET_MOVE_WITH_MODE_P hook. Return true if move > > > > > + with MODE can be generated implicitly. */ > > > > > + > > > > > +static bool > > > > > +ix86_move_with_mode_p (machine_mode mode) > > > > > +{ > > > > > + return GET_MODE_SIZE (mode) <= MOVE_MAX; > > > > > +} > > > > > + > > > > > /* Target-specific selftests. */ > > > > > > > > > > #if CHECKING_P > > > > > @@ -24735,6 +24744,9 @@ static bool ix86_libc_has_fast_function (int > > > > > fcode ATTRIBUTE_UNUSED) > > > > > #define TARGET_RUN_TARGET_SELFTESTS selftest::ix86_run_selftests > > > > > #endif /* #if CHECKING_P */ > > > > > > > > > > +#undef TARGET_MOVE_WITH_MODE_P > > > > > +#define TARGET_MOVE_WITH_MODE_P ix86_move_with_mode_p > > > > > + > > > > > struct gcc_target targetm = TARGET_INITIALIZER; > > > > > > > > > > #include "gt-i386.h" > > > > > diff --git a/gcc/doc/tm.texi b/gcc/doc/tm.texi > > > > > index 49864dd79f8..9d5058610e1 100644 > > > > > --- a/gcc/doc/tm.texi > > > > > +++ b/gcc/doc/tm.texi > > > > > @@ -11924,6 +11924,11 @@ statement holding the function call. > > > > > Returns true if any change > > > > > was made to the GIMPLE stream. > > > > > @end deftypefn > > > > > > > > > > +@deftypefn {Target Hook} bool TARGET_MOVE_WITH_MODE_P (machine_mode > > > > > @var{mode}) > > > > > +This target hook returns true if move with mode @var{mode} can be > > > > > +generated implicitly. The default definition returns true. > > > > > +@end deftypefn > > > > > + > > > > > @deftypefn {Target Hook} int TARGET_COMPARE_VERSION_PRIORITY (tree > > > > > @var{decl1}, tree @var{decl2}) > > > > > This hook is used to compare the target attributes in two functions > > > > > to > > > > > determine which function's features get higher priority. This is > > > > > used > > > > > diff --git a/gcc/doc/tm.texi.in b/gcc/doc/tm.texi.in > > > > > index 95e5e341f07..e9158ab90c6 100644 > > > > > --- a/gcc/doc/tm.texi.in > > > > > +++ b/gcc/doc/tm.texi.in > > > > > @@ -7924,6 +7924,8 @@ to by @var{ce_info}. > > > > > > > > > > @hook TARGET_GIMPLE_FOLD_BUILTIN > > > > > > > > > > +@hook TARGET_MOVE_WITH_MODE_P > > > > > + > > > > > @hook TARGET_COMPARE_VERSION_PRIORITY > > > > > > > > > > @hook TARGET_GET_FUNCTION_VERSIONS_DISPATCHER > > > > > diff --git a/gcc/gimple-fold.cc b/gcc/gimple-fold.cc > > > > > index c9179abb27e..93267eeabb2 100644 > > > > > --- a/gcc/gimple-fold.cc > > > > > +++ b/gcc/gimple-fold.cc > > > > > @@ -1007,6 +1007,7 @@ gimple_fold_builtin_memory_op > > > > > (gimple_stmt_iterator *gsi, > > > > > if (int_mode_for_size (ilen * 8, 0).exists (&mode) > > > > > && GET_MODE_SIZE (mode) * BITS_PER_UNIT == ilen * 8 > > > > > && have_insn_for (SET, mode) > > > > > + && targetm.move_with_mode_p (mode) > > > > > /* If the destination pointer is not aligned we > > > > > must be able > > > > > to emit an unaligned store. */ > > > > > && (dest_align >= GET_MODE_ALIGNMENT (mode) > > > > > diff --git a/gcc/target.def b/gcc/target.def > > > > > index 72c2e1ef756..041d944cb38 100644 > > > > > --- a/gcc/target.def > > > > > +++ b/gcc/target.def > > > > > @@ -2489,6 +2489,13 @@ was made to the GIMPLE stream.", > > > > > bool, (gimple_stmt_iterator *gsi), > > > > > hook_bool_gsiptr_false) > > > > > > > > > > +DEFHOOK > > > > > +(move_with_mode_p, > > > > > + "This target hook returns true if move with mode @var{mode} can > > > > > be\n\ > > > > > +generated implicitly. The default definition returns true.", > > > > > + bool, (machine_mode mode), > > > > > + hook_bool_mode_true) > > > > > + > > > > > /* Target hook is used to compare the target attributes in two > > > > > functions to > > > > > determine which function's features get higher priority. This is > > > > > used > > > > > during function multi-versioning to figure out the order in which > > > > > two > > > > > diff --git a/gcc/testsuite/gcc.target/i386/pr103393-1.c > > > > > b/gcc/testsuite/gcc.target/i386/pr103393-1.c > > > > > new file mode 100644 > > > > > index 00000000000..2091d33c45f > > > > > --- /dev/null > > > > > +++ b/gcc/testsuite/gcc.target/i386/pr103393-1.c > > > > > @@ -0,0 +1,16 @@ > > > > > +/* { dg-do compile } */ > > > > > +/* { dg-options "-O2 -march=x86-64 -mavx -mprefer-avx128 > > > > > -mprefer-vector-width=128" } */ > > > > > + > > > > > +struct TestData { > > > > > + float arr[8]; > > > > > +}; > > > > > + > > > > > +void > > > > > +cpy (struct TestData *s1, struct TestData *s2 ) > > > > > +{ > > > > > + for(int i=0; i<8; ++i) > > > > > + s1->arr[i] = s2->arr[i]; > > > > > +} > > > > > + > > > > > +/* { dg-final { scan-assembler "jmp\[\\t \]+_?memmove" { target { ! > > > > > ia32 } } } } */ > > > > > +/* { dg-final { scan-assembler "call\[\\t \]+_?memmove" { target > > > > > ia32 } } } */ > > > > > diff --git a/gcc/testsuite/gcc.target/i386/pr103393-2.c > > > > > b/gcc/testsuite/gcc.target/i386/pr103393-2.c > > > > > new file mode 100644 > > > > > index 00000000000..4ad8c8bf379 > > > > > --- /dev/null > > > > > +++ b/gcc/testsuite/gcc.target/i386/pr103393-2.c > > > > > @@ -0,0 +1,16 @@ > > > > > +/* { dg-do compile } */ > > > > > +/* { dg-options "-O2 -march=skylake-avx512" } */ > > > > > + > > > > > +struct TestData { > > > > > + float arr[8]; > > > > > +}; > > > > > + > > > > > +void > > > > > +cpy (struct TestData *s1, struct TestData *s2 ) > > > > > +{ > > > > > + for(int i=0; i<16; ++i) > > > > > + s1->arr[i] = s2->arr[i]; > > > > > +} > > > > > + > > > > > +/* { dg-final { scan-assembler "jmp\[\\t \]+_?memmove" { target { ! > > > > > ia32 } } } } */ > > > > > +/* { dg-final { scan-assembler "call\[\\t \]+_?memmove" { target > > > > > ia32 } } } */ > > > > > diff --git a/gcc/testsuite/gcc.target/i386/pr103393-3.c > > > > > b/gcc/testsuite/gcc.target/i386/pr103393-3.c > > > > > new file mode 100644 > > > > > index 00000000000..7a03160e512 > > > > > --- /dev/null > > > > > +++ b/gcc/testsuite/gcc.target/i386/pr103393-3.c > > > > > @@ -0,0 +1,16 @@ > > > > > +/* { dg-do compile } */ > > > > > +/* { dg-options "-O2 -march=sapphirerapids" } */ > > > > > + > > > > > +struct TestData { > > > > > + float arr[8]; > > > > > +}; > > > > > + > > > > > +void > > > > > +cpy (struct TestData *s1, struct TestData *s2 ) > > > > > +{ > > > > > + for(int i=0; i<16; ++i) > > > > > + s1->arr[i] = s2->arr[i]; > > > > > +} > > > > > + > > > > > +/* { dg-final { scan-assembler-times "vmovdqu64\[ > > > > > \\t\]+\[^\n\]*%zmm" 2 } } */ > > > > > +/* { dg-final { scan-assembler-not "\[\\t \]+_?memmove" } } */ > > > > > diff --git a/gcc/testsuite/gcc.target/i386/pr103393-4.c > > > > > b/gcc/testsuite/gcc.target/i386/pr103393-4.c > > > > > new file mode 100644 > > > > > index 00000000000..ae2599f6411 > > > > > --- /dev/null > > > > > +++ b/gcc/testsuite/gcc.target/i386/pr103393-4.c > > > > > @@ -0,0 +1,16 @@ > > > > > +/* { dg-do compile } */ > > > > > +/* { dg-options "-O2 -march=skylake-avx512 -mstore-max=512" } */ > > > > > + > > > > > +struct TestData { > > > > > + float arr[8]; > > > > > +}; > > > > > + > > > > > +void > > > > > +cpy (struct TestData *s1, struct TestData *s2 ) > > > > > +{ > > > > > + for(int i=0; i<16; ++i) > > > > > + s1->arr[i] = s2->arr[i]; > > > > > +} > > > > > + > > > > > +/* { dg-final { scan-assembler-times "vmovdqu64\[ > > > > > \\t\]+\[^\n\]*%zmm" 2 } } */ > > > > > +/* { dg-final { scan-assembler-not "\[\\t \]+_?memmove" } } */ > > > > > diff --git a/gcc/testsuite/gcc.target/i386/pr103393-5.c > > > > > b/gcc/testsuite/gcc.target/i386/pr103393-5.c > > > > > new file mode 100644 > > > > > index 00000000000..f9173684212 > > > > > --- /dev/null > > > > > +++ b/gcc/testsuite/gcc.target/i386/pr103393-5.c > > > > > @@ -0,0 +1,16 @@ > > > > > +/* { dg-do compile } */ > > > > > +/* { dg-options "-O2 -march=skylake-avx512 -mmove-max=512" } */ > > > > > + > > > > > +struct TestData { > > > > > + float arr[8]; > > > > > +}; > > > > > + > > > > > +void > > > > > +cpy (struct TestData *s1, struct TestData *s2 ) > > > > > +{ > > > > > + for(int i=0; i<16; ++i) > > > > > + s1->arr[i] = s2->arr[i]; > > > > > +} > > > > > + > > > > > +/* { dg-final { scan-assembler-times "vmovdqu64\[ > > > > > \\t\]+\[^\n\]*%zmm" 2 } } */ > > > > > +/* { dg-final { scan-assembler-not "\[\\t \]+_?memmove" } } */ > > > > > -- > > > > > 2.35.1 > > > > > > > > > > > > > > > > > -- > > > H.J. > > > > -- > H.J.