On Fri, Apr 23, 2021 at 1:35 AM H.J. Lu via Gcc-patches <gcc-patches@gcc.gnu.org> wrote:
>
> For op_by_pieces operations between two areas of memory on non-strict
> alignment target, add -foverlap-op-by-pieces=[off|on|max-memset] to
> generate overlapping operations to minimize number of operations if it
> is not a stack push which must not overlap.
>
> When operating on LENGTH bytes of memory, -foverlap-op-by-pieces=on
> starts with the widest usable integer size, MAX_SIZE, for LENGTH bytes
> and finishes with the smallest usable integer size, MIN_SIZE, for the
> remaining bytes where MAX_SIZE >= MIN_SIZE.  If MIN_SIZE > the remaining
> bytes, the last operation is performed on MIN_SIZE bytes of overlapping
> memory from the previous operation.
>
> For memset with non-zero byte, -foverlap-op-by-pieces=max-memset generates
> an overlapping fill with MAX_SIZE if the number of the remaining bytes is
> greater than one.
>
> Tested on Linux/x86-64 with both -foverlap-op-by-pieces enabled and
> disabled by default.

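To make the quoted description of -foverlap-op-by-pieces=on concrete, here is a minimal sketch of the piece planning it describes. It is a simplified model, not the code in expr.c: it assumes every power-of-two size up to max_size is usable, and it ignores alignment, reverse copies, stack pushes and the max-memset variant. The names plan_overlapping, struct piece, widest_size and smallest_size are made up for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* One planned memory operation: SIZE bytes starting at OFFSET.  */
struct piece { size_t offset; size_t size; };

/* Largest power of two that is <= n and <= max_size (n >= 1).  */
static size_t widest_size(size_t n, size_t max_size)
{
    size_t s = max_size;
    while (s > n)
        s >>= 1;
    return s;
}

/* Smallest power of two that is >= n (n >= 1).  */
static size_t smallest_size(size_t n)
{
    size_t s = 1;
    while (s < n)
        s <<= 1;
    return s;
}

/* Plan the operations for LEN bytes with a maximum op size of MAX_SIZE
   (assumed to be a power of two).  Writes at most CAP entries to OUT and
   returns the number of operations planned.  */
size_t plan_overlapping(size_t len, size_t max_size,
                        struct piece *out, size_t cap)
{
    size_t n = 0, offset = 0;

    assert(max_size != 0 && (max_size & (max_size - 1)) == 0);
    if (len == 0)
        return 0;

    /* Start with the widest size that still fits in LEN.  */
    size_t size = widest_size(len, max_size);
    while (len >= size) {
        assert(n < cap);
        out[n++] = (struct piece){ offset, size };
        offset += size;
        len -= size;
    }

    /* Cover the tail with one op of the smallest size that holds it,
       moved back so that it overlaps the previous op by GAP bytes.  */
    if (len > 0) {
        size = smallest_size(len);
        size_t gap = size - len;
        assert(n < cap);
        out[n++] = (struct piece){ offset - gap, size };
    }
    return n;
}
```

Under these assumptions a 15-byte copy with 8-byte ops is planned as two ops at offsets 0 and 7, and a 19-byte copy with 16-byte ops as a 16-byte op at offset 0 plus a 4-byte op at offset 15. Those are the offsets the pr90773-1.c and pr90773-2.c tests in the patch scan for on x86-64.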
Neither the user documentation nor the patch description tells me
what "generate overlapping operations" does.  I _suspect_ it's doing an
offset-adjusted read/write of the last piece of a memory region to
avoid doing more than one smaller operation.  Thus for a region of
size 7 and 4-byte granular ops you'd do operations at offsets 0 and 3
rather than a four-byte op at offset 0, a two-byte op at offset 4 and
a one-byte op at offset 6.

When the tail is of power-of-two size you still generate
non-overlapping ops?

For memmove there's a correctness issue, so you have to make sure to
first load the last two ops before performing the stores, which
increases register pressure.

I'm not sure we want a -f option to control this - not all targets will
be able to support this.  So I'd use a target hook, or rather extend the
existing use_by_pieces_infrastructure_p hook with an alternate return
(some flags bitmask I guess).  We do have one extra target hook,
compare_by_pieces_branch_ratio, so going by that, using an alternate
hook might also be OK.

Adding a -m option in targets that want this user-controllable would be
OK of course.

Richard.

> gcc/
>
>         PR middle-end/90773
>         * common.opt (-foverlap-op-by-pieces): New.
>         * expr.c (by_pieces_ninsns): If -foverlap-op-by-pieces is enabled,
>         round up size and alignment to the widest integer mode for maximum
>         size.
>         (op_by_pieces_d): Add get_usable_mode, m_push and
>         m_non_zero_memset.
>         (op_by_pieces_d::op_by_pieces_d): Add 2 bool arguments to
>         initialize m_push and m_non_zero_memset.
>         (op_by_pieces_d::get_usable_mode): New.
>         (op_by_pieces_d::run): Use get_usable_mode to get the largest
>         usable integer mode and generate overlapping operations for
>         -foverlap-op-by-pieces.
>         (PUSHG_P): New.
>         (move_by_pieces_d::move_by_pieces_d): Updated for op_by_pieces_d
>         change.
>         (store_by_pieces_d::store_by_pieces_d): Likewise.
>         (clear_by_pieces): Likewise.
>         * toplev.c (process_options): Issue an error when
>         -foverlap-op-by-pieces is used for strict alignment target.
>         * doc/invoke.texi: Document -foverlap-op-by-pieces.
>
> gcc/testsuite/
>
>         PR middle-end/90773
>         * g++.dg/pr90773-1.h: New test.
>         * g++.dg/pr90773-1a.C: Likewise.
>         * g++.dg/pr90773-1b.C: Likewise.
>         * g++.dg/pr90773-1c.C: Likewise.
>         * g++.dg/pr90773-1d.C: Likewise.
>         * gcc.target/i386/pr90773-1.c: Likewise.
>         * gcc.target/i386/pr90773-2.c: Likewise.
>         * gcc.target/i386/pr90773-3.c: Likewise.
>         * gcc.target/i386/pr90773-4.c: Likewise.
>         * gcc.target/i386/pr90773-5.c: Likewise.
>         * gcc.target/i386/pr90773-6.c: Likewise.
>         * gcc.target/i386/pr90773-7.c: Likewise.
>         * gcc.target/i386/pr90773-8.c: Likewise.
>         * gcc.target/i386/pr90773-9.c: Likewise.
>         * gcc.target/i386/pr90773-10.c: Likewise.
>         * gcc.target/i386/pr90773-11.c: Likewise.
> ---
>  gcc/common.opt                             |  19 +++
>  gcc/doc/invoke.texi                        |  14 ++
>  gcc/expr.c                                 | 159 ++++++++++++++++-----
>  gcc/testsuite/g++.dg/pr90773-1.h           |  14 ++
>  gcc/testsuite/g++.dg/pr90773-1a.C          |  13 ++
>  gcc/testsuite/g++.dg/pr90773-1b.C          |   5 +
>  gcc/testsuite/g++.dg/pr90773-1c.C          |   5 +
>  gcc/testsuite/g++.dg/pr90773-1d.C          |  19 +++
>  gcc/testsuite/gcc.target/i386/pr90773-1.c  |  17 +++
>  gcc/testsuite/gcc.target/i386/pr90773-10.c |  13 ++
>  gcc/testsuite/gcc.target/i386/pr90773-11.c |  13 ++
>  gcc/testsuite/gcc.target/i386/pr90773-2.c  |  20 +++
>  gcc/testsuite/gcc.target/i386/pr90773-3.c  |  23 +++
>  gcc/testsuite/gcc.target/i386/pr90773-4.c  |  13 ++
>  gcc/testsuite/gcc.target/i386/pr90773-5.c  |  13 ++
>  gcc/testsuite/gcc.target/i386/pr90773-6.c  |  11 ++
>  gcc/testsuite/gcc.target/i386/pr90773-7.c  |  11 ++
>  gcc/testsuite/gcc.target/i386/pr90773-8.c  |  13 ++
>  gcc/testsuite/gcc.target/i386/pr90773-9.c  |  13 ++
>  gcc/toplev.c                               |   8 ++
>  20 files changed, 383 insertions(+), 33 deletions(-)
>  create mode 100644 gcc/testsuite/g++.dg/pr90773-1.h
>  create mode 100644 gcc/testsuite/g++.dg/pr90773-1a.C
>  create mode 100644 gcc/testsuite/g++.dg/pr90773-1b.C
>  create mode 100644 gcc/testsuite/g++.dg/pr90773-1c.C
>  create mode 100644 gcc/testsuite/g++.dg/pr90773-1d.C
>  create mode 100644 gcc/testsuite/gcc.target/i386/pr90773-1.c
>  create mode 100644 gcc/testsuite/gcc.target/i386/pr90773-10.c
>  create mode 100644 gcc/testsuite/gcc.target/i386/pr90773-11.c
>  create mode 100644 gcc/testsuite/gcc.target/i386/pr90773-2.c
>  create mode 100644 gcc/testsuite/gcc.target/i386/pr90773-3.c
>  create mode 100644 gcc/testsuite/gcc.target/i386/pr90773-4.c
>  create mode 100644 gcc/testsuite/gcc.target/i386/pr90773-5.c
>  create mode 100644 gcc/testsuite/gcc.target/i386/pr90773-6.c
>  create mode 100644 gcc/testsuite/gcc.target/i386/pr90773-7.c
>  create mode 100644 gcc/testsuite/gcc.target/i386/pr90773-8.c
>  create mode 100644 gcc/testsuite/gcc.target/i386/pr90773-9.c
>
> diff --git a/gcc/common.opt b/gcc/common.opt
> index a75b44ee47e..7f5b38c7810 100644
> --- a/gcc/common.opt
> +++ b/gcc/common.opt
> @@ -2123,6 +2123,25 @@ foptimize-sibling-calls
>  Common Var(flag_optimize_sibling_calls) Optimization
>  Optimize sibling and tail recursive calls.
>
> +foverlap-op-by-pieces
> +Common RejectNegative Alias(foverlap-op-by-pieces=,on)
> +
> +foverlap-op-by-pieces=
> +Common Joined RejectNegative Enum(overlap_op_by_pieces) Var(flag_overlap_op_by_pieces) Init(0)
> +-foverlap-op-by-pieces=[off|on|max-memset] Generate overlapping operations between two areas of memory.
> +
> +Enum
> +Name(overlap_op_by_pieces) Type(int)
> +
> +EnumValue
> +Enum(overlap_op_by_pieces) String(off) Value(0)
> +
> +EnumValue
> +Enum(overlap_op_by_pieces) String(on) Value(1)
> +
> +EnumValue
> +Enum(overlap_op_by_pieces) String(max-memset) Value(2)
> +
>  fpartial-inlining
>  Common Var(flag_partial_inlining) Optimization
>  Perform partial inlining.
> diff --git a/gcc/doc/invoke.texi b/gcc/doc/invoke.texi
> index e98b0962b9f..dbdd1095216 100644
> --- a/gcc/doc/invoke.texi
> +++ b/gcc/doc/invoke.texi
> @@ -530,6 +530,7 @@ Objective-C and Objective-C++ Dialects}.
>  -fno-sched-spec -fno-signed-zeros @gol
>  -fno-toplevel-reorder -fno-trapping-math -fno-zero-initialized-in-bss @gol
>  -fomit-frame-pointer -foptimize-sibling-calls @gol
> +-foverlap-op-by-pieces=@r{[}off@r{|}on@r{|}max-memset@r{]} @gol
>  -fpartial-inlining -fpeel-loops -fpredictive-commoning @gol
>  -fprefetch-loop-arrays @gol
>  -fprofile-correction @gol
> @@ -10360,6 +10361,19 @@ their @code{_FORTIFY_SOURCE} counterparts into faster alternatives.
>
>  Enabled at levels @option{-O2}, @option{-O3}.
>
> +@item -foverlap-op-by-pieces=@r{[}off@r{|}on@r{|}max-memset@r{]}
> +@opindex -foverlap-op-by-pieces
> +The value @code{on} tells the compiler to generate overlapping
> +operations between two areas of memory by using the largest integer
> +operation to minimize number of operations if it is not a stack push.
> +The value @code{max-memset} tells the compiler to generate an
> +overlapping fill with non-zero byte in the maximum single fill size
> +if the last fill size is greater than one.  The value @code{off}
> +turns off this optimization.
> +
> +This option is only valid for targets which do not require strict
> +alignment.
> +
>  @item -fno-inline
>  @opindex fno-inline
>  @opindex finline
> diff --git a/gcc/expr.c b/gcc/expr.c
> index a0e19465965..375a5497309 100644
> --- a/gcc/expr.c
> +++ b/gcc/expr.c
> @@ -815,12 +815,27 @@ by_pieces_ninsns (unsigned HOST_WIDE_INT l, unsigned int align,
>                    unsigned int max_size, by_pieces_operation op)
>  {
>    unsigned HOST_WIDE_INT n_insns = 0;
> +  scalar_int_mode mode;
> +
> +  if (flag_overlap_op_by_pieces && op != COMPARE_BY_PIECES)
> +    {
> +      /* NB: Round up L and ALIGN to the widest integer mode for
> +         MAX_SIZE.  */
> +      mode = widest_int_mode_for_size (max_size);
> +      if (optab_handler (mov_optab, mode) != CODE_FOR_nothing)
> +        {
> +          unsigned HOST_WIDE_INT up = ROUND_UP (l, GET_MODE_SIZE (mode));
> +          if (up > l)
> +            l = up;
> +          align = GET_MODE_ALIGNMENT (mode);
> +        }
> +    }
>
>    align = alignment_for_piecewise_move (MOVE_MAX_PIECES, align);
>
>    while (max_size > 1 && l > 0)
>      {
> -      scalar_int_mode mode = widest_int_mode_for_size (max_size);
> +      mode = widest_int_mode_for_size (max_size);
>        enum insn_code icode;
>
>        unsigned int modesize = GET_MODE_SIZE (mode);
> @@ -1041,6 +1056,9 @@ pieces_addr::maybe_postinc (HOST_WIDE_INT size)
>
>  class op_by_pieces_d
>  {
> + private:
> +  scalar_int_mode get_usable_mode (scalar_int_mode mode, unsigned int);
> +
>   protected:
>    pieces_addr m_to, m_from;
>    unsigned HOST_WIDE_INT m_len;
> @@ -1048,6 +1066,10 @@ class op_by_pieces_d
>    unsigned int m_align;
>    unsigned int m_max_size;
>    bool m_reverse;
> +  /* True if this is a stack push.  */
> +  bool m_push;
> +  /* True if this is a memset with non-zero byte.  */
> +  bool m_non_zero_memset;
>
>    /* Virtual functions, overriden by derived classes for the specific
>       operation.  */
> @@ -1059,7 +1081,7 @@ class op_by_pieces_d
>
>   public:
>    op_by_pieces_d (rtx, bool, rtx, bool, by_pieces_constfn, void *,
> -                  unsigned HOST_WIDE_INT, unsigned int);
> +                  unsigned HOST_WIDE_INT, unsigned int, bool, bool);
>    void run ();
>  };
>
> @@ -1074,10 +1096,12 @@ op_by_pieces_d::op_by_pieces_d (rtx to, bool to_load,
>                                  by_pieces_constfn from_cfn,
>                                  void *from_cfn_data,
>                                  unsigned HOST_WIDE_INT len,
> -                                unsigned int align)
> +                                unsigned int align, bool push,
> +                                bool non_zero_memset)
>    : m_to (to, to_load, NULL, NULL),
>      m_from (from, from_load, from_cfn, from_cfn_data),
> -    m_len (len), m_max_size (MOVE_MAX_PIECES + 1)
> +    m_len (len), m_max_size (MOVE_MAX_PIECES + 1),
> +    m_push (push), m_non_zero_memset (non_zero_memset)
>  {
>    int toi = m_to.get_addr_inc ();
>    int fromi = m_from.get_addr_inc ();
> @@ -1108,6 +1132,25 @@ op_by_pieces_d::op_by_pieces_d (rtx to, bool to_load,
>    m_align = align;
>  }
>
> +/* This function returns the largest usable integer mode for LEN bytes
> +   whose size is no bigger than size of MODE.  */
> +
> +scalar_int_mode
> +op_by_pieces_d::get_usable_mode (scalar_int_mode mode, unsigned int len)
> +{
> +  unsigned int size;
> +  do
> +    {
> +      size = GET_MODE_SIZE (mode);
> +      if (len >= size && prepare_mode (mode, m_align))
> +        break;
> +      /* NB: widest_int_mode_for_size checks SIZE > 1.  */
> +      mode = widest_int_mode_for_size (size);
> +    }
> +  while (1);
> +  return mode;
> +}
> +
>  /* This function contains the main loop used for expanding a block
>     operation.  First move what we can in the largest integer mode,
>     then go to successively smaller modes.  For every access, call
> @@ -1116,42 +1159,80 @@ op_by_pieces_d::op_by_pieces_d (rtx to, bool to_load,
>  void
>  op_by_pieces_d::run ()
>  {
> -  while (m_max_size > 1 && m_len > 0)
> +  if (m_len == 0)
> +    return;
> +
> +  /* NB: widest_int_mode_for_size checks M_MAX_SIZE > 1.  */
> +  scalar_int_mode mode = widest_int_mode_for_size (m_max_size);
> +  mode = get_usable_mode (mode, m_len);
> +
> +  do
>      {
> -      scalar_int_mode mode = widest_int_mode_for_size (m_max_size);
> +      unsigned int size = GET_MODE_SIZE (mode);
> +      rtx to1 = NULL_RTX, from1;
>
> -      if (prepare_mode (mode, m_align))
> +      while (m_len >= size)
>          {
> -          unsigned int size = GET_MODE_SIZE (mode);
> -          rtx to1 = NULL_RTX, from1;
> +          if (m_reverse)
> +            m_offset -= size;
>
> -          while (m_len >= size)
> -            {
> -              if (m_reverse)
> -                m_offset -= size;
> +          to1 = m_to.adjust (mode, m_offset);
> +          from1 = m_from.adjust (mode, m_offset);
>
> -              to1 = m_to.adjust (mode, m_offset);
> -              from1 = m_from.adjust (mode, m_offset);
> +          m_to.maybe_predec (-(HOST_WIDE_INT)size);
> +          m_from.maybe_predec (-(HOST_WIDE_INT)size);
>
> -              m_to.maybe_predec (-(HOST_WIDE_INT)size);
> -              m_from.maybe_predec (-(HOST_WIDE_INT)size);
> +          generate (to1, from1, mode);
>
> -              generate (to1, from1, mode);
> +          m_to.maybe_postinc (size);
> +          m_from.maybe_postinc (size);
>
> -              m_to.maybe_postinc (size);
> -              m_from.maybe_postinc (size);
> +          if (!m_reverse)
> +            m_offset += size;
>
> -              if (!m_reverse)
> -                m_offset += size;
> +          m_len -= size;
> +        }
>
> -              m_len -= size;
> -            }
> +      finish_mode (mode);
>
> -          finish_mode (mode);
> -        }
> +      if (m_len == 0)
> +        return;
>
> -      m_max_size = GET_MODE_SIZE (mode);
> +      if (!m_push && flag_overlap_op_by_pieces)
> +        {
> +          /* NB: Generate overlapping operations if it is not a stack
> +             push since stack push must not overlap.  */
> +          if (m_len == 1
> +              || !m_non_zero_memset
> +              || flag_overlap_op_by_pieces < 2)
> +            {
> +              /* If the remaining length is 1, this is not memset with
> +                 non-zero byte or max-memset isn't enabled, get the
> +                 smallest integer mode for M_LEN bytes.  */
> +              mode = smallest_int_mode_for_size (m_len * BITS_PER_UNIT);
> +              mode = get_usable_mode (mode, GET_MODE_SIZE (mode));
> +            }
> +          int gap = GET_MODE_SIZE (mode) - m_len;
> +          if (gap > 0)
> +            {
> +              /* If size of MODE > M_LEN, generate the last operation
> +                 in MODE for the remaining bytes with overlapping memory
> +                 from the previous operation.  */
> +              if (m_reverse)
> +                m_offset += gap;
> +              else
> +                m_offset -= gap;
> +              m_len += gap;
> +            }
> +        }
> +      else
> +        {
> +          /* NB: widest_int_mode_for_size checks SIZE > 1.  */
> +          mode = widest_int_mode_for_size (size);
> +          mode = get_usable_mode (mode, m_len);
> +        }
>      }
> +  while (1);
>
>    /* The code above should have handled everything.  */
>    gcc_assert (!m_len);
> @@ -1160,6 +1241,12 @@ op_by_pieces_d::run ()
>  /* Derived class from op_by_pieces_d, providing support for block move
>     operations.  */
>
> +#ifdef PUSH_ROUNDING
> +#define PUSHG_P(to) ((to) == nullptr)
> +#else
> +#define PUSHG_P(to) false
> +#endif
> +
>  class move_by_pieces_d : public op_by_pieces_d
>  {
>    insn_gen_fn m_gen_fun;
> @@ -1169,7 +1256,8 @@ class move_by_pieces_d : public op_by_pieces_d
>   public:
>    move_by_pieces_d (rtx to, rtx from, unsigned HOST_WIDE_INT len,
>                      unsigned int align)
> -    : op_by_pieces_d (to, false, from, true, NULL, NULL, len, align)
> +    : op_by_pieces_d (to, false, from, true, NULL, NULL, len, align,
> +                      PUSHG_P (to), false)
>    {
>    }
>    rtx finish_retmode (memop_ret);
> @@ -1263,8 +1351,10 @@ class store_by_pieces_d : public op_by_pieces_d
>
>   public:
>    store_by_pieces_d (rtx to, by_pieces_constfn cfn, void *cfn_data,
> -                     unsigned HOST_WIDE_INT len, unsigned int align)
> -    : op_by_pieces_d (to, false, NULL_RTX, true, cfn, cfn_data, len, align)
> +                     unsigned HOST_WIDE_INT len, unsigned int align,
> +                     bool non_zero_memset)
> +    : op_by_pieces_d (to, false, NULL_RTX, true, cfn, cfn_data, len,
> +                      align, false, non_zero_memset)
>    {
>    }
>    rtx finish_retmode (memop_ret);
> @@ -1411,7 +1501,8 @@ store_by_pieces (rtx to, unsigned HOST_WIDE_INT len,
>                     memsetp ? SET_BY_PIECES : STORE_BY_PIECES,
>                     optimize_insn_for_speed_p ()));
>
> -  store_by_pieces_d data (to, constfun, constfundata, len, align);
> +  store_by_pieces_d data (to, constfun, constfundata, len, align,
> +                          memsetp);
>    data.run ();
>
>    if (retmode != RETURN_BEGIN)
> @@ -1438,7 +1529,8 @@ clear_by_pieces (rtx to, unsigned HOST_WIDE_INT len, unsigned int align)
>    if (len == 0)
>      return;
>
> -  store_by_pieces_d data (to, clear_by_pieces_1, NULL, len, align);
> +  store_by_pieces_d data (to, clear_by_pieces_1, NULL, len, align,
> +                          false);
>    data.run ();
>  }
>
> @@ -1460,7 +1552,8 @@ class compare_by_pieces_d : public op_by_pieces_d
>    compare_by_pieces_d (rtx op0, rtx op1, by_pieces_constfn op1_cfn,
>                         void *op1_cfn_data, HOST_WIDE_INT len, int align,
>                         rtx_code_label *fail_label)
> -    : op_by_pieces_d (op0, true, op1, true, op1_cfn, op1_cfn_data, len, align)
> +    : op_by_pieces_d (op0, true, op1, true, op1_cfn, op1_cfn_data, len,
> +                      align, false, false)
>    {
>      m_fail_label = fail_label;
>    }
> diff --git a/gcc/testsuite/g++.dg/pr90773-1.h b/gcc/testsuite/g++.dg/pr90773-1.h
> new file mode 100644
> index 00000000000..abdb78b078b
> --- /dev/null
> +++ b/gcc/testsuite/g++.dg/pr90773-1.h
> @@ -0,0 +1,14 @@
> +class fixed_wide_int_storage {
> +public:
> +  long val[10];
> +  int len;
> +  fixed_wide_int_storage ()
> +    {
> +      len = sizeof (val) / sizeof (val[0]);
> +      for (int i = 0; i < len; i++)
> +        val[i] = i;
> +    }
> +};
> +
> +extern void foo (fixed_wide_int_storage);
> +extern int record_increment(void);
> diff --git a/gcc/testsuite/g++.dg/pr90773-1a.C b/gcc/testsuite/g++.dg/pr90773-1a.C
> new file mode 100644
> index 00000000000..3ab8d929f74
> --- /dev/null
> +++ b/gcc/testsuite/g++.dg/pr90773-1a.C
> @@ -0,0 +1,13 @@
> +// { dg-do compile }
> +// { dg-options "-O2" }
> +// { dg-additional-options "-mno-avx -msse2 -mtune=skylake" { target { i?86-*-* x86_64-*-* } } }
> +
> +#include "pr90773-1.h"
> +
> +int
> +record_increment(void)
> +{
> +  fixed_wide_int_storage x;
> +  foo (x);
> +  return 0;
> +}
> diff --git a/gcc/testsuite/g++.dg/pr90773-1b.C b/gcc/testsuite/g++.dg/pr90773-1b.C
> new file mode 100644
> index 00000000000..9713b2dd612
> --- /dev/null
> +++ b/gcc/testsuite/g++.dg/pr90773-1b.C
> @@ -0,0 +1,5 @@
> +// { dg-do compile }
> +// { dg-options "-O2" }
> +// { dg-additional-options "-mno-avx512f -march=skylake" { target { i?86-*-* x86_64-*-* } } }
> +
> +#include "pr90773-1a.C"
> diff --git a/gcc/testsuite/g++.dg/pr90773-1c.C b/gcc/testsuite/g++.dg/pr90773-1c.C
> new file mode 100644
> index 00000000000..699357a88dc
> --- /dev/null
> +++ b/gcc/testsuite/g++.dg/pr90773-1c.C
> @@ -0,0 +1,5 @@
> +// { dg-do compile }
> +// { dg-options "-O2" }
> +// { dg-additional-options "-march=skylake-avx512" { target { i?86-*-* x86_64-*-* } } }
> +
> +#include "pr90773-1a.C"
> diff --git a/gcc/testsuite/g++.dg/pr90773-1d.C b/gcc/testsuite/g++.dg/pr90773-1d.C
> new file mode 100644
> index 00000000000..bf9d8543c1b
> --- /dev/null
> +++ b/gcc/testsuite/g++.dg/pr90773-1d.C
> @@ -0,0 +1,19 @@
> +// { dg-do run }
> +// { dg-options "-O2" }
> +// { dg-additional-options "-march=native" { target { i?86-*-* x86_64-*-* } } }
> +// { dg-additional-sources "pr90773-1a.C" }
> +
> +#include "pr90773-1.h"
> +
> +void
> +foo (fixed_wide_int_storage x)
> +{
> +  for (int i = 0; i < x.len; i++)
> +    if (x.val[i] != i)
> +      __builtin_abort ();
> +}
> +
> +int main ()
> +{
> +  return record_increment ();
> +}
> diff --git a/gcc/testsuite/gcc.target/i386/pr90773-1.c b/gcc/testsuite/gcc.target/i386/pr90773-1.c
> new file mode 100644
> index 00000000000..86fec27dad0
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/i386/pr90773-1.c
> @@ -0,0 +1,17 @@
> +/* { dg-do compile } */
> +/* { dg-options "-O2 -mtune=generic -foverlap-op-by-pieces" } */
> +
> +extern char *dst, *src;
> +
> +void
> +foo (void)
> +{
> +  __builtin_memcpy (dst, src, 15);
> +}
> +
> +/* { dg-final { scan-assembler-times "movq\[\\t \]+\\(%\[\^,\]+\\)," 1 { target { ! ia32 } } } } */
> +/* { dg-final { scan-assembler-times "movq\[\\t \]+7\\(%\[\^,\]+\\)," 1 { target { ! ia32 } } } } */
> +/* { dg-final { scan-assembler-times "movl\[\\t \]+\\(%\[\^,\]+\\)," 1 { target ia32 } } } */
> +/* { dg-final { scan-assembler-times "movl\[\\t \]+4\\(%\[\^,\]+\\)," 1 { target ia32 } } } */
> +/* { dg-final { scan-assembler-times "movl\[\\t \]+8\\(%\[\^,\]+\\)," 1 { target ia32 } } } */
> +/* { dg-final { scan-assembler-times "movl\[\\t \]+11\\(%\[\^,\]+\\)," 1 { target ia32 } } } */
> diff --git a/gcc/testsuite/gcc.target/i386/pr90773-10.c b/gcc/testsuite/gcc.target/i386/pr90773-10.c
> new file mode 100644
> index 00000000000..5985877cc10
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/i386/pr90773-10.c
> @@ -0,0 +1,13 @@
> +/* { dg-do compile } */
> +/* { dg-options "-O2 -mtune=generic -foverlap-op-by-pieces=max-memset" } */
> +
> +extern char *dst;
> +
> +void
> +foo (int c)
> +{
> +  __builtin_memset (dst, c, 5);
> +}
> +
> +/* { dg-final { scan-assembler-times "movl\[\\t \]+.+, \\(%\[\^,\]+\\)" 1 } } */
> +/* { dg-final { scan-assembler-times "movb\[\\t \]+.+, 4\\(%\[\^,\]+\\)" 1 } } */
> diff --git a/gcc/testsuite/gcc.target/i386/pr90773-11.c b/gcc/testsuite/gcc.target/i386/pr90773-11.c
> new file mode 100644
> index 00000000000..9bf57aa3a44
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/i386/pr90773-11.c
> @@ -0,0 +1,13 @@
> +/* { dg-do compile } */
> +/* { dg-options "-O2 -mtune=generic -foverlap-op-by-pieces=max-memset" } */
> +
> +extern char *dst;
> +
> +void
> +foo (int c)
> +{
> +  __builtin_memset (dst, c, 6);
> +}
> +
> +/* { dg-final { scan-assembler-times "movl\[\\t \]+.+, \\(%\[\^,\]+\\)" 1 } } */
> +/* { dg-final { scan-assembler-times "movl\[\\t \]+.+, 2\\(%\[\^,\]+\\)" 1 } } */
> diff --git a/gcc/testsuite/gcc.target/i386/pr90773-2.c b/gcc/testsuite/gcc.target/i386/pr90773-2.c
> new file mode 100644
> index 00000000000..ebdf9dac6e8
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/i386/pr90773-2.c
> @@ -0,0 +1,20 @@
> +/* { dg-do compile } */
> +/* { dg-options "-O2 -mtune=generic -foverlap-op-by-pieces" } */
> +/* { dg-additional-options "-mno-avx -msse2" { target { ! ia32 } } } */
> +/* { dg-additional-options "-mno-sse" { target ia32 } } */
> +
> +extern char *dst, *src;
> +
> +void
> +foo (void)
> +{
> +  __builtin_memcpy (dst, src, 19);
> +}
> +
> +/* { dg-final { scan-assembler-times "movdqu\[\\t \]+\\(%\[\^,\]+\\)," 1 { target { ! ia32 } } } } */
> +/* { dg-final { scan-assembler-times "movl\[\\t \]+15\\(%\[\^,\]+\\)," 1 { target { ! ia32 } } } } */
> +/* { dg-final { scan-assembler-times "movl\[\\t \]+\\(%\[\^,\]+\\)," 1 { target ia32 } } } */
> +/* { dg-final { scan-assembler-times "movl\[\\t \]+4\\(%\[\^,\]+\\)," 1 { target ia32 } } } */
> +/* { dg-final { scan-assembler-times "movl\[\\t \]+8\\(%\[\^,\]+\\)," 1 { target ia32 } } } */
> +/* { dg-final { scan-assembler-times "movl\[\\t \]+12\\(%\[\^,\]+\\)," 1 { target ia32 } } } */
> +/* { dg-final { scan-assembler-times "movl\[\\t \]+15\\(%\[\^,\]+\\)," 1 { target ia32 } } } */
> diff --git a/gcc/testsuite/gcc.target/i386/pr90773-3.c b/gcc/testsuite/gcc.target/i386/pr90773-3.c
> new file mode 100644
> index 00000000000..d876f878f60
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/i386/pr90773-3.c
> @@ -0,0 +1,23 @@
> +/* { dg-do compile } */
> +/* { dg-options "-O2 -mtune=generic -foverlap-op-by-pieces" } */
> +/* { dg-additional-options "-mno-avx -msse2" { target { ! ia32 } } } */
> +/* { dg-additional-options "-mno-sse" { target ia32 } } */
> +
> +extern char *dst, *src;
> +
> +void
> +foo (void)
> +{
> +  __builtin_memcpy (dst, src, 31);
> +}
> +
> +/* { dg-final { scan-assembler-times "movdqu\[\\t \]+\\(%\[\^,\]+\\)," 1 { target { ! ia32 } } } } */
> +/* { dg-final { scan-assembler-times "movdqu\[\\t \]+15\\(%\[\^,\]+\\)," 1 { target { ! ia32 } } } } */
> +/* { dg-final { scan-assembler-times "movl\[\\t \]+\\(%\[\^,\]+\\)," 1 { target ia32 } } } */
> +/* { dg-final { scan-assembler-times "movl\[\\t \]+4\\(%\[\^,\]+\\)," 1 { target ia32 } } } */
> +/* { dg-final { scan-assembler-times "movl\[\\t \]+8\\(%\[\^,\]+\\)," 1 { target ia32 } } } */
> +/* { dg-final { scan-assembler-times "movl\[\\t \]+12\\(%\[\^,\]+\\)," 1 { target ia32 } } } */
> +/* { dg-final { scan-assembler-times "movl\[\\t \]+16\\(%\[\^,\]+\\)," 1 { target ia32 } } } */
> +/* { dg-final { scan-assembler-times "movl\[\\t \]+20\\(%\[\^,\]+\\)," 1 { target ia32 } } } */
> +/* { dg-final { scan-assembler-times "movl\[\\t \]+24\\(%\[\^,\]+\\)," 1 { target ia32 } } } */
> +/* { dg-final { scan-assembler-times "movl\[\\t \]+27\\(%\[\^,\]+\\)," 1 { target ia32 } } } */
> diff --git a/gcc/testsuite/gcc.target/i386/pr90773-4.c b/gcc/testsuite/gcc.target/i386/pr90773-4.c
> new file mode 100644
> index 00000000000..0df1b2fc247
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/i386/pr90773-4.c
> @@ -0,0 +1,13 @@
> +/* { dg-do compile { target { ! ia32 } } } */
> +/* { dg-options "-O2 -mno-avx -msse2 -mtune=generic -foverlap-op-by-pieces" } */
> +
> +extern char *dst;
> +
> +void
> +foo (void)
> +{
> +  __builtin_memset (dst, 0, 31);
> +}
> +
> +/* { dg-final { scan-assembler-times "movups\[\\t \]+%xmm\[0-9\]+, \\(%\[\^,\]+\\)" 1 } } */
> +/* { dg-final { scan-assembler-times "movups\[\\t \]+%xmm\[0-9\]+, 15\\(%\[\^,\]+\\)" 1 } } */
> diff --git a/gcc/testsuite/gcc.target/i386/pr90773-5.c b/gcc/testsuite/gcc.target/i386/pr90773-5.c
> new file mode 100644
> index 00000000000..65c9fe88696
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/i386/pr90773-5.c
> @@ -0,0 +1,13 @@
> +/* { dg-do compile { target { ! ia32 } } } */
> +/* { dg-options "-O2 -mno-avx -msse2 -mtune=generic -foverlap-op-by-pieces" } */
> +
> +extern char *dst;
> +
> +void
> +foo (void)
> +{
> +  __builtin_memset (dst, 0, 21);
> +}
> +
> +/* { dg-final { scan-assembler-times "movups\[\\t \]+%xmm\[0-9\]+, \\(%\[\^,\]+\\)" 1 } } */
> +/* { dg-final { scan-assembler-times "movq\[\\t \]+\\\$0+, 13\\(%\[\^,\]+\\)" 1 } } */
> diff --git a/gcc/testsuite/gcc.target/i386/pr90773-6.c b/gcc/testsuite/gcc.target/i386/pr90773-6.c
> new file mode 100644
> index 00000000000..0c84d492974
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/i386/pr90773-6.c
> @@ -0,0 +1,11 @@
> +/* { dg-do compile { target { ! ia32 } } } */
> +/* { dg-options "-O2 -mno-avx -msse2 -mtune=generic -foverlap-op-by-pieces" } */
> +
> +void
> +foo (char *dst, char *src)
> +{
> +  __builtin_memcpy (dst, src, 255);
> +}
> +
> +/* { dg-final { scan-assembler-times "movdqu\[\\t \]+\[0-9\]*\\(%\[\^,\]+\\)," 16 } } */
> +/* { dg-final { scan-assembler-not "mov\[bwlq\]" } } */
> diff --git a/gcc/testsuite/gcc.target/i386/pr90773-7.c b/gcc/testsuite/gcc.target/i386/pr90773-7.c
> new file mode 100644
> index 00000000000..732b4d3d992
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/i386/pr90773-7.c
> @@ -0,0 +1,11 @@
> +/* { dg-do compile { target { ! ia32 } } } */
> +/* { dg-options "-O2 -mno-avx -msse2 -mtune=skylake -foverlap-op-by-pieces" } */
> +
> +void
> +foo (char *dst)
> +{
> +  __builtin_memset (dst, 0, 255);
> +}
> +
> +/* { dg-final { scan-assembler-times "movups\[\\t \]+%xmm\[0-9\]+, \[0-9\]*\\(%\[\^,\]+\\)" 16 } } */
> +/* { dg-final { scan-assembler-not "mov\[bwlq\]" } } */
> diff --git a/gcc/testsuite/gcc.target/i386/pr90773-8.c b/gcc/testsuite/gcc.target/i386/pr90773-8.c
> new file mode 100644
> index 00000000000..7ff5ba12daf
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/i386/pr90773-8.c
> @@ -0,0 +1,13 @@
> +/* { dg-do compile } */
> +/* { dg-options "-O2 -mtune=generic -foverlap-op-by-pieces=max-memset" } */
> +
> +extern char *dst;
> +
> +void
> +foo (void)
> +{
> +  __builtin_memset (dst, 0, 5);
> +}
> +
> +/* { dg-final { scan-assembler-times "movl\[\\t \]+.+, \\(%\[\^,\]+\\)" 1 } } */
> +/* { dg-final { scan-assembler-times "movb\[\\t \]+.+, 4\\(%\[\^,\]+\\)" 1 } } */
> diff --git a/gcc/testsuite/gcc.target/i386/pr90773-9.c b/gcc/testsuite/gcc.target/i386/pr90773-9.c
> new file mode 100644
> index 00000000000..c2fc3ba59a7
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/i386/pr90773-9.c
> @@ -0,0 +1,13 @@
> +/* { dg-do compile } */
> +/* { dg-options "-O2 -mtune=generic -foverlap-op-by-pieces=max-memset" } */
> +
> +extern char *dst;
> +
> +void
> +foo (void)
> +{
> +  __builtin_memset (dst, 0, 6);
> +}
> +
> +/* { dg-final { scan-assembler-times "movl\[\\t \]+.+, \\(%\[\^,\]+\\)" 1 } } */
> +/* { dg-final { scan-assembler-times "movw\[\\t \]+.+, 4\\(%\[\^,\]+\\)" 1 } } */
> diff --git a/gcc/toplev.c b/gcc/toplev.c
> index d8cc254adef..23c88c788a2 100644
> --- a/gcc/toplev.c
> +++ b/gcc/toplev.c
> @@ -1323,6 +1323,14 @@ process_options (void)
>      }
>      }
>
> +  if (flag_overlap_op_by_pieces && STRICT_ALIGNMENT)
> +    {
> +      error_at (UNKNOWN_LOCATION,
> +                "%<-foverlap-op-by-pieces%> is not supported for "
> +                "strict alignment target");
> +      flag_overlap_op_by_pieces = 0;
> +    }
> +
>    /* One region RA really helps to decrease the code size.  */
>    if (flag_ira_region == IRA_REGION_AUTODETECT)
>      flag_ira_region
> --
> 2.30.2
>
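
The memmove concern raised in the reply can be sketched for the size-7 example with two overlapping 4-byte operations at offsets 0 and 3. This is a hedged illustration in plain C, not what the patch or GCC emits: move7 is a hypothetical helper, and memcpy to and from local variables stands in for unaligned 4-byte loads and stores. Both words are loaded before either is stored, so the result is correct even when dst and src overlap, at the cost of keeping two values live at once.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Move exactly 7 bytes from SRC to DST, where the two regions may
   overlap, using two overlapping 4-byte operations.  */
void move7(void *dst, const void *src)
{
    uint32_t head, tail;

    assert(dst != NULL && src != NULL);

    /* Load both pieces first: bytes 0..3 and bytes 3..6 of SRC.  */
    memcpy(&head, src, 4);
    memcpy(&tail, (const char *)src + 3, 4);

    /* Only now store.  Byte 3 is written twice with the same value.  */
    memcpy(dst, &head, 4);
    memcpy((char *)dst + 3, &tail, 4);
}
```

If the first store were issued before the second load, a destination that starts inside the source could overwrite bytes 3..6 of the source before they are read.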