On Thu, May 20, 2021 at 11:57 PM Richard Biener <richard.guent...@gmail.com> wrote: > > On Thu, May 20, 2021 at 4:04 PM H.J. Lu <hjl.to...@gmail.com> wrote: > > > > On Thu, May 20, 2021 at 12:51 AM Richard Biener > > <richard.guent...@gmail.com> wrote: > > > > > > On Wed, May 19, 2021 at 3:22 PM H.J. Lu <hjl.to...@gmail.com> wrote: > > > > > > > > On Wed, May 19, 2021 at 2:33 AM Richard Biener > > > > <richard.guent...@gmail.com> wrote: > > > > > > > > > > On Tue, May 18, 2021 at 9:16 PM H.J. Lu <hjl.to...@gmail.com> wrote: > > > > > > > > > > > > When expanding a constant constructor, don't call > > > > > > expand_constructor if > > > > > > it is more efficient to load the data from the memory via move by > > > > > > pieces. > > > > > > > > > > > > gcc/ > > > > > > > > > > > > PR middle-end/90773 > > > > > > * expr.c (expand_expr_real_1): Don't call > > > > > > expand_constructor if > > > > > > it is more efficient to load the data from the memory. > > > > > > > > > > > > gcc/testsuite/ > > > > > > > > > > > > PR middle-end/90773 > > > > > > * gcc.target/i386/pr90773-24.c: New test. > > > > > > * gcc.target/i386/pr90773-25.c: Likewise. > > > > > > --- > > > > > > gcc/expr.c | 10 ++++++++++ > > > > > > gcc/testsuite/gcc.target/i386/pr90773-24.c | 22 > > > > > > ++++++++++++++++++++++ > > > > > > gcc/testsuite/gcc.target/i386/pr90773-25.c | 20 > > > > > > ++++++++++++++++++++ > > > > > > 3 files changed, 52 insertions(+) > > > > > > create mode 100644 gcc/testsuite/gcc.target/i386/pr90773-24.c > > > > > > create mode 100644 gcc/testsuite/gcc.target/i386/pr90773-25.c > > > > > > > > > > > > diff --git a/gcc/expr.c b/gcc/expr.c > > > > > > index d09ee42e262..80e01ea1cbe 100644 > > > > > > --- a/gcc/expr.c > > > > > > +++ b/gcc/expr.c > > > > > > @@ -10886,6 +10886,16 @@ expand_expr_real_1 (tree exp, rtx target, > > > > > > machine_mode tmode, > > > > > > unsigned HOST_WIDE_INT ix; > > > > > > tree field, value; > > > > > > > > > > > > + /* Check if it is more efficient to load the data > > > > > > from > > > > > > + the memory directly. FIXME: How many stores do > > > > > > we > > > > > > + need here if not moved by pieces? */ > > > > > > + unsigned HOST_WIDE_INT bytes > > > > > > + = tree_to_uhwi (TYPE_SIZE_UNIT (type)); > > > > > > > > > > that's prone to fail - it could be a VLA. > > > > > > > > What do you mean by fail? Is it ICE or missed optimization? > > > > Do you have a testcase? > > > > > > > > > > > > > > > + if ((bytes / UNITS_PER_WORD) > 2 > > > > > > + && MOVE_MAX_PIECES > UNITS_PER_WORD > > > > > > + && can_move_by_pieces (bytes, TYPE_ALIGN > > > > > > (type))) > > > > > > + goto normal_inner_ref; > > > > > > + > > > > > > > > > > It looks like you're concerned about aggregate copies but this also > > > > > handles > > > > > non-aggregates (which on GIMPLE might already be optimized of course). > > > > > > > > Here I check if we copy more than 2 words and we can move more than > > > > a word in a single instruction. > > > > > > > > > Also you say "if it's cheaper" but I see no cost considerations. How > > > > > do > > > > > we generally handle immed const vs. load from constant pool costs? > > > > > > > > This trades 2 (update to 8) stores with one load plus one store. Is > > > > there > > > > a way to check which one is faster? > > > > > > I'm not sure - it depends on whether the target can do stores from > > > immediates > > > at all or what restrictions apply, what the immediate value actually is > > > (zero or all-ones should be way cheaper than sth arbitrary) and how the > > > pressure on the load unit is. can_move_by_pieces (bytes, TYPE_ALIGN > > > (type)) > > > also does not guarantee it will actually move pieces larger than > > > UNITS_PER_WORD, > > > that might depend on alignment. There's by_pieces_ninsns that might > > > provide > > > some hint here. > > > > > > I'm sure it works well for x86. > > > > > > I wonder if the existing code is in the appropriate place and we > > > shouldn't instead > > > handle this somewhere upthread where we ask to copy 'exp' into some other > > > memory location. For your testcase that's expand_assignment but I can > > > imagine passing array[0] by value to a function resulting in similar > > > copying. > > > Testing that shows we get > > > > > > pushq array+56(%rip) > > > .cfi_def_cfa_offset 24 > > > pushq array+48(%rip) > > > .cfi_def_cfa_offset 32 > > > pushq array+40(%rip) > > > .cfi_def_cfa_offset 40 > > > pushq array+32(%rip) > > > .cfi_def_cfa_offset 48 > > > pushq array+24(%rip) > > > .cfi_def_cfa_offset 56 > > > pushq array+16(%rip) > > > .cfi_def_cfa_offset 64 > > > pushq array+8(%rip) > > > .cfi_def_cfa_offset 72 > > > pushq array(%rip) > > > .cfi_def_cfa_offset 80 > > > call bar > > > > > > for that. We do have the by-pieces infrastructure to generally do this > > > kind of > > > copying but in both of these cases we do not seem to use it. I also > > > wonder > > > if the by-pieces infrastructure can pick up constant initializers > > > automagically > > > (we could native_encode the initializer part and feed the by-pieces > > > infrastructure with an array of bytes). There for example might be easy > > > to > > > immediate-store byte parts and difficult ones where we could decide on a > > > case-by-case basis whether to load+store or immediate-store them. > > > > I opened: > > > > https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100704 > > > > > For example if I change your testcase to have the array[] initializer > > > all-zero we currently emit > > > > > > pxor %xmm0, %xmm0 > > > movups %xmm0, (%rdi) > > > movups %xmm0, 16(%rdi) > > > movups %xmm0, 32(%rdi) > > > movups %xmm0, 48(%rdi) > > > ret > > > > > > will your patch cause us to emit 4 loads? OTHO if I do > > > > > > const struct S array[] = { > > > { 0, 0, 0, 7241, 124764, 48, 16, 33, 10, 96, 2, 0, 0, 4 } > > > }; > > > > > > we get > > > > > > movq $0, (%rdi) > > > movl $0, 8(%rdi) > > > movl $0, 12(%rdi) > > > movl $7241, 16(%rdi) > > > ... > > > > > > ideally we'd have sth like > > > > > > pxor %xmm0, %xmm0 > > > movups %xmm0, (%rdi) > > > movaps array+16(%rip), %xmm0 > > > movups %xmm0, 16(%rdi) > > > ... > > > > > > thus have the zeros written as immediates and the remaining pieces > > > with load+stores. > > > > > > The by-pieces infrastructure eventually get's to see > > > > > > (mem/u/c:BLK (symbol_ref:DI ("array") [flags 0x2] <var_decl > > > 0x7ffff7ff5b40 array>) [1 array+0 S64 A256]) > > > > > > where the MEM_EXPR should provide a way to access the constant > > > initializer. > > > > > > That said I do agree the current code is a bit premature optimization > > > - but maybe > > > it should be fend off in expand_constructor which has the cheap > > > clear_storage > > > first and which already does check can_move_by_pieces with some > > > heuristics, > > > but that seems to be guarded by > > > > > > || (tree_fits_uhwi_p (TYPE_SIZE_UNIT (type)) > > > && (! can_move_by_pieces > > > (tree_to_uhwi (TYPE_SIZE_UNIT (type)), > > > TYPE_ALIGN (type))) > > > && ! mostly_zeros_p (exp)))) > > > > > > which is odd (we _can_ move by pieces, but how does this apply to > > > TREE_CONSTANT CTORs and avoid_temp_mem?). > > > > > > That said, I wonder if we want to elide expand_constructor when the > > > CTOR is TREE_STATIC && TREE_CONSTANT and !mostly_zeros_p > > > and we can_move_by_pieces. > > > > > > So sth like > > > > > > diff --git a/gcc/expr.c b/gcc/expr.c > > > index 7139545d543..76b3bdf0c01 100644 > > > --- a/gcc/expr.c > > > +++ b/gcc/expr.c > > > @@ -8504,6 +8504,12 @@ expand_constructor (tree exp, rtx target, enum > > > expand_modifier modifier, > > > && (! can_move_by_pieces > > > (tree_to_uhwi (TYPE_SIZE_UNIT (type)), > > > TYPE_ALIGN (type))) > > > + && ! mostly_zeros_p (exp)) > > > + || (TREE_CONSTANT (exp) > > > + && tree_fits_uhwi_p (TYPE_SIZE_UNIT (type)) > > > + && (can_move_by_pieces > > > + (tree_to_uhwi (TYPE_SIZE_UNIT (type)), > > > + TYPE_ALIGN (type))) > > > && ! mostly_zeros_p (exp)))) > > > || ((modifier == EXPAND_INITIALIZER || modifier == > > > EXPAND_CONST_ADDRESS) > > > && TREE_CONSTANT (exp))) > > > > > > which handles your initializer and the all-zero one optimal? > > > > > > > It works. Here is the updated patch. > > So just looking at the code again I think we probably want to add > && avoid_temp_mem here, at least that's the case we're looking > at. Not sure if we ever arrive with TREE_CONSTANT CTORs > and !avoid_temp_mem but if so we'd create a temporary here > which of course would be pointless. > > So maybe it's then clearer to split the condition out as > > diff --git a/gcc/expr.c b/gcc/expr.c > index 7139545d543..ee8f25f9abd 100644 > --- a/gcc/expr.c > +++ b/gcc/expr.c > @@ -8523,6 +8523,19 @@ expand_constructor (tree exp, rtx target, enum > expand_modifier modifier, > return constructor; > } > > + /* If the CTOR is available in static storage and not mostly > + zeros and we can move it by pieces prefer to do so since > + that's usually more efficient than performing a series of > + stores from immediates. */ > + if (avoid_temp_mem > + && TREE_STATIC (exp) > + && TREE_CONSTANT (exp) > + && tree_fits_uhwi_p (TYPE_SIZE_UNIT (type)) > + && can_move_by_pieces (tree_to_uhwi (TYPE_SIZE_UNIT (type)), > + TYPE_ALIGN (type)) > + && ! mostly_zeros_p (exp)) > + return NULL_RTX; > + > /* Handle calls that pass values in multiple non-contiguous > locations. The Irix 6 ABI has examples of this. */ > if (target == 0 || ! safe_from_p (target, exp, 1) > > > OK with that change. >
This is the patch I am checking in after testing. Thanks. -- H.J.
From ca3e15ff233230f2450478459a41e6a93ec42df6 Mon Sep 17 00:00:00 2001 From: "H.J. Lu" <hjl.to...@gmail.com> Date: Fri, 21 May 2021 05:16:20 -0700 Subject: [PATCH] Elide expand_constructor if move by pieces is preferred Elide expand_constructor when the constructor is static storage and not mostly zeros and we can move it by pieces prefer to do so since that's usually more efficient than performing a series of stores from immediates. 2021-05-21 Richard Biener <rguent...@suse.de> H.J. Lu <hjl.to...@gmail.com> gcc/ PR middle-end/90773 * expr.c (expand_constructor): Elide expand_constructor if move by pieces is preferred. gcc/testsuite/ * gcc.target/i386/pr90773-24.c: New test. * gcc.target/i386/pr90773-25.c: Likewise. --- gcc/expr.c | 13 +++++++++++ gcc/testsuite/gcc.target/i386/pr90773-24.c | 23 ++++++++++++++++++++ gcc/testsuite/gcc.target/i386/pr90773-25.c | 25 ++++++++++++++++++++++ 3 files changed, 61 insertions(+) create mode 100644 gcc/testsuite/gcc.target/i386/pr90773-24.c create mode 100644 gcc/testsuite/gcc.target/i386/pr90773-25.c diff --git a/gcc/expr.c b/gcc/expr.c index d09ee42e262..ba61eb98b3b 100644 --- a/gcc/expr.c +++ b/gcc/expr.c @@ -8523,6 +8523,19 @@ expand_constructor (tree exp, rtx target, enum expand_modifier modifier, return constructor; } + /* If the CTOR is available in static storage and not mostly + zeros and we can move it by pieces prefer to do so since + that's usually more efficient than performing a series of + stores from immediates. */ + if (avoid_temp_mem + && TREE_STATIC (exp) + && TREE_CONSTANT (exp) + && tree_fits_uhwi_p (TYPE_SIZE_UNIT (type)) + && can_move_by_pieces (tree_to_uhwi (TYPE_SIZE_UNIT (type)), + TYPE_ALIGN (type)) + && ! mostly_zeros_p (exp)) + return NULL_RTX; + /* Handle calls that pass values in multiple non-contiguous locations. The Irix 6 ABI has examples of this. */ if (target == 0 || ! safe_from_p (target, exp, 1) diff --git a/gcc/testsuite/gcc.target/i386/pr90773-24.c b/gcc/testsuite/gcc.target/i386/pr90773-24.c new file mode 100644 index 00000000000..7b2ea66dcfc --- /dev/null +++ b/gcc/testsuite/gcc.target/i386/pr90773-24.c @@ -0,0 +1,23 @@ +/* { dg-do compile { target { ! ia32 } } } */ +/* { dg-options "-O2 -march=x86-64" } */ + +struct S +{ + long long s1 __attribute__ ((aligned (8))); + unsigned s2, s3, s4, s5, s6, s7, s8, s9, s10, s11, s12, s13, s14; +}; + +const struct S array[] = { + { 0, 60, 640, 2112543726, 39682, 48, 16, 33, 10, 96, 2, 0, 0, 4 } +}; + +void +foo (struct S *x) +{ + x[0] = array[0]; +} + +/* { dg-final { scan-assembler-times "movups\[\\t \]%xmm\[0-9\]+, \\(%\[\^,\]+\\)" 1 } } */ +/* { dg-final { scan-assembler-times "movups\[\\t \]%xmm\[0-9\]+, 16\\(%\[\^,\]+\\)" 1 } } */ +/* { dg-final { scan-assembler-times "movups\[\\t \]%xmm\[0-9\]+, 32\\(%\[\^,\]+\\)" 1 } } */ +/* { dg-final { scan-assembler-times "movups\[\\t \]%xmm\[0-9\]+, 48\\(%\[\^,\]+\\)" 1 } } */ diff --git a/gcc/testsuite/gcc.target/i386/pr90773-25.c b/gcc/testsuite/gcc.target/i386/pr90773-25.c new file mode 100644 index 00000000000..57642ea8d2d --- /dev/null +++ b/gcc/testsuite/gcc.target/i386/pr90773-25.c @@ -0,0 +1,25 @@ +/* { dg-do compile { target { ! ia32 } } } */ +/* { dg-options "-O2 -march=x86-64" } */ + +struct S +{ + long long s1 __attribute__ ((aligned (8))); + unsigned s2, s3, s4, s5, s6, s7, s8, s9, s10, s11, s12, s13, s14; +}; + +const struct S array[] = { + { 0, } +}; + +void +foo (struct S *x) +{ + x[0] = array[0]; +} + +/* { dg-final { scan-assembler-not "movdqa" } } */ +/* { dg-final { scan-assembler-times "pxor\[\\t \]%xmm\[0-9\]+, %xmm\[0-9\]+" 1 } } */ +/* { dg-final { scan-assembler-times "movups\[\\t \]%xmm\[0-9\]+, \\(%\[\^,\]+\\)" 1 } } */ +/* { dg-final { scan-assembler-times "movups\[\\t \]%xmm\[0-9\]+, 16\\(%\[\^,\]+\\)" 1 } } */ +/* { dg-final { scan-assembler-times "movups\[\\t \]%xmm\[0-9\]+, 32\\(%\[\^,\]+\\)" 1 } } */ +/* { dg-final { scan-assembler-times "movups\[\\t \]%xmm\[0-9\]+, 48\\(%\[\^,\]+\\)" 1 } } */ -- 2.31.1