On Fri, May 21, 2021 at 12:30 AM Bernd Edlinger <bernd.edlin...@hotmail.de> wrote:
>
>
>
> On 5/21/21 8:57 AM, Richard Biener wrote:
> > On Thu, May 20, 2021 at 4:04 PM H.J. Lu <hjl.to...@gmail.com> wrote:
> >>
> >> On Thu, May 20, 2021 at 12:51 AM Richard Biener
> >> <richard.guent...@gmail.com> wrote:
> >>>
> >>> On Wed, May 19, 2021 at 3:22 PM H.J. Lu <hjl.to...@gmail.com> wrote:
> >>>>
> >>>> On Wed, May 19, 2021 at 2:33 AM Richard Biener
> >>>> <richard.guent...@gmail.com> wrote:
> >>>>>
> >>>>> On Tue, May 18, 2021 at 9:16 PM H.J. Lu <hjl.to...@gmail.com> wrote:
> >>>>>>
> >>>>>> When expanding a constant constructor, don't call expand_constructor if
> >>>>>> it is more efficient to load the data from the memory via move by
> >>>>>> pieces.
> >>>>>>
> >>>>>> gcc/
> >>>>>>
> >>>>>>         PR middle-end/90773
> >>>>>>         * expr.c (expand_expr_real_1): Don't call expand_constructor if
> >>>>>>         it is more efficient to load the data from the memory.
> >>>>>>
> >>>>>> gcc/testsuite/
> >>>>>>
> >>>>>>         PR middle-end/90773
> >>>>>>         * gcc.target/i386/pr90773-24.c: New test.
> >>>>>>         * gcc.target/i386/pr90773-25.c: Likewise.
> >>>>>> ---
> >>>>>>  gcc/expr.c                                 | 10 ++++++++++
> >>>>>>  gcc/testsuite/gcc.target/i386/pr90773-24.c | 22 ++++++++++++++++++++++
> >>>>>>  gcc/testsuite/gcc.target/i386/pr90773-25.c | 20 ++++++++++++++++++++
> >>>>>>  3 files changed, 52 insertions(+)
> >>>>>>  create mode 100644 gcc/testsuite/gcc.target/i386/pr90773-24.c
> >>>>>>  create mode 100644 gcc/testsuite/gcc.target/i386/pr90773-25.c
> >>>>>>
> >>>>>> diff --git a/gcc/expr.c b/gcc/expr.c
> >>>>>> index d09ee42e262..80e01ea1cbe 100644
> >>>>>> --- a/gcc/expr.c
> >>>>>> +++ b/gcc/expr.c
> >>>>>> @@ -10886,6 +10886,16 @@ expand_expr_real_1 (tree exp, rtx target, machine_mode tmode,
> >>>>>>        unsigned HOST_WIDE_INT ix;
> >>>>>>        tree field, value;
> >>>>>>
> >>>>>> +      /* Check if it is more efficient to load the data from
> >>>>>> +         the memory directly.  FIXME: How many stores do we
> >>>>>> +         need here if not moved by pieces?  */
> >>>>>> +      unsigned HOST_WIDE_INT bytes
> >>>>>> +        = tree_to_uhwi (TYPE_SIZE_UNIT (type));
> >>>>>
> >>>>> that's prone to fail - it could be a VLA.
> >>>>
> >>>> What do you mean by fail?  Is it ICE or missed optimization?
> >>>> Do you have a testcase?
> >>>>
> >>>>>
> >>>>>> +      if ((bytes / UNITS_PER_WORD) > 2
> >>>>>> +          && MOVE_MAX_PIECES > UNITS_PER_WORD
> >>>>>> +          && can_move_by_pieces (bytes, TYPE_ALIGN (type)))
> >>>>>> +        goto normal_inner_ref;
> >>>>>> +
> >>>>>
> >>>>> It looks like you're concerned about aggregate copies but this also handles
> >>>>> non-aggregates (which on GIMPLE might already be optimized of course).
> >>>>
> >>>> Here I check if we copy more than 2 words and we can move more than
> >>>> a word in a single instruction.
> >>>>
> >>>>> Also you say "if it's cheaper" but I see no cost considerations.  How do
> >>>>> we generally handle immed const vs. load from constant pool costs?
> >>>>
> >>>> This trades 2 (up to 8) stores with one load plus one store.  Is there
> >>>> a way to check which one is faster?
> >>>
> >>> I'm not sure - it depends on whether the target can do stores from
> >>> immediates at all or what restrictions apply, what the immediate value
> >>> actually is (zero or all-ones should be way cheaper than sth arbitrary)
> >>> and how the pressure on the load unit is.
> >>> can_move_by_pieces (bytes, TYPE_ALIGN (type)) also does not guarantee
> >>> it will actually move pieces larger than UNITS_PER_WORD, that might
> >>> depend on alignment.  There's by_pieces_ninsns that might provide some
> >>> hint here.
> >>>
> >>> I'm sure it works well for x86.
> >>>
> >>> I wonder if the existing code is in the appropriate place and we
> >>> shouldn't instead handle this somewhere upthread where we ask to copy
> >>> 'exp' into some other memory location.  For your testcase that's
> >>> expand_assignment but I can imagine passing array[0] by value to a
> >>> function resulting in similar copying.  Testing that shows we get
> >>>
> >>>         pushq   array+56(%rip)
> >>>         .cfi_def_cfa_offset 24
> >>>         pushq   array+48(%rip)
> >>>         .cfi_def_cfa_offset 32
> >>>         pushq   array+40(%rip)
> >>>         .cfi_def_cfa_offset 40
> >>>         pushq   array+32(%rip)
> >>>         .cfi_def_cfa_offset 48
> >>>         pushq   array+24(%rip)
> >>>         .cfi_def_cfa_offset 56
> >>>         pushq   array+16(%rip)
> >>>         .cfi_def_cfa_offset 64
> >>>         pushq   array+8(%rip)
> >>>         .cfi_def_cfa_offset 72
> >>>         pushq   array(%rip)
> >>>         .cfi_def_cfa_offset 80
> >>>         call    bar
> >>>
> >>> for that.  We do have the by-pieces infrastructure to generally do this
> >>> kind of copying but in both of these cases we do not seem to use it.  I
> >>> also wonder if the by-pieces infrastructure can pick up constant
> >>> initializers automagically (we could native_encode the initializer part
> >>> and feed the by-pieces infrastructure with an array of bytes).  There
> >>> might for example be parts that are easy to immediate-store and
> >>> difficult ones, where we could decide on a case-by-case basis whether
> >>> to load+store or immediate-store them.
> >>
> >> I opened:
> >>
> >> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100704
> >>
> >>> For example if I change your testcase to have the array[] initializer
> >>> all-zero we currently emit
> >>>
> >>>         pxor    %xmm0, %xmm0
> >>>         movups  %xmm0, (%rdi)
> >>>         movups  %xmm0, 16(%rdi)
> >>>         movups  %xmm0, 32(%rdi)
> >>>         movups  %xmm0, 48(%rdi)
> >>>         ret
> >>>
> >>> will your patch cause us to emit 4 loads?  OTOH if I do
> >>>
> >>> const struct S array[] = {
> >>>   { 0, 0, 0, 7241, 124764, 48, 16, 33, 10, 96, 2, 0, 0, 4 }
> >>> };
> >>>
> >>> we get
> >>>
> >>>         movq    $0, (%rdi)
> >>>         movl    $0, 8(%rdi)
> >>>         movl    $0, 12(%rdi)
> >>>         movl    $7241, 16(%rdi)
> >>> ...
> >>>
> >>> ideally we'd have sth like
> >>>
> >>>         pxor    %xmm0, %xmm0
> >>>         movups  %xmm0, (%rdi)
> >>>         movaps  array+16(%rip), %xmm0
> >>>         movups  %xmm0, 16(%rdi)
> >>> ...
> >>>
> >>> thus have the zeros written as immediates and the remaining pieces
> >>> with load+stores.
> >>>
> >>> The by-pieces infrastructure eventually gets to see
> >>>
> >>> (mem/u/c:BLK (symbol_ref:DI ("array") [flags 0x2] <var_decl
> >>> 0x7ffff7ff5b40 array>) [1 array+0 S64 A256])
> >>>
> >>> where the MEM_EXPR should provide a way to access the constant
> >>> initializer.
> >>>
> >>> That said I do agree the current code is a bit of a premature
> >>> optimization - but maybe it should be fended off in expand_constructor,
> >>> which has the cheap clear_storage first and which already does check
> >>> can_move_by_pieces with some heuristics, but that seems to be guarded by
> >>>
> >>>               || (tree_fits_uhwi_p (TYPE_SIZE_UNIT (type))
> >>>                   && (! can_move_by_pieces
> >>>                          (tree_to_uhwi (TYPE_SIZE_UNIT (type)),
> >>>                           TYPE_ALIGN (type)))
> >>>                   && ! mostly_zeros_p (exp))))
> >>>
> >>> which is odd (we _can_ move by pieces, but how does this apply to
> >>> TREE_CONSTANT CTORs and avoid_temp_mem?).
> >>>
> >>> That said, I wonder if we want to elide expand_constructor when the
> >>> CTOR is TREE_STATIC && TREE_CONSTANT and !mostly_zeros_p
> >>> and we can_move_by_pieces.
> >>>
> >>> So sth like
> >>>
> >>> diff --git a/gcc/expr.c b/gcc/expr.c
> >>> index 7139545d543..76b3bdf0c01 100644
> >>> --- a/gcc/expr.c
> >>> +++ b/gcc/expr.c
> >>> @@ -8504,6 +8504,12 @@ expand_constructor (tree exp, rtx target, enum expand_modifier modifier,
> >>>                   && (! can_move_by_pieces
> >>>                          (tree_to_uhwi (TYPE_SIZE_UNIT (type)),
> >>>                           TYPE_ALIGN (type)))
> >>> +                 && ! mostly_zeros_p (exp))
> >>> +             || (TREE_CONSTANT (exp)
> >>> +                 && tree_fits_uhwi_p (TYPE_SIZE_UNIT (type))
> >>> +                 && (can_move_by_pieces
> >>> +                        (tree_to_uhwi (TYPE_SIZE_UNIT (type)),
> >>> +                         TYPE_ALIGN (type)))
> >>>                   && ! mostly_zeros_p (exp))))
> >>>       || ((modifier == EXPAND_INITIALIZER || modifier == EXPAND_CONST_ADDRESS)
> >>>           && TREE_CONSTANT (exp)))
> >>>
> >>> which handles your initializer and the all-zero one optimally?
> >>>
> >>
> >> It works.  Here is the updated patch.
> >
> > So just looking at the code again I think we probably want to add
> > && avoid_temp_mem here, at least that's the case we're looking
> > at.  Not sure if we ever arrive with TREE_CONSTANT CTORs
> > and !avoid_temp_mem but if so we'd create a temporary here
> > which of course would be pointless.
> >
> > So maybe it's then clearer to split the condition out as
> >
> > diff --git a/gcc/expr.c b/gcc/expr.c
> > index 7139545d543..ee8f25f9abd 100644
> > --- a/gcc/expr.c
> > +++ b/gcc/expr.c
> > @@ -8523,6 +8523,19 @@ expand_constructor (tree exp, rtx target, enum expand_modifier modifier,
> >        return constructor;
> >      }
> >
> > +  /* If the CTOR is available in static storage and not mostly
> > +     zeros and we can move it by pieces prefer to do so since
> > +     that's usually more efficient than performing a series of
> > +     stores from immediates.  */
> > +  if (avoid_temp_mem
> > +      && TREE_STATIC (exp)
> > +      && TREE_CONSTANT (exp)
> > +      && tree_fits_uhwi_p (TYPE_SIZE_UNIT (type))
> > +      && can_move_by_pieces (tree_to_uhwi (TYPE_SIZE_UNIT (type)),
> > +                             TYPE_ALIGN (type))
> > +      && ! mostly_zeros_p (exp))
> > +    return NULL_RTX;
> > +
> >    /* Handle calls that pass values in multiple non-contiguous
> >       locations.  The Irix 6 ABI has examples of this.  */
> >    if (target == 0 || ! safe_from_p (target, exp, 1)
> >
> >
> > OK with that change.
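For reference, here is a rough sketch of the kind of testcase this covers.
The struct layout is only a reconstruction from the initializer and the
stores quoted above, so it is illustrative and not necessarily the actual
pr90773-24.c source:

struct S
{
  long long a;
  int b, c, d, e;
  int f, g, h, i, j, k, l, m, n;
};

const struct S array[] = {
  { 0, 0, 0, 7241, 124764, 48, 16, 33, 10, 96, 2, 0, 0, 4 }
};

void
foo (struct S *p)
{
  /* 64-byte copy from a constant CTOR in static storage.  */
  *p = array[0];
}

With the expand_constructor change, a copy like this is expected to take
the by-pieces path, i.e. loads from array's static initializer plus stores
(movups/vmovdqu pairs on x86) instead of a long series of immediate stores,
while a mostly-zero initializer keeps using the cheap clear_storage path.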
> >
>
> Note however (I've been playing with the previous version)
> that the test case
>
> FAIL: gcc.target/i386/pr90773-25.c scan-assembler-times vmovdqu[\\\\t ]%ymm[0-9]+, \\\\(%[^,]+\\\\) 1
> FAIL: gcc.target/i386/pr90773-25.c scan-assembler-times vmovdqu[\\\\t ]%ymm[0-9]+, 32\\\\(%[^,]+\\\\) 1
>
> fails for --target_board=unix
>
> $ grep movdqu pr90773-25.s
>         vmovdqu %xmm0, (%rdi)
>         vmovdqu %xmm1, 16(%rdi)
>         vmovdqu %xmm2, 32(%rdi)
>         vmovdqu %xmm3, 48(%rdi)
>
> while the test expects %ymm
> /* { dg-final { scan-assembler-times "vmovdqu\[\\t \]%ymm\[0-9\]+, \\(%\[\^,\]+\\)" 1 } } */
> /* { dg-final { scan-assembler-times "vmovdqu\[\\t \]%ymm\[0-9\]+, 32\\(%\[\^,\]+\\)" 1 } } */
>
> and
>
> FAIL: gcc.target/i386/pr90773-24.c scan-assembler-times movups[\\\\t ]%xmm[0-9]+, \\\\(%[^,]+\\\\) 1
> FAIL: gcc.target/i386/pr90773-24.c scan-assembler-times movups[\\\\t ]%xmm[0-9]+, 16\\\\(%[^,]+\\\\) 1
> FAIL: gcc.target/i386/pr90773-24.c scan-assembler-times movups[\\\\t ]%xmm[0-9]+, 32\\\\(%[^,]+\\\\) 1
> FAIL: gcc.target/i386/pr90773-24.c scan-assembler-times movups[\\\\t ]%xmm[0-9]+, 48\\\\(%[^,]+\\\\) 1
> FAIL: gcc.target/i386/pr90773-25.c scan-assembler-times vmovdqu[\\\\t ]%ymm[0-9]+, \\\\(%[^,]+\\\\) 1
> FAIL: gcc.target/i386/pr90773-25.c scan-assembler-times vmovdqu[\\\\t ]%ymm[0-9]+, 32\\\\(%[^,]+\\\\) 1
> FAIL: gcc.target/i386/pr90773-26.c scan-assembler-times pxor[\\\\t ]%xmm[0-9]+, %xmm[0-9]+ 1
> FAIL: gcc.target/i386/pr90773-26.c scan-assembler-times movups[\\\\t ]%xmm[0-9]+, \\\\(%[^,]+\\\\) 1
> FAIL: gcc.target/i386/pr90773-26.c scan-assembler-times movups[\\\\t ]%xmm[0-9]+, 16\\\\(%[^,]+\\\\) 1
> FAIL: gcc.target/i386/pr90773-26.c scan-assembler-times movups[\\\\t ]%xmm[0-9]+, 32\\\\(%[^,]+\\\\) 1
> FAIL: gcc.target/i386/pr90773-26.c scan-assembler-times movups[\\\\t ]%xmm[0-9]+, 48\\\\(%[^,]+\\\\) 1
>
> fails for --target_board=unix/-m32
>
The whole patch set is needed.  My users/hjl/pieces/hook branch is at

https://gitlab.com/x86-gcc/gcc/-/tree/users/hjl/pieces/hook

I got

[hjl@gnu-cfl-2 testsuite]$ /export/build/gnu/tools-build/gcc-gitlab-debug/build-x86_64-linux/gcc/xgcc -B/export/build/gnu/tools-build/gcc-gitlab-debug/build-x86_64-linux/gcc/ /export/gnu/import/git/gitlab/x86-gcc/gcc/testsuite/gcc.target/i386/pr90773-25.c -fdiagnostics-plain-output -O2 -march=skylake -ffat-lto-objects -fno-ident -S -o pr90773-25.s
[hjl@gnu-cfl-2 testsuite]$ cat pr90773-25.s
	.file	"pr90773-25.c"
	.text
	.p2align 4
	.globl	foo
	.type	foo, @function
foo:
.LFB0:
	.cfi_startproc
	vpxor	%xmm0, %xmm0, %xmm0
	vmovdqu	%xmm0, (%rdi)
	vmovdqu	%xmm0, 16(%rdi)
	vmovdqu	%xmm0, 32(%rdi)
	vmovdqu	%xmm0, 48(%rdi)
	ret
	.cfi_endproc
.LFE0:
	.size	foo, .-foo
	.globl	array
	.section	.rodata
	.align 32
	.type	array, @object
	.size	array, 64
array:
	.zero	64
	.section	.note.GNU-stack,"",@progbits
[hjl@gnu-cfl-2 testsuite]$

-- 
H.J.