On 5/20/21 4:03 PM, H.J. Lu wrote:
> On Thu, May 20, 2021 at 12:51 AM Richard Biener
> <richard.guent...@gmail.com> wrote:
>>
>> On Wed, May 19, 2021 at 3:22 PM H.J. Lu <hjl.to...@gmail.com> wrote:
>>>
>>> On Wed, May 19, 2021 at 2:33 AM Richard Biener
>>> <richard.guent...@gmail.com> wrote:
>>>>
>>>> On Tue, May 18, 2021 at 9:16 PM H.J. Lu <hjl.to...@gmail.com> wrote:
>>>>>
>>>>> When expanding a constant constructor, don't call expand_constructor if
>>>>> it is more efficient to load the data from the memory via move by pieces.
>>>>>
>>>>> gcc/
>>>>>
>>>>>         PR middle-end/90773
>>>>>         * expr.c (expand_expr_real_1): Don't call expand_constructor if
>>>>>         it is more efficient to load the data from the memory.
>>>>>
>>>>> gcc/testsuite/
>>>>>
>>>>>         PR middle-end/90773
>>>>>         * gcc.target/i386/pr90773-24.c: New test.
>>>>>         * gcc.target/i386/pr90773-25.c: Likewise.
>>>>> ---
>>>>>  gcc/expr.c                                 | 10 ++++++++++
>>>>>  gcc/testsuite/gcc.target/i386/pr90773-24.c | 22 ++++++++++++++++++++++
>>>>>  gcc/testsuite/gcc.target/i386/pr90773-25.c | 20 ++++++++++++++++++++
>>>>>  3 files changed, 52 insertions(+)
>>>>>  create mode 100644 gcc/testsuite/gcc.target/i386/pr90773-24.c
>>>>>  create mode 100644 gcc/testsuite/gcc.target/i386/pr90773-25.c
>>>>>
>>>>> diff --git a/gcc/expr.c b/gcc/expr.c
>>>>> index d09ee42e262..80e01ea1cbe 100644
>>>>> --- a/gcc/expr.c
>>>>> +++ b/gcc/expr.c
>>>>> @@ -10886,6 +10886,16 @@ expand_expr_real_1 (tree exp, rtx target, machine_mode tmode,
>>>>>             unsigned HOST_WIDE_INT ix;
>>>>>             tree field, value;
>>>>>
>>>>> +           /* Check if it is more efficient to load the data from
>>>>> +              the memory directly.  FIXME: How many stores do we
>>>>> +              need here if not moved by pieces?  */
>>>>> +           unsigned HOST_WIDE_INT bytes
>>>>> +             = tree_to_uhwi (TYPE_SIZE_UNIT (type));
>>>>
>>>> that's prone to fail - it could be a VLA.
>>>
>>> What do you mean by fail?  Is it ICE or missed optimization?
>>> Do you have a testcase?
>>>
>>>>
>>>>> +           if ((bytes / UNITS_PER_WORD) > 2
>>>>> +               && MOVE_MAX_PIECES > UNITS_PER_WORD
>>>>> +               && can_move_by_pieces (bytes, TYPE_ALIGN (type)))
>>>>> +             goto normal_inner_ref;
>>>>> +
>>>>
>>>> It looks like you're concerned about aggregate copies but this also handles
>>>> non-aggregates (which on GIMPLE might already be optimized of course).
>>>
>>> Here I check if we copy more than 2 words and we can move more than
>>> a word in a single instruction.
>>>
>>>> Also you say "if it's cheaper" but I see no cost considerations.  How do
>>>> we generally handle immed const vs. load from constant pool costs?
>>>
>>> This trades 2 (up to 8) stores with one load plus one store.  Is there
>>> a way to check which one is faster?
>>
>> I'm not sure - it depends on whether the target can do stores from immediates
>> at all or what restrictions apply, what the immediate value actually is
>> (zero or all-ones should be way cheaper than sth arbitrary) and how the
>> pressure on the load unit is.  can_move_by_pieces (bytes, TYPE_ALIGN (type))
>> also does not guarantee it will actually move pieces larger than
>> UNITS_PER_WORD, that might depend on alignment.  There's by_pieces_ninsns
>> that might provide some hint here.
>>
>> I'm sure it works well for x86.
>>
>> I wonder if the existing code is in the appropriate place and we shouldn't
>> instead handle this somewhere upthread where we ask to copy 'exp' into some
>> other memory location.  For your testcase that's expand_assignment but I can
>> imagine passing array[0] by value to a function resulting in similar copying.
>> Testing that shows we get
>>
>>         pushq   array+56(%rip)
>>         .cfi_def_cfa_offset 24
>>         pushq   array+48(%rip)
>>         .cfi_def_cfa_offset 32
>>         pushq   array+40(%rip)
>>         .cfi_def_cfa_offset 40
>>         pushq   array+32(%rip)
>>         .cfi_def_cfa_offset 48
>>         pushq   array+24(%rip)
>>         .cfi_def_cfa_offset 56
>>         pushq   array+16(%rip)
>>         .cfi_def_cfa_offset 64
>>         pushq   array+8(%rip)
>>         .cfi_def_cfa_offset 72
>>         pushq   array(%rip)
>>         .cfi_def_cfa_offset 80
>>         call    bar
>>
>> for that.  We do have the by-pieces infrastructure to generally do this
>> kind of copying but in both of these cases we do not seem to use it.  I
>> also wonder if the by-pieces infrastructure can pick up constant
>> initializers automagically (we could native_encode the initializer part
>> and feed the by-pieces infrastructure with an array of bytes).  There for
>> example might be easy to immediate-store byte parts and difficult ones
>> where we could decide on a case-by-case basis whether to load+store or
>> immediate-store them.
>
> I opened:
>
> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100704
>
>> For example if I change your testcase to have the array[] initializer
>> all-zero we currently emit
>>
>>         pxor    %xmm0, %xmm0
>>         movups  %xmm0, (%rdi)
>>         movups  %xmm0, 16(%rdi)
>>         movups  %xmm0, 32(%rdi)
>>         movups  %xmm0, 48(%rdi)
>>         ret
>>
>> will your patch cause us to emit 4 loads?  OTOH if I do
>>
>> const struct S array[] = {
>>   { 0, 0, 0, 7241, 124764, 48, 16, 33, 10, 96, 2, 0, 0, 4 }
>> };
>>
>> we get
>>
>>         movq    $0, (%rdi)
>>         movl    $0, 8(%rdi)
>>         movl    $0, 12(%rdi)
>>         movl    $7241, 16(%rdi)
>> ...
>>
>> ideally we'd have sth like
>>
>>         pxor    %xmm0, %xmm0
>>         movups  %xmm0, (%rdi)
>>         movaps  array+16(%rip), %xmm0
>>         movups  %xmm0, 16(%rdi)
>> ...
>>
>> thus have the zeros written as immediates and the remaining pieces
>> with load+stores.
>>
>> The by-pieces infrastructure eventually gets to see
>>
>> (mem/u/c:BLK (symbol_ref:DI ("array") [flags 0x2] <var_decl
>> 0x7ffff7ff5b40 array>) [1 array+0 S64 A256])
>>
>> where the MEM_EXPR should provide a way to access the constant initializer.
>>
>> That said I do agree the current code is a bit premature optimization -
>> but maybe it should be fended off in expand_constructor which has the
>> cheap clear_storage first and which already does check can_move_by_pieces
>> with some heuristics, but that seems to be guarded by
>>
>>               || (tree_fits_uhwi_p (TYPE_SIZE_UNIT (type))
>>                   && (! can_move_by_pieces
>>                          (tree_to_uhwi (TYPE_SIZE_UNIT (type)),
>>                           TYPE_ALIGN (type)))
>>                   && ! mostly_zeros_p (exp))))
>>
>> which is odd (we _can_ move by pieces, but how does this apply to
>> TREE_CONSTANT CTORs and avoid_temp_mem?).
>>
>> That said, I wonder if we want to elide expand_constructor when the
>> CTOR is TREE_STATIC && TREE_CONSTANT and !mostly_zeros_p
>> and we can_move_by_pieces.
>>
>> So sth like
>>
>> diff --git a/gcc/expr.c b/gcc/expr.c
>> index 7139545d543..76b3bdf0c01 100644
>> --- a/gcc/expr.c
>> +++ b/gcc/expr.c
>> @@ -8504,6 +8504,12 @@ expand_constructor (tree exp, rtx target, enum expand_modifier modifier,
>>                    && (! can_move_by_pieces
>>                           (tree_to_uhwi (TYPE_SIZE_UNIT (type)),
>>                            TYPE_ALIGN (type)))
>> +                  && ! mostly_zeros_p (exp))
>> +              || (TREE_CONSTANT (exp)
>> +                  && tree_fits_uhwi_p (TYPE_SIZE_UNIT (type))
>> +                  && (can_move_by_pieces
>> +                         (tree_to_uhwi (TYPE_SIZE_UNIT (type)),
>> +                          TYPE_ALIGN (type)))
Just a minor nit: superfluous parentheses around can_move_by_pieces here.


Bernd.

>>                   && ! mostly_zeros_p (exp))))
>>            || ((modifier == EXPAND_INITIALIZER
>>                 || modifier == EXPAND_CONST_ADDRESS)
>>                && TREE_CONSTANT (exp)))
>>
>> which handles your initializer and the all-zero one optimal?
>>
>
> It works.  Here is the updated patch.
>
> Thanks.
>