Richard Biener <rguent...@suse.de> writes:
> The following tries to address that the vectorizer fails to have
> precise knowledge of argument and return calling conventions and
> views some accesses as loads and stores that are not.  This is
> mainly important when doing basic-block vectorization, as otherwise
> loop indexing would force such arguments to memory.
>
> On x86 the reduction in the number of apparent loads and stores
> often dominates cost analysis, so the following tries to mitigate
> this aggressively by adjusting only the scalar load and store cost,
> reducing them to the cost of a simple scalar statement, but not
> touching the vector access cost, which would be much harder to
> estimate.  Thereby we err on the side of not performing basic-block
> vectorization.
>
> Bootstrapped and tested on x86_64-unknown-linux-gnu.
>
> Richard - we can of course do this adjustment in the backend as
> well, but it might be worthwhile in generic code.  Do you see
> similar issues on arm?
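For concreteness, the check described above distinguishes cases like
the following (a hand-written sketch, not part of the patch; the names
are invented, and whether a given struct gets a non-BLKmode mode is
target-dependent):

struct pair { long x, y; };

struct pair g;           /* is_global_var: accesses keep their
                            scalar_load/scalar_store cost.  */

long
covered (struct pair p)  /* Non-global, not address-taken PARM_DECL;
                            if it gets a non-BLKmode mode, p.x and
                            p.y are costed as plain scalar_stmts.  */
{
  return p.x + p.y;
}

long
not_covered (struct pair p)
{
  struct pair *q = &p;   /* Taking the address usually marks p
                            TREE_ADDRESSABLE, so the cheaper costing
                            does not apply.  */
  return q->x + q->y;
}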
Yeah, a pathological case is:

struct a { float f[4]; };
struct a test(struct a a)
{
  a.f[0] += 1;
  a.f[1] += 2;
  a.f[2] += 3;
  a.f[3] += 4;
  return a;
}

which with -O2 generates:

test:
.LFB0:
        .cfi_startproc
        fmov    w1, s2
        fmov    w4, s0
        mov     x0, 0
        fmov    w3, s1
        sub     sp, sp, #16
        .cfi_def_cfa_offset 16
        mov     x2, 0
        bfi     x0, x1, 0, 32
        fmov    w1, s3
        bfi     x2, x4, 0, 32
        bfi     x2, x3, 32, 32
        bfi     x0, x1, 32, 32
        adrp    x1, .LC0
        stp     x2, x0, [sp]
        ldr     q30, [sp]
        ldr     q31, [x1, #:lo12:.LC0]
        add     sp, sp, 16
        .cfi_def_cfa_offset 0
        fadd    v31.4s, v30.4s, v31.4s
        umov    x0, v31.d[0]
        umov    x1, v31.d[1]
        mov     x3, x0
        lsr     x4, x0, 32
        lsr     x0, x1, 32
        fmov    s1, w4
        fmov    s3, w0
        fmov    s2, w1
        lsr     w0, w3, 0
        fmov    s0, w0
        ret
        .cfi_endproc

Admittedly most of the badness there would probably be fixed by
parameter and return fsra (Jiufu Guo's patch), but it still doesn't
make much sense to marshal 4 separate floats into one vector for a
single addition, only to tear it apart into 4 separate floats
afterwards.  We should just do four scalar additions instead.
(The patch doesn't fix this case, although it does trigger.)

>	PR tree-optimization/116274
>	* tree-vect-slp.cc (vect_bb_slp_scalar_cost): Cost scalar loads
>	and stores as simple scalar stmts when they access a non-global,
>	not address-taken variable that doesn't have BLKmode assigned.
>
>	* gcc.target/i386/pr116274.c: New testcase.
> ---
>  gcc/testsuite/gcc.target/i386/pr116274.c |  9 +++++++++
>  gcc/tree-vect-slp.cc                     | 12 +++++++++++-
>  2 files changed, 20 insertions(+), 1 deletion(-)
>  create mode 100644 gcc/testsuite/gcc.target/i386/pr116274.c
>
> diff --git a/gcc/testsuite/gcc.target/i386/pr116274.c b/gcc/testsuite/gcc.target/i386/pr116274.c
> new file mode 100644
> index 00000000000..d5811344b93
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/i386/pr116274.c
> @@ -0,0 +1,9 @@
> +/* { dg-do compile } */
> +/* { dg-options "-O2 -fdump-tree-slp2-optimized" } */
> +
> +struct a { long x,y; };
> +long test(struct a a) { return a.x+a.y; }
> +
> +/* { dg-final { scan-tree-dump-not "basic block part vectorized" "slp2" } } */
> +/* { dg-final { scan-assembler-times "addl|leaq" 1 } } */
> +/* { dg-final { scan-assembler-not "padd" } } */
> diff --git a/gcc/tree-vect-slp.cc b/gcc/tree-vect-slp.cc
> index 3464d0c0e23..e43ff721100 100644
> --- a/gcc/tree-vect-slp.cc
> +++ b/gcc/tree-vect-slp.cc
> @@ -7807,7 +7807,17 @@ next_lane:
>        vect_cost_for_stmt kind;
>        if (STMT_VINFO_DATA_REF (orig_stmt_info))
> 	{
> -	  if (DR_IS_READ (STMT_VINFO_DATA_REF (orig_stmt_info)))
> +	  data_reference_p dr = STMT_VINFO_DATA_REF (orig_stmt_info);
> +	  tree base = get_base_address (DR_REF (dr));
> +	  /* When the scalar access is to a non-global not address-taken
> +	     decl that is not BLKmode assume we can access it with a single
> +	     non-load/store instruction.  */
> +	  if (DECL_P (base)
> +	      && !is_global_var (base)
> +	      && !TREE_ADDRESSABLE (base)
> +	      && DECL_MODE (base) != BLKmode)
> +	    kind = scalar_stmt;
> +	  else if (DR_IS_READ (STMT_VINFO_DATA_REF (orig_stmt_info)))
> 	    kind = scalar_load;
> 	  else
> 	    kind = scalar_store;

LGTM FWIW, but did you consider skipping the cost altogether?  I'm not
sure what the scalar_stmt would correspond to in practice, if we assume
that the ABI (for parameters/returns) or RA (for locals) puts the data
in a sensible register class for the datatype.

Thanks,
Richard
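P.S. For concreteness, "skipping the cost altogether" would be roughly
the following untested variation on the quoted hunk (whether `continue'
is the right way to move on to the next scalar stmt depends on the
surrounding loop, which the hunk doesn't show):

	  data_reference_p dr = STMT_VINFO_DATA_REF (orig_stmt_info);
	  tree base = get_base_address (DR_REF (dr));
	  /* Assume a non-global, not address-taken, non-BLKmode decl
	     stays in registers, so its accesses cost nothing extra.  */
	  if (DECL_P (base)
	      && !is_global_var (base)
	      && !TREE_ADDRESSABLE (base)
	      && DECL_MODE (base) != BLKmode)
	    continue;
	  else if (DR_IS_READ (dr))
	    kind = scalar_load;
	  else
	    kind = scalar_store;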