Richard Biener <richard.guent...@gmail.com> writes:
>> Am 08.08.2024 um 15:12 schrieb Richard Sandiford <richard.sandif...@arm.com>:
>>> PR tree-optimization/116274
>>> 	* tree-vect-slp.cc (vect_bb_slp_scalar_cost): Cost scalar loads
>>> 	and stores as simple scalar stmts when they access a non-global,
>>> 	not address-taken variable that doesn't have BLKmode assigned.
>>>
>>> 	* gcc.target/i386/pr116274.c: New testcase.
>>> ---
>>>  gcc/testsuite/gcc.target/i386/pr116274.c |  9 +++++++++
>>>  gcc/tree-vect-slp.cc                     | 12 +++++++++++-
>>>  2 files changed, 20 insertions(+), 1 deletion(-)
>>>  create mode 100644 gcc/testsuite/gcc.target/i386/pr116274.c
>>>
>>> diff --git a/gcc/testsuite/gcc.target/i386/pr116274.c b/gcc/testsuite/gcc.target/i386/pr116274.c
>>> new file mode 100644
>>> index 00000000000..d5811344b93
>>> --- /dev/null
>>> +++ b/gcc/testsuite/gcc.target/i386/pr116274.c
>>> @@ -0,0 +1,9 @@
>>> +/* { dg-do compile } */
>>> +/* { dg-options "-O2 -fdump-tree-slp2-optimized" } */
>>> +
>>> +struct a { long x,y; };
>>> +long test(struct a a) { return a.x+a.y; }
>>> +
>>> +/* { dg-final { scan-tree-dump-not "basic block part vectorized" "slp2" } } */
>>> +/* { dg-final { scan-assembler-times "addl|leaq" 1 } } */
>>> +/* { dg-final { scan-assembler-not "padd" } } */
>>> diff --git a/gcc/tree-vect-slp.cc b/gcc/tree-vect-slp.cc
>>> index 3464d0c0e23..e43ff721100 100644
>>> --- a/gcc/tree-vect-slp.cc
>>> +++ b/gcc/tree-vect-slp.cc
>>> @@ -7807,7 +7807,17 @@ next_lane:
>>> 	  vect_cost_for_stmt kind;
>>> 	  if (STMT_VINFO_DATA_REF (orig_stmt_info))
>>> 	    {
>>> -	      if (DR_IS_READ (STMT_VINFO_DATA_REF (orig_stmt_info)))
>>> +	      data_reference_p dr = STMT_VINFO_DATA_REF (orig_stmt_info);
>>> +	      tree base = get_base_address (DR_REF (dr));
>>> +	      /* When the scalar access is to a non-global not address-taken
>>> +		 decl that is not BLKmode assume we can access it with a single
>>> +		 non-load/store instruction. */
>>> +	      if (DECL_P (base)
>>> +		  && !is_global_var (base)
>>> +		  && !TREE_ADDRESSABLE (base)
>>> +		  && DECL_MODE (base) != BLKmode)
>>> +		kind = scalar_stmt;
>>> +	      else if (DR_IS_READ (STMT_VINFO_DATA_REF (orig_stmt_info)))
>>> 		kind = scalar_load;
>>> 	      else
>>> 		kind = scalar_store;
>>
>> LGTM FWIW, but did you consider skipping the cost altogether?
>> I'm not sure what the scalar_stmt would correspond to in practice,
>> if we assume that the ABI (for parameters/returns) or RA (for locals)
>> puts the data in a sensible register class for the datatype.
>
> On x86_64 you get up to two eightbytes in two gpr or float regs, so with
> an example with four int we’d get a scalar shift for the second and
> fourth int and with eight short you get to the point where the vector
> marshaling might be profitable.  So it’s a heuristic that says it’s
> likely not zero cost but definitely not as high as a load.  Anything
> better would need to know the actual register passings.
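As an illustration of the x86_64 case described in the quoted paragraph
(the struct and function names here are hypothetical, not taken from the
thread): a struct of four ints is passed in two eightbytes, i.e. two GPRs,
so the second and fourth int sit in the upper halves of their registers
and need a shift to extract -- not free, but cheaper than real loads:

struct four { int a, b, c, d; };	/* two eightbytes -> two GPRs */
int
sum4 (struct four s)
{
  /* s.b and s.d occupy the high 32 bits of their registers, so scalar
     code extracts them with shifts rather than memory loads.  */
  return s.a + s.b + s.c + s.d;
}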
Ah, yeah, fair enough.  I suppose that would be true for aarch64 too on
things like:

struct a { char f[4]; };
struct a test(struct a a) {
  a.f[0] += 1;
  a.f[1] += 2;
  a.f[2] += 3;
  a.f[3] += 4;
  return a;
}

It's just that there are important cases where that wouldn't happen for
floats on aarch64, and the scalar_stmt cost for floats is typically 2.
But like you say, that could be fixed later, and this should be a strict
improvement over the status quo.

Richard
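For reference, a sketch of the kind of aarch64 float case meant above
(again purely illustrative, with made-up names): a homogeneous
floating-point aggregate like this is passed with each float in its own
FP register, so the scalar accesses need no marshalling at all, yet the
heuristic would still cost them as scalar_stmt (typically 2 for floats
on aarch64):

struct fa { float x, y; };	/* HFA: passed in s0 and s1 */
float
addf (struct fa a)
{
  /* x and y already live in separate FP registers, so the scalar form
     involves no extraction work at all.  */
  return a.x + a.y;
}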