> On 08.08.2024, at 15:12, Richard Sandiford <richard.sandif...@arm.com> wrote:
>
> Richard Biener <rguent...@suse.de> writes:
>> The following tries to address that the vectorizer fails to have
>> precise knowledge of argument and return calling conventions and
>> views some accesses as loads and stores that are not.
>> This is mainly important when doing basic-block vectorization as
>> otherwise loop indexing would force such arguments to memory.
>>
>> On x86 the reduction in the number of apparent loads and stores
>> often dominates cost analysis so the following tries to mitigate
>> this aggressively by adjusting only the scalar load and store
>> cost, reducing them to the cost of a simple scalar statement,
>> but not touching the vector access cost which would be much
>> harder to estimate. Thereby we err on the side of not performing
>> basic-block vectorization.
>>
>> Bootstrapped and tested on x86_64-unknown-linux-gnu.
>>
>> Richard - we can of course do this adjustment in the backend as well
>> but it might be worthwhile in generic code. Do you see similar
>> issues on arm?
>
> Yeah, a pathological case is:
>
> struct a { float f[4]; };
> struct a test(struct a a) {
> a.f[0] += 1;
> a.f[1] += 2;
> a.f[2] += 3;
> a.f[3] += 4;
> return a;
> }
>
> which with -O2 generates:
>
> test:
> .LFB0:
> .cfi_startproc
> fmov w1, s2
> fmov w4, s0
> mov x0, 0
> fmov w3, s1
> sub sp, sp, #16
> .cfi_def_cfa_offset 16
> mov x2, 0
> bfi x0, x1, 0, 32
> fmov w1, s3
> bfi x2, x4, 0, 32
> bfi x2, x3, 32, 32
> bfi x0, x1, 32, 32
> adrp x1, .LC0
> stp x2, x0, [sp]
> ldr q30, [sp]
> ldr q31, [x1, #:lo12:.LC0]
> add sp, sp, 16
> .cfi_def_cfa_offset 0
> fadd v31.4s, v30.4s, v31.4s
> umov x0, v31.d[0]
> umov x1, v31.d[1]
> mov x3, x0
> lsr x4, x0, 32
> lsr x0, x1, 32
> fmov s1, w4
> fmov s3, w0
> fmov s2, w1
> lsr w0, w3, 0
> fmov s0, w0
> ret
> .cfi_endproc
>
> Admittedly most of the badness there would probably be fixed by
> parameter and return fsra (Jiufu Guo's patch), but it still doesn't
> make much sense to marshall 4 separate floats into one vector for
> a single addition, only to tear it apart into 4 separate floats
> afterwards. We should just do four scalar additions instead.
>
> (The patch doesn't fix this case, although it does trigger.)
>
>> PR tree-optimization/116274
>> * tree-vect-slp.cc (vect_bb_slp_scalar_cost): Cost scalar loads
>> and stores as simple scalar stmts when they access a non-global,
>> not address-taken variable that doesn't have BLKmode assigned.
>>
>> * gcc.target/i386/pr116274.c: New testcase.
>> ---
>> gcc/testsuite/gcc.target/i386/pr116274.c | 9 +++++++++
>> gcc/tree-vect-slp.cc | 12 +++++++++++-
>> 2 files changed, 20 insertions(+), 1 deletion(-)
>> create mode 100644 gcc/testsuite/gcc.target/i386/pr116274.c
>>
>> diff --git a/gcc/testsuite/gcc.target/i386/pr116274.c b/gcc/testsuite/gcc.target/i386/pr116274.c
>> new file mode 100644
>> index 00000000000..d5811344b93
>> --- /dev/null
>> +++ b/gcc/testsuite/gcc.target/i386/pr116274.c
>> @@ -0,0 +1,9 @@
>> +/* { dg-do compile } */
>> +/* { dg-options "-O2 -fdump-tree-slp2-optimized" } */
>> +
>> +struct a { long x,y; };
>> +long test(struct a a) { return a.x+a.y; }
>> +
>> +/* { dg-final { scan-tree-dump-not "basic block part vectorized" "slp2" } } */
>> +/* { dg-final { scan-assembler-times "addl|leaq" 1 } } */
>> +/* { dg-final { scan-assembler-not "padd" } } */
>> diff --git a/gcc/tree-vect-slp.cc b/gcc/tree-vect-slp.cc
>> index 3464d0c0e23..e43ff721100 100644
>> --- a/gcc/tree-vect-slp.cc
>> +++ b/gcc/tree-vect-slp.cc
>> @@ -7807,7 +7807,17 @@ next_lane:
>> vect_cost_for_stmt kind;
>> if (STMT_VINFO_DATA_REF (orig_stmt_info))
>> {
>> - if (DR_IS_READ (STMT_VINFO_DATA_REF (orig_stmt_info)))
>> + data_reference_p dr = STMT_VINFO_DATA_REF (orig_stmt_info);
>> + tree base = get_base_address (DR_REF (dr));
>> + /* When the scalar access is to a non-global not address-taken
>> + decl that is not BLKmode assume we can access it with a single
>> + non-load/store instruction. */
>> + if (DECL_P (base)
>> + && !is_global_var (base)
>> + && !TREE_ADDRESSABLE (base)
>> + && DECL_MODE (base) != BLKmode)
>> + kind = scalar_stmt;
>> + else if (DR_IS_READ (STMT_VINFO_DATA_REF (orig_stmt_info)))
>> kind = scalar_load;
>> else
>> kind = scalar_store;
>
> LGTM FWIW, but did you consider skipping the cost altogether?
> I'm not sure what the scalar_stmt would correspond to in practice,
> if we assume that the ABI (for parameters/returns) or RA (for locals)
> puts the data in a sensible register class for the datatype.
On x86_64 you get up to two eightbytes in two GPRs or float regs, so with an
example with four ints we'd get a scalar shift for the second and fourth int,
and with eight shorts you get to the point where the vector marshalling might
be profitable.  So it's a heuristic that says the access is likely not zero
cost but definitely not as expensive as a load.  Anything better would need
to know the actual argument-passing registers.
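
For illustration, a minimal sketch (the struct and function names are made up
here, and it assumes the usual SysV x86_64 classification of such a struct
into two INTEGER eightbytes):

struct four { int a, b, c, d; };   /* two eightbytes, arriving in %rdi/%rsi */

int sum4 (struct four f)
{
  /* f.a and f.c sit in the low 32 bits of the two incoming GPRs, while
     f.b and f.d need a 32-bit shift to extract - a scalar stmt rather
     than a load once the incoming register assignment is known.  */
  return f.a + f.b + f.c + f.d;
}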
I still have to look at the expand SRA patch - eventually the part computing
the incoming and outgoing register assignment could be split out and invoked
by the vectorizer.
Note I tried to come up with a patch suitable for backporting, since on x86
this is a regression in GCC 14, where we added a lot more reduc_scal
patterns.
Richard
>
> Thanks,
> Richard