> Am 08.08.2024 um 15:12 schrieb Richard Sandiford <richard.sandif...@arm.com>:
> 
> Richard Biener <rguent...@suse.de> writes:
>> The following tries to address that the vectorizer fails to have
>> precise knowledge of argument and return calling conventions and
>> views some accesses as loads and stores that are not.
>> This is mainly important when doing basic-block vectorization as
>> otherwise loop indexing would force such arguments to memory.
>> 
>> On x86 the reduction in the number of apparent loads and stores
>> often dominates cost analysis so the following tries to mitigate
>> this aggressively by adjusting only the scalar load and store
>> cost, reducing them to the cost of a simple scalar statement,
>> but not touching the vector access cost which would be much
>> harder to estimate.  Thereby we err on the side of not performing
>> basic-block vectorization.
>> 
>> Bootstrapped and tested on x86_64-unknown-linux-gnu.
>> 
>> Richard - we can of course do this adjustment in the backend as well
>> but it might be worthwhile in generic code.  Do you see similar
>> issues on arm?
> 
> Yeah, a pathological case is:
> 
> struct a { float f[4]; };
> struct a test(struct a a) {
>  a.f[0] += 1;
>  a.f[1] += 2;
>  a.f[2] += 3;
>  a.f[3] += 4;
>  return a;
> }
> 
> which with -O2 generates:
> 
> test:
> .LFB0:
>        .cfi_startproc
>        fmov    w1, s2
>        fmov    w4, s0
>        mov     x0, 0
>        fmov    w3, s1
>        sub     sp, sp, #16
>        .cfi_def_cfa_offset 16
>        mov     x2, 0
>        bfi     x0, x1, 0, 32
>        fmov    w1, s3
>        bfi     x2, x4, 0, 32
>        bfi     x2, x3, 32, 32
>        bfi     x0, x1, 32, 32
>        adrp    x1, .LC0
>        stp     x2, x0, [sp]
>        ldr     q30, [sp]
>        ldr     q31, [x1, #:lo12:.LC0]
>        add     sp, sp, 16
>        .cfi_def_cfa_offset 0
>        fadd    v31.4s, v30.4s, v31.4s
>        umov    x0, v31.d[0]
>        umov    x1, v31.d[1]
>        mov     x3, x0
>        lsr     x4, x0, 32
>        lsr     x0, x1, 32
>        fmov    s1, w4
>        fmov    s3, w0
>        fmov    s2, w1
>        lsr     w0, w3, 0
>        fmov    s0, w0
>        ret
>        .cfi_endproc
> 
> Admittedly most of the badness there would probably be fixed by
> parameter and return fsra (Jiufu Guo's patch), but it still doesn't
> make much sense to marshall 4 separate floats into one vector for
> a single addition, only to tear it apart into 4 separate floats
> afterwards.  We should just do four scalar additions instead.
> 
> (The patch doesn't fix this case, although it does trigger.)
> 
>>    PR tree-optimization/116274
>>    * tree-vect-slp.cc (vect_bb_slp_scalar_cost): Cost scalar loads
>>    and stores as simple scalar stmts when they access a non-global,
>>    not address-taken variable that doesn't have BLKmode assigned.
>> 
>>    * gcc.target/i386/pr116274.c: New testcase.
>> ---
>> gcc/testsuite/gcc.target/i386/pr116274.c |  9 +++++++++
>> gcc/tree-vect-slp.cc                     | 12 +++++++++++-
>> 2 files changed, 20 insertions(+), 1 deletion(-)
>> create mode 100644 gcc/testsuite/gcc.target/i386/pr116274.c
>> 
>> diff --git a/gcc/testsuite/gcc.target/i386/pr116274.c b/gcc/testsuite/gcc.target/i386/pr116274.c
>> new file mode 100644
>> index 00000000000..d5811344b93
>> --- /dev/null
>> +++ b/gcc/testsuite/gcc.target/i386/pr116274.c
>> @@ -0,0 +1,9 @@
>> +/* { dg-do compile } */
>> +/* { dg-options "-O2 -fdump-tree-slp2-optimized" } */
>> +
>> +struct a { long x,y; };
>> +long test(struct a a) { return a.x+a.y; }
>> +
>> +/* { dg-final { scan-tree-dump-not "basic block part vectorized" "slp2" } } */
>> +/* { dg-final { scan-assembler-times "addl|leaq" 1 } } */
>> +/* { dg-final { scan-assembler-not "padd" } } */
>> diff --git a/gcc/tree-vect-slp.cc b/gcc/tree-vect-slp.cc
>> index 3464d0c0e23..e43ff721100 100644
>> --- a/gcc/tree-vect-slp.cc
>> +++ b/gcc/tree-vect-slp.cc
>> @@ -7807,7 +7807,17 @@ next_lane:
>>       vect_cost_for_stmt kind;
>>       if (STMT_VINFO_DATA_REF (orig_stmt_info))
>>    {
>> -      if (DR_IS_READ (STMT_VINFO_DATA_REF (orig_stmt_info)))
>> +      data_reference_p dr = STMT_VINFO_DATA_REF (orig_stmt_info);
>> +      tree base = get_base_address (DR_REF (dr));
>> +      /* When the scalar access is to a non-global not address-taken
>> +         decl that is not BLKmode assume we can access it with a single
>> +         non-load/store instruction.  */
>> +      if (DECL_P (base)
>> +          && !is_global_var (base)
>> +          && !TREE_ADDRESSABLE (base)
>> +          && DECL_MODE (base) != BLKmode)
>> +        kind = scalar_stmt;
>> +      else if (DR_IS_READ (STMT_VINFO_DATA_REF (orig_stmt_info)))
>>        kind = scalar_load;
>>      else
>>        kind = scalar_store;
> 
> LGTM FWIW, but did you consider skipping the cost altogether?
> I'm not sure what the scalar_stmt would correspond to in practice,
> if we assume that the ABI (for parameters/returns) or RA (for locals)
> puts the data in a sensible register class for the datatype.

On x86_64 you get up to two eightbytes in two GPRs or float regs, so for an 
example with four ints we'd get a scalar shift to extract the second and fourth 
int, and with eight shorts you reach the point where the vector marshalling might 
actually be profitable.  So it's a heuristic saying the access is likely not zero 
cost, but definitely not as expensive as a load.  Anything better would need to 
know the actual register assignment for argument passing.
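
To make that concrete, a hypothetical four-int variant of the testcase (not 
part of the patch):

/* Hypothetical example, not from the patch; on x86_64 the 16-byte struct
   is passed in %rdi/%rsi.  */
struct b { int x[4]; };
int test4 (struct b b) { return b.x[0] + b.x[1] + b.x[2] + b.x[3]; }

Here b.x[0] and b.x[2] already sit in the low 32 bits of %rdi and %rsi, while 
b.x[1] and b.x[3] each need a 32-bit right shift before they can be used as 
scalars - hence scalar_stmt rather than zero cost.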

I still have to look at the expand SRA thing - eventually the part that computes 
the incoming and outgoing register assignment could be split out and invoked by 
the vectorizer.

Note I tried to come up with a patch suitable for backporting, since on x86 this 
is a regression in GCC 14, where we added a lot more reduc_scal patterns.

Richard 

> 
> Thanks,
> Richard
