Richard Biener <richard.guent...@gmail.com> writes:
>> On 08.08.2024 at 15:12, Richard Sandiford <richard.sandif...@arm.com> wrote:
>>>    PR tree-optimization/116274
>>>    * tree-vect-slp.cc (vect_bb_slp_scalar_cost): Cost scalar loads
>>>    and stores as simple scalar stmts when they access a non-global,
>>>    not address-taken variable that doesn't have BLKmode assigned.
>>> 
>>>    * gcc.target/i386/pr116274.c: New testcase.
>>> ---
>>> gcc/testsuite/gcc.target/i386/pr116274.c |  9 +++++++++
>>> gcc/tree-vect-slp.cc                     | 12 +++++++++++-
>>> 2 files changed, 20 insertions(+), 1 deletion(-)
>>> create mode 100644 gcc/testsuite/gcc.target/i386/pr116274.c
>>> 
>>> diff --git a/gcc/testsuite/gcc.target/i386/pr116274.c b/gcc/testsuite/gcc.target/i386/pr116274.c
>>> new file mode 100644
>>> index 00000000000..d5811344b93
>>> --- /dev/null
>>> +++ b/gcc/testsuite/gcc.target/i386/pr116274.c
>>> @@ -0,0 +1,9 @@
>>> +/* { dg-do compile } */
>>> +/* { dg-options "-O2 -fdump-tree-slp2-optimized" } */
>>> +
>>> +struct a { long x,y; };
>>> +long test(struct a a) { return a.x+a.y; }
>>> +
>>> +/* { dg-final { scan-tree-dump-not "basic block part vectorized" "slp2" } } */
>>> +/* { dg-final { scan-assembler-times "addl|leaq" 1 } } */
>>> +/* { dg-final { scan-assembler-not "padd" } } */
>>> diff --git a/gcc/tree-vect-slp.cc b/gcc/tree-vect-slp.cc
>>> index 3464d0c0e23..e43ff721100 100644
>>> --- a/gcc/tree-vect-slp.cc
>>> +++ b/gcc/tree-vect-slp.cc
>>> @@ -7807,7 +7807,17 @@ next_lane:
>>>       vect_cost_for_stmt kind;
>>>       if (STMT_VINFO_DATA_REF (orig_stmt_info))
>>>    {
>>> -      if (DR_IS_READ (STMT_VINFO_DATA_REF (orig_stmt_info)))
>>> +      data_reference_p dr = STMT_VINFO_DATA_REF (orig_stmt_info);
>>> +      tree base = get_base_address (DR_REF (dr));
>>> +      /* When the scalar access is to a non-global not address-taken
>>> +         decl that is not BLKmode assume we can access it with a single
>>> +         non-load/store instruction.  */
>>> +      if (DECL_P (base)
>>> +          && !is_global_var (base)
>>> +          && !TREE_ADDRESSABLE (base)
>>> +          && DECL_MODE (base) != BLKmode)
>>> +        kind = scalar_stmt;
>>> +      else if (DR_IS_READ (STMT_VINFO_DATA_REF (orig_stmt_info)))
>>>        kind = scalar_load;
>>>      else
>>>        kind = scalar_store;
>> 
>> LGTM FWIW, but did you consider skipping the cost altogether?
>> I'm not sure what the scalar_stmt would correspond to in practice,
>> if we assume that the ABI (for parameters/returns) or RA (for locals)
>> puts the data in a sensible register class for the datatype.
>
> On x86_64 you get up to two eightbytes in two GPRs or float regs, so with
> an example with four ints we’d get a scalar shift for the second and fourth
> int, and with eight shorts you get to the point where the vector marshaling
> might be profitable.  So it’s a heuristic that says it’s likely not zero
> cost but definitely not as high as a load.  Anything better would need to
> know the actual register passing.
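
For concreteness, I take the four-int case to be something like the
sketch below (illustrative code only, not from the patch):

  struct b { int f[4]; };
  int sum4 (struct b b) { return b.f[0] + b.f[1] + b.f[2] + b.f[3]; }

where b.f[1] and b.f[3] sit in the upper half of their eightbyte, so
extracting them needs a shift even though no load is involved.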

Ah, yeah, fair enough.  I suppose that would be true for aarch64 too
on things like:

  struct a { char f[4]; };
  struct a test(struct a a) {
    a.f[0] += 1;
    a.f[1] += 2;
    a.f[2] += 3;
    a.f[3] += 4;
    return a;
  }

It's just that there are important cases where that marshaling wouldn't
happen for floats on aarch64, and the scalar_stmt cost for floats is
typically 2.
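
E.g. something like this (an illustrative sketch, not a test from the
patch):

  struct f2 { float x, y; };
  float addf (struct f2 f) { return f.x + f.y; }

where on aarch64 f.x and f.y already arrive in separate FP registers,
so there's no extraction cost at all, yet (IIUC) each access would
still be costed as a scalar_stmt.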

But like you say, that could be fixed later, and this should be a
strict improvement over the status quo.

Richard
