Tamar Christina <tamar.christ...@arm.com> writes:
> Hi All,
>
> Consider low overhead loops like:
>
> void
> foo (char *restrict a, int *restrict b, int *restrict c, int n)
> {
>   for (int i = 0; i < 9; i++)
>     {
>       int res = c[i];
>       int t = b[i];
>       if (a[i] != 0)
>         res = t;
>       c[i] = res;
>     }
> }
Eek.

> For such loops we use latency-only costing since the loop bounds are known and
> small.
>
> The current costing however does not consider the case where niters < VF.
>
> So when comparing the scalar vs vector costs it doesn't keep in mind that the
> scalar code can't perform VF iterations.  This makes it overestimate the cost
> for the scalar loop and we incorrectly vectorize.

I don't think that in itself is the reason though.  niters < VF isn't
particularly exceptional for SVE.  The vector code is then costed as
performing 1 VF-element iteration plus the prologue (i.e. one full vector
iteration), whereas the scalar code is costed as doing niters scalar
iterations plus the prologue.  The lower the niters, the more we should
favour scalar code (a rough worked version of this comparison, using the
default-tuning numbers below, is sketched after the quoted patch at the
end of this message).

I suspect instead it's a combination of:

- The latencies for the vector costs are too optimistic, especially given
  all the predicate overhead in the loop above, and especially for default
  tuning.  For default tuning I see:

    Vector inside of loop cost: 20
    Vector prologue cost: 6
    Vector epilogue cost: 0
    Scalar iteration cost: 4
    Scalar outside cost: 0
    Vector outside cost: 6
    prologue iterations: 0
    epilogue iterations: 0
    Minimum number of vector iterations: 1
    Calculated minimum iters for profitability: 7

  and grep -c '\(z[0-9]\|p[0-9]*\.\)' shows there are indeed 26 SVE
  instructions.  But assuming a 1-cycle latency for every vector operation
  seems too idealistic (especially for loads).  It might be ok for
  "generic" (the "architectural intent" option), but not for the new
  architecture-level default tuning targets.

  I realise that isn't the whole story though, since -mcpu=neoverse-v2
  also vectorises.

- Having a threshold for latency costs based on niters is too simplistic,
  since the size of the loop matters too.  I think the intention was to
  capture cases where loops are so small that they are in practice always
  issued with surrounding code, so that the throughput of the loop itself
  is kind of irrelevant.  But this loop is big enough that throughput
  makes sense.

So my gut instinct is that we should instead tweak the condition for using
latency costs, but I'll need to think about it more when I get back from
holiday.

Another thing about the expansion is that we generate 4 .S instructions
for the int elements, even though it should be possible for later gimple
passes to fold the last one away at compile time.  That would lessen the
horror to some extent, but maybe not enough to make it an actual win.
Specifically:

    uqdecw  w0, all, mul #3                 // this is always 0
    whilelo p7.s, wzr, w0                   // this is always a pfalse
    ld1w    z28.s, p7/z, [x1, #3, mul vl]   // this always loads zero
    and     p6.b, p15/z, p7.b, p7.b         // this is always a pfalse
    st1w    z28.s, p6, [x2, #3, mul vl]     // this is a no-op

One thing I remember considering in the past, but apparently never
actually did, was rejecting any vectorisation in which some vector
instructions are provably redundant (as above).  But perhaps in the end
it seemed that we should just be able to optimise them away later instead.

Thanks,
Richard

> This patch takes the minimum of the VF and niters in such cases.
> Before the patch we generate:
>
>   note:  Original vector body cost = 46
>   note:  Vector loop iterates at most 1 times
>   note:  Scalar issue estimate:
>   note:    load operations = 2
>   note:    store operations = 1
>   note:    general operations = 1
>   note:    reduction latency = 0
>   note:    estimated min cycles per iteration = 1.000000
>   note:    estimated cycles per vector iteration (for VF 32) = 32.000000
>   note:  SVE issue estimate:
>   note:    load operations = 5
>   note:    store operations = 4
>   note:    general operations = 11
>   note:    predicate operations = 12
>   note:    reduction latency = 0
>   note:    estimated min cycles per iteration without predication = 5.500000
>   note:    estimated min cycles per iteration for predication = 12.000000
>   note:    estimated min cycles per iteration = 12.000000
>   note:  Low iteration count, so using pure latency costs
>   note:  Cost model analysis:
>
> vs after:
>
>   note:  Original vector body cost = 46
>   note:  Known loop bounds, capping VF to 9 for analysis
>   note:  Vector loop iterates at most 1 times
>   note:  Scalar issue estimate:
>   note:    load operations = 2
>   note:    store operations = 1
>   note:    general operations = 1
>   note:    reduction latency = 0
>   note:    estimated min cycles per iteration = 1.000000
>   note:    estimated cycles per vector iteration (for VF 9) = 9.000000
>   note:  SVE issue estimate:
>   note:    load operations = 5
>   note:    store operations = 4
>   note:    general operations = 11
>   note:    predicate operations = 12
>   note:    reduction latency = 0
>   note:    estimated min cycles per iteration without predication = 5.500000
>   note:    estimated min cycles per iteration for predication = 12.000000
>   note:    estimated min cycles per iteration = 12.000000
>   note:  Increasing body cost to 1472 because the scalar code could issue within the limit imposed by predicate operations
>   note:  Low iteration count, so using pure latency costs
>   note:  Cost model analysis:
>
> Bootstrapped Regtested on aarch64-none-linux-gnu and no issues, and fixes the
> upcoming vectorization regression on exchange2 in SPECCPU 2017.
>
> Ok for master?
>
> Thanks,
> Tamar
>
> gcc/ChangeLog:
>
> 	* config/aarch64/aarch64.cc (adjust_body_cost):
>
> gcc/testsuite/ChangeLog:
>
> 	* gcc.target/aarch64/sve/asrdiv_4.c: Update bounds.
> 	* gcc.target/aarch64/sve/cond_asrd_2.c: Likewise.
> 	* gcc.target/aarch64/sve/cond_uxt_6.c: Likewise.
> 	* gcc.target/aarch64/sve/cond_uxt_7.c: Likewise.
> 	* gcc.target/aarch64/sve/cond_uxt_8.c: Likewise.
> 	* gcc.target/aarch64/sve/miniloop_1.c: Likewise.
> 	* gcc.target/aarch64/sve/spill_6.c: Likewise.
> 	* gcc.target/aarch64/sve/sve_iters_low_1.c: New test.
> 	* gcc.target/aarch64/sve/sve_iters_low_2.c: New test.
>
> ---
>
> diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc
> index 6ccf08d1cc0a1aecfc72f95b105ace2c00b1a51d..afb58fd88795a26064c8c74f337324e3ecebc389 100644
> --- a/gcc/config/aarch64/aarch64.cc
> +++ b/gcc/config/aarch64/aarch64.cc
> @@ -17565,6 +17565,20 @@ adjust_body_cost (loop_vec_info loop_vinfo,
>      dump_printf_loc (MSG_NOTE, vect_location,
> 		     "Original vector body cost = %d\n", body_cost);
>  
> +  /* If the iteration count is known and low we use latency only calculation,
> +     however if the iteration count is lower than VF then the estimate for the
> +     scalar loops will be too high.  Cap it at NITERS.  */
> +  if (LOOP_VINFO_NITERS_KNOWN_P (loop_vinfo))
> +    {
> +      auto niters = LOOP_VINFO_INT_NITERS (loop_vinfo);
> +      if (niters < estimated_vf && dump_enabled_p ())
> +	dump_printf_loc (MSG_NOTE, vect_location,
> +			 "Scalar loop iterates at most %wd times.  Capping VF "
> +			 " from %d to %wd\n", niters, estimated_vf, niters);
> +
> +      estimated_vf = MIN (estimated_vf, niters);
> +    }
> +
>    fractional_cost scalar_cycles_per_iter
>      = scalar_ops.min_cycles_per_iter () * estimated_vf;
>  
> diff --git a/gcc/testsuite/gcc.target/aarch64/sve/asrdiv_4.c b/gcc/testsuite/gcc.target/aarch64/sve/asrdiv_4.c
> index 6684fe1c12441399faac941bdca79d90de1eb611..10a96a894afd690cf9eb521ae5195f6669e3e2aa 100644
> --- a/gcc/testsuite/gcc.target/aarch64/sve/asrdiv_4.c
> +++ b/gcc/testsuite/gcc.target/aarch64/sve/asrdiv_4.c
> @@ -15,12 +15,12 @@
>  }
>  
>  #define TEST_ALL(T) \
> -  T (int16_t, int8_t, 7) \
> -  T (int32_t, int8_t, 3) \
> -  T (int32_t, int16_t, 3) \
> -  T (int64_t, int8_t, 5) \
> -  T (int64_t, int16_t, 5) \
> -  T (int64_t, int32_t, 5)
> +  T (int16_t, int8_t, 70) \
> +  T (int32_t, int8_t, 30) \
> +  T (int32_t, int16_t, 30) \
> +  T (int64_t, int8_t, 50) \
> +  T (int64_t, int16_t, 50) \
> +  T (int64_t, int32_t, 50)
>  
>  TEST_ALL (DEF_LOOP)
>  
> diff --git a/gcc/testsuite/gcc.target/aarch64/sve/cond_asrd_2.c b/gcc/testsuite/gcc.target/aarch64/sve/cond_asrd_2.c
> index e4040ee3520c8dd50282e4aeaf4930c7f66c929c..db1721efbc7bd68f0954f1c76eb15ed75dc91cfb 100644
> --- a/gcc/testsuite/gcc.target/aarch64/sve/cond_asrd_2.c
> +++ b/gcc/testsuite/gcc.target/aarch64/sve/cond_asrd_2.c
> @@ -14,12 +14,12 @@
>  }
>  
>  #define TEST_ALL(T) \
> -  T (int16_t, int8_t, 7) \
> -  T (int32_t, int8_t, 3) \
> -  T (int32_t, int16_t, 3) \
> -  T (int64_t, int8_t, 5) \
> -  T (int64_t, int16_t, 5) \
> -  T (int64_t, int32_t, 5)
> +  T (int16_t, int8_t, 70) \
> +  T (int32_t, int8_t, 30) \
> +  T (int32_t, int16_t, 30) \
> +  T (int64_t, int8_t, 50) \
> +  T (int64_t, int16_t, 50) \
> +  T (int64_t, int32_t, 50)
>  
>  TEST_ALL (DEF_LOOP)
>  
> diff --git a/gcc/testsuite/gcc.target/aarch64/sve/cond_uxt_6.c b/gcc/testsuite/gcc.target/aarch64/sve/cond_uxt_6.c
> index e47276a3a352bd61d2c508167344a2e60e8bc84c..b8b3e862d0a1968ddf9f51278e6b3813e16a7665 100644
> --- a/gcc/testsuite/gcc.target/aarch64/sve/cond_uxt_6.c
> +++ b/gcc/testsuite/gcc.target/aarch64/sve/cond_uxt_6.c
> @@ -14,11 +14,11 @@
>  }
>  
>  #define TEST_ALL(T) \
> -  T (int32_t, uint16_t, 0xff, 3) \
> +  T (int32_t, uint16_t, 0xff, 30) \
>    \
> -  T (int64_t, uint16_t, 0xff, 5) \
> -  T (int64_t, uint32_t, 0xff, 5) \
> -  T (int64_t, uint32_t, 0xffff, 5)
> +  T (int64_t, uint16_t, 0xff, 50) \
> +  T (int64_t, uint32_t, 0xff, 50) \
> +  T (int64_t, uint32_t, 0xffff, 50)
>  
>  TEST_ALL (DEF_LOOP)
>  
> diff --git a/gcc/testsuite/gcc.target/aarch64/sve/cond_uxt_7.c b/gcc/testsuite/gcc.target/aarch64/sve/cond_uxt_7.c
> index f49915c4ac1430b0260589156d98b0793c78999f..2d02fb70f33fcf9a76c73b03c23f5a2d33e01b92 100644
> --- a/gcc/testsuite/gcc.target/aarch64/sve/cond_uxt_7.c
> +++ b/gcc/testsuite/gcc.target/aarch64/sve/cond_uxt_7.c
> @@ -14,11 +14,11 @@
>  }
>  
>  #define TEST_ALL(T) \
> -  T (int32_t, uint16_t, 0xff, 3) \
> +  T (int32_t, uint16_t, 0xff, 30) \
>    \
> -  T (int64_t, uint16_t, 0xff, 5) \
> -  T (int64_t, uint32_t, 0xff, 5) \
> -  T (int64_t, uint32_t, 0xffff, 5)
> +  T (int64_t, uint16_t, 0xff, 50) \
> +  T (int64_t, uint32_t, 0xff, 50) \
> +  T (int64_t, uint32_t, 0xffff, 50)
>  
>  TEST_ALL (DEF_LOOP)
>  
> diff --git a/gcc/testsuite/gcc.target/aarch64/sve/cond_uxt_8.c b/gcc/testsuite/gcc.target/aarch64/sve/cond_uxt_8.c
> index 42eb4b2661b3f152763ac2c8382877e5116dedd4..8fe2455687b5a310179c26c2c6d7c70b33101612 100644
> --- a/gcc/testsuite/gcc.target/aarch64/sve/cond_uxt_8.c
> +++ b/gcc/testsuite/gcc.target/aarch64/sve/cond_uxt_8.c
> @@ -14,11 +14,11 @@
>  }
>  
>  #define TEST_ALL(T) \
> -  T (int32_t, uint16_t, 0xff, 3) \
> +  T (int32_t, uint16_t, 0xff, 30) \
>    \
> -  T (int64_t, uint16_t, 0xff, 5) \
> -  T (int64_t, uint32_t, 0xff, 5) \
> -  T (int64_t, uint32_t, 0xffff, 5)
> +  T (int64_t, uint16_t, 0xff, 50) \
> +  T (int64_t, uint32_t, 0xff, 50) \
> +  T (int64_t, uint32_t, 0xffff, 50)
>  
>  TEST_ALL (DEF_LOOP)
>  
> diff --git a/gcc/testsuite/gcc.target/aarch64/sve/miniloop_1.c b/gcc/testsuite/gcc.target/aarch64/sve/miniloop_1.c
> index 09eb4146816cc51af5829f15b6c287aca086c382..cd1fd2b8a078a3a6aa9741aea0d10a7a420e137c 100644
> --- a/gcc/testsuite/gcc.target/aarch64/sve/miniloop_1.c
> +++ b/gcc/testsuite/gcc.target/aarch64/sve/miniloop_1.c
> @@ -6,7 +6,7 @@ void loop (int * __restrict__ a, int * __restrict__ b, int * __restrict__ c,
>  	   int * __restrict__ g, int * __restrict__ h)
>  {
>    int i = 0;
> -  for (i = 0; i < 3; i++)
> +  for (i = 0; i < 30; i++)
>      {
>        a[i] += i;
>        b[i] += i;
> diff --git a/gcc/testsuite/gcc.target/aarch64/sve/spill_6.c b/gcc/testsuite/gcc.target/aarch64/sve/spill_6.c
> index ae9c338f5696a9f743696b58f3d8e1dd991de501..2ff969ced00955403eaa1aea7223e209fdb6f139 100644
> --- a/gcc/testsuite/gcc.target/aarch64/sve/spill_6.c
> +++ b/gcc/testsuite/gcc.target/aarch64/sve/spill_6.c
> @@ -11,20 +11,20 @@ void consumer (void *);
>    { \
>      if (which) \
>        { \
> -	for (int i = 0; i < 7; ++i) \
> +	for (int i = 0; i < 70; ++i) \
>  	  x1[i] += VAL; \
>  	consumer (x1); \
> -	for (int i = 0; i < 7; ++i) \
> +	for (int i = 0; i < 70; ++i) \
>  	  x2[i] -= VAL; \
>  	consumer (x2); \
>        } \
>      else \
>        { \
> -	for (int i = 0; i < 7; ++i) \
> +	for (int i = 0; i < 70; ++i) \
>  	  x3[i] &= VAL; \
>  	consumer (x3); \
>        } \
> -    for (int i = 0; i < 7; ++i) \
> +    for (int i = 0; i < 70; ++i) \
>        x4[i] |= VAL; \
>      consumer (x4); \
>    }
> diff --git a/gcc/testsuite/gcc.target/aarch64/sve/sve_iters_low_1.c b/gcc/testsuite/gcc.target/aarch64/sve/sve_iters_low_1.c
> new file mode 100644
> index 0000000000000000000000000000000000000000..952a4b1cd580c4eaed4c7c470cd1efba8e5e931d
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/aarch64/sve/sve_iters_low_1.c
> @@ -0,0 +1,17 @@
> +/* { dg-do compile } */
> +/* { dg-additional-options "-march=armv9-a -Ofast -fdump-tree-vect-details" } */
> +
> +void
> +foo (char *restrict a, int *restrict b, int *restrict c, int n)
> +{
> +  for (int i = 0; i < 9; i++)
> +    {
> +      int res = c[i];
> +      int t = b[i];
> +      if (a[i] != 0)
> +	res = t;
> +      c[i] = res;
> +    }
> +}
> +
> +/* { dg-final { scan-tree-dump "LOOP VECTORIZED" "vect" } } */
> diff --git a/gcc/testsuite/gcc.target/aarch64/sve/sve_iters_low_2.c b/gcc/testsuite/gcc.target/aarch64/sve/sve_iters_low_2.c
> new file mode 100644
> index 0000000000000000000000000000000000000000..02d10de2a6215d38728a84123d038c2db605b5a0
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/aarch64/sve/sve_iters_low_2.c
> @@ -0,0 +1,20 @@
> +/* { dg-do compile } */
> +/* { dg-additional-options "-march=armv9-a -Ofast -fdump-tree-vect-details" } */
> +
> +void
> +foo (char *restrict a, int *restrict b, int *restrict c, int n, int stride)
> +{
> +  if (stride <= 1)
> +    return;
> +
> +  for (int i = 0; i < 9; i++)
> +    {
> +      int res = c[i];
> +      int t = b[i*stride];
> +      if (a[i] != 0)
> +	res = t;
> +      c[i] = res;
> +    }
> +}
> +
> +/* { dg-final { scan-tree-dump-not "LOOP VECTORIZED" "vect" } } */
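
For reference, here is a back-of-the-envelope version of the two comparisons
discussed above, using the constants from the default-tuning dump quoted
earlier.  This is only an illustrative sketch, not the vectoriser's actual
cost-model code, and the real calculation of the profitability threshold is
more involved than the one-liner below:

    /* Rough sketch of the cost comparisons, using the numbers from the
       default-tuning dump above.  Illustration only, not GCC code.  */
    #include <stdio.h>

    int
    main (void)
    {
      int scalar_iter_cost = 4;    /* Scalar iteration cost */
      int vec_inside_cost = 20;    /* Vector inside of loop cost */
      int vec_outside_cost = 6;    /* Vector outside (prologue) cost */
      int niters = 9;              /* known trip count of the loop */
      int estimated_vf = 32;       /* estimated scalar iters per vector iter */

      /* Throughput-style comparison: one predicated vector iteration covers
         all nine scalar iterations.  Gives scalar 36 vs vector 26 cycles,
         so the vector loop looks cheaper at niters = 9.  */
      int scalar_cost = scalar_iter_cost * niters;
      int vector_cost = vec_inside_cost * 1 + vec_outside_cost;
      printf ("scalar %d vs vector %d\n", scalar_cost, vector_cost);

      /* Smallest niters for which the vector loop looks cheaper under this
         simplification; consistent with the "Calculated minimum iters for
         profitability: 7" line in the dump.  */
      int min_profitable = (vec_inside_cost + vec_outside_cost
			    + scalar_iter_cost - 1) / scalar_iter_cost;
      printf ("min profitable niters ~= %d\n", min_profitable);

      /* Latency-style comparison used for low trip counts: the scalar cycles
         per iteration are scaled by the estimated VF.  Capping the VF at
         niters (the patch) lowers the scalar side from 32 to 9 cycles,
         against the SVE estimate of 12 cycles per iteration.  */
      double scalar_cycles_per_iter = 1.0;
      double sve_cycles_per_iter = 12.0;
      printf ("uncapped: scalar %.1f vs SVE %.1f\n",
	      scalar_cycles_per_iter * estimated_vf, sve_cycles_per_iter);
      printf ("capped:   scalar %.1f vs SVE %.1f\n",
	      scalar_cycles_per_iter * (niters < estimated_vf
					? niters : estimated_vf),
	      sve_cycles_per_iter);
      return 0;
    }

Run as-is, it shows the throughput comparison favouring the vector loop at
niters = 9, while after capping the VF at 9 the scalar side (9 cycles) comes
in below the SVE estimate (12 cycles), which appears to be what lets the
second dump increase the vector body cost.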