On Fri, Jul 7, 2023 at 9:53 AM Richard Sandiford
<richard.sandif...@arm.com> wrote:
>
> Richard Biener <richard.guent...@gmail.com> writes:
> >> Am 06.07.2023 um 19:50 schrieb Richard Sandiford
> >> <richard.sandif...@arm.com>:
> >>
> >> Richard Biener via Gcc-patches <gcc-patches@gcc.gnu.org> writes:
> >>>> On Wed, Jul 5, 2023 at 8:44 AM Hao Liu OS via Gcc-patches
> >>>> <gcc-patches@gcc.gnu.org> wrote:
> >>>>
> >>>> Hi,
> >>>>
> >>>> If a loop is unrolled n times during vectorization, two steps are used
> >>>> to calculate the induction variable:
> >>>>   - The small step for the unrolled ith copy: vec_1 = vec_iv + (VF/n * Step)
> >>>>   - The large step for the whole loop: vec_loop = vec_iv + (VF * Step)
> >>>>
> >>>> This patch calculates an extra vec_n to replace vec_loop:
> >>>>   vec_n = vec_prev + (VF/n * S) = vec_iv + (VF/n * S) * n = vec_loop.
> >>>>
> >>>> This saves the large step register and the related operations.
> >>>
> >>> OK.  It would be nice to avoid the dead stmts created earlier though.
> >>
> >> FWIW, I still don't think we should do this.  Part of the point of
> >> unrolling is to shorten loop-carried dependencies, whereas this patch
> >> is going in the opposite direction.
> >
> > Note ncopies can be >1 without additional unrolling.
>
> Yeah, true.  But I think even there, avoiding a longer loop-carried
> dependency should be a good thing.
>
> > With non-VLA vectors all of the updates will be constant folded btw.
>
> Are you sure?  The motivating example is an Advanced SIMD one,
> not a VLA one.  No variable-length vectors are involved.
>
> Maybe constant folding caps the dependency chain to length 2?
> But 2 is still more than 1. :)
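
To make the arithmetic concrete, here is a minimal scalar model of the
two formulas from the patch description (an illustration only, not part
of the thread; VF = 8, n = 2 and the numeric values are assumptions):

/* Scalar model of the induction updates, assuming VF = 8 and an
   unroll factor n = 2, so the small step covers VF/n = 4 elements.  */
#include <assert.h>

int
main (void)
{
  float step = 0.25f;
  float vec_iv = 1.0f;               /* IV value entering the iteration.  */

  /* Small steps: each unrolled copy advances by VF/n * step.  */
  float vec_1 = vec_iv + 4 * step;   /* first copy  */
  float vec_n = vec_1 + 4 * step;    /* vec_prev + (VF/n * step)  */

  /* Large step: one update by VF * step for the next iteration.  */
  float vec_loop = vec_iv + 8 * step;

  /* The n small steps compose to the large step, so vec_n can feed the
     loop PHI and the separate vec_loop computation becomes dead.  */
  assert (vec_n == vec_loop);
  return 0;
}

With these power-of-two values the float arithmetic is exact; in general
the two forms only agree under reassociation, which is why the testcase
below uses -Ofast.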
The

/* (A +- CST1) +- CST2 -> A + CST3
   Use view_convert because it is safe for vectors and equivalent for
   scalars.  */
(for outer_op (plus minus)
 (for inner_op (plus minus)
      neg_inner_op (minus plus)

pattern should apply here, for example during forwprop.  It handles
vector constants just fine, so I wonder why it doesn't trigger.

Richard.
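
For reference, the kind of folding that pattern performs can be seen
with GCC's generic vector extension (my example, not from the thread;
whether it fires for the induction updates in the PR is exactly the
open question):

/* (A + CST1) + CST2 -> A + CST3, with vector constants.  For floats
   the reassociation needs -ffast-math (or -fassociative-math); try
   e.g. "gcc -O2 -ffast-math -fdump-tree-forwprop" and look for a
   single addition of { 12, ... } in the dump.  */
typedef float v4sf __attribute__ ((vector_size (16)));

v4sf
fold_me (v4sf a)
{
  v4sf t = a + (v4sf) { 4.0f, 4.0f, 4.0f, 4.0f };  /* A + CST1  */
  return t + (v4sf) { 8.0f, 8.0f, 8.0f, 8.0f };    /* (A + CST1) + CST2  */
}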
> Thanks,
> Richard
>
> >
> > Richard
> >
> >> Richard
> >>
> >>>
> >>> Thanks,
> >>> Richard.
> >>>
> >>>> gcc/ChangeLog:
> >>>>
> >>>>         PR tree-optimization/110449
> >>>>         * tree-vect-loop.cc (vectorizable_induction): Use vec_n to replace
> >>>>         vec_loop for the unrolled loop.
> >>>>
> >>>> gcc/testsuite/ChangeLog:
> >>>>
> >>>>         * gcc.target/aarch64/pr110449.c: New testcase.
> >>>> ---
> >>>>  gcc/testsuite/gcc.target/aarch64/pr110449.c | 40 +++++++++++++++++++++
> >>>>  gcc/tree-vect-loop.cc                       | 21 +++++++++--
> >>>>  2 files changed, 58 insertions(+), 3 deletions(-)
> >>>>  create mode 100644 gcc/testsuite/gcc.target/aarch64/pr110449.c
> >>>>
> >>>> diff --git a/gcc/testsuite/gcc.target/aarch64/pr110449.c b/gcc/testsuite/gcc.target/aarch64/pr110449.c
> >>>> new file mode 100644
> >>>> index 00000000000..bb3b6dcfe08
> >>>> --- /dev/null
> >>>> +++ b/gcc/testsuite/gcc.target/aarch64/pr110449.c
> >>>> @@ -0,0 +1,40 @@
> >>>> +/* { dg-do compile } */
> >>>> +/* { dg-options "-Ofast -mcpu=neoverse-n2 --param aarch64-vect-unroll-limit=2" } */
> >>>> +/* { dg-final { scan-assembler-not "8.0e\\+0" } } */
> >>>> +
> >>>> +/* Calculate the vectorized induction with a smaller step for an unrolled loop.
> >>>> +
> >>>> +   before (suggested_unroll_factor=2):
> >>>> +     fmov    s30, 8.0e+0
> >>>> +     fmov    s31, 4.0e+0
> >>>> +     dup     v27.4s, v30.s[0]
> >>>> +     dup     v28.4s, v31.s[0]
> >>>> +   .L6:
> >>>> +     mov     v30.16b, v31.16b
> >>>> +     fadd    v31.4s, v31.4s, v27.4s
> >>>> +     fadd    v29.4s, v30.4s, v28.4s
> >>>> +     stp     q30, q29, [x0]
> >>>> +     add     x0, x0, 32
> >>>> +     cmp     x1, x0
> >>>> +     bne     .L6
> >>>> +
> >>>> +   after:
> >>>> +     fmov    s31, 4.0e+0
> >>>> +     dup     v29.4s, v31.s[0]
> >>>> +   .L6:
> >>>> +     fadd    v30.4s, v31.4s, v29.4s
> >>>> +     stp     q31, q30, [x0]
> >>>> +     add     x0, x0, 32
> >>>> +     fadd    v31.4s, v29.4s, v30.4s
> >>>> +     cmp     x0, x1
> >>>> +     bne     .L6  */
> >>>> +
> >>>> +void
> >>>> +foo2 (float *arr, float freq, float step)
> >>>> +{
> >>>> +  for (int i = 0; i < 1024; i++)
> >>>> +    {
> >>>> +      arr[i] = freq;
> >>>> +      freq += step;
> >>>> +    }
> >>>> +}
> >>>> diff --git a/gcc/tree-vect-loop.cc b/gcc/tree-vect-loop.cc
> >>>> index 3b46c58a8d8..706ecbffd0c 100644
> >>>> --- a/gcc/tree-vect-loop.cc
> >>>> +++ b/gcc/tree-vect-loop.cc
> >>>> @@ -10114,7 +10114,7 @@ vectorizable_induction (loop_vec_info loop_vinfo,
> >>>>                                       new_vec, step_vectype, NULL);
> >>>>
> >>>>    vec_def = induc_def;
> >>>> -  for (i = 1; i < ncopies; i++)
> >>>> +  for (i = 1; i < ncopies + 1; i++)
> >>>>      {
> >>>>        /* vec_i = vec_prev + vec_step  */
> >>>>        gimple_seq stmts = NULL;
> >>>> @@ -10124,8 +10124,23 @@ vectorizable_induction (loop_vec_info loop_vinfo,
> >>>>        vec_def = gimple_convert (&stmts, vectype, vec_def);
> >>>>
> >>>>        gsi_insert_seq_before (&si, stmts, GSI_SAME_STMT);
> >>>> -      new_stmt = SSA_NAME_DEF_STMT (vec_def);
> >>>> -      STMT_VINFO_VEC_STMTS (stmt_info).safe_push (new_stmt);
> >>>> +      if (i < ncopies)
> >>>> +       {
> >>>> +         new_stmt = SSA_NAME_DEF_STMT (vec_def);
> >>>> +         STMT_VINFO_VEC_STMTS (stmt_info).safe_push (new_stmt);
> >>>> +       }
> >>>> +      else
> >>>> +       {
> >>>> +         /* vec_1 = vec_iv + (VF/n * S)
> >>>> +            vec_2 = vec_1 + (VF/n * S)
> >>>> +            ...
> >>>> +            vec_n = vec_prev + (VF/n * S) = vec_iv + VF * S = vec_loop
> >>>> +
> >>>> +            vec_n is used as vec_loop to save the large step register and
> >>>> +            related operations.  */
> >>>> +         add_phi_arg (induction_phi, vec_def, loop_latch_edge (iv_loop),
> >>>> +                      UNKNOWN_LOCATION);
> >>>> +       }
> >>>>      }
> >>>>  }
> >>>>
> >>>> --
> >>>> 2.34.1
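
As a rough picture of what the patch changes, here is a C model of the
"after" loop from the testcase comment (a sketch only, not part of the
patch; foo2_model and small_step are made-up names, and small_step
stands for VF/n * step):

typedef float v4sf __attribute__ ((vector_size (16)));

/* Model of the transformed loop: the IV is advanced by chaining two
   small steps instead of adding a separate large step.  */
void
foo2_model (v4sf *out, v4sf iv, v4sf small_step)
{
  for (int i = 0; i < 1024 / 8; i++)
    {
      v4sf v1 = iv + small_step;  /* vec_1 = vec_iv + (VF/n * S)  */
      out[2 * i] = iv;
      out[2 * i + 1] = v1;
      iv = v1 + small_step;       /* vec_n, the value on the latch edge  */
    }
}

The 8.0 large-step constant is gone, but iv now depends on a chain of
two fadds per iteration, which is the dependency-length concern raised
above.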