On Fri, Jul 7, 2023 at 9:53 AM Richard Sandiford
<richard.sandif...@arm.com> wrote:
>
> Richard Biener <richard.guent...@gmail.com> writes:
> >> Am 06.07.2023 um 19:50 schrieb Richard Sandiford
> >> <richard.sandif...@arm.com>:
> >>
> >> Richard Biener via Gcc-patches <gcc-patches@gcc.gnu.org> writes:
> >>>> On Wed, Jul 5, 2023 at 8:44 AM Hao Liu OS via Gcc-patches
> >>>> <gcc-patches@gcc.gnu.org> wrote:
> >>>>
> >>>> Hi,
> >>>>
> >>>> If a loop is unrolled n times during vectorization, two steps are used
> >>>> to calculate the induction variable:
> >>>>   - The small step for the unrolled ith copy: vec_1 = vec_iv + (VF/n * Step)
> >>>>   - The large step for the whole loop: vec_loop = vec_iv + (VF * Step)
> >>>>
> >>>> This patch calculates an extra vec_n to replace vec_loop:
> >>>>   vec_n = vec_prev + (VF/n * S) = vec_iv + (VF/n * S) * n = vec_loop.
> >>>>
> >>>> This saves the large step register and the related operations.
> >>>
> >>> OK.  It would be nice to avoid the dead stmts created earlier though.
> >>
> >> FWIW, I still don't think we should do this.  Part of the point of
> >> unrolling is to shorten loop-carried dependencies, whereas this patch
> >> is going in the opposite direction.
> >
> > Note ncopies can be >1 without additional unrolling.
>
> Yeah, true.  But I think even there, avoiding a longer loop-carried
> dependency should be a good thing.
>
> > With non-VLA vectors all of the updates will be constant folded btw.
>
> Are you sure?  The motivating example is an Advanced SIMD one,
> not a VLA one.  No variable-length vectors are involved.
>
> Maybe constant folding caps the dependency chain to length 2?
> But 2 is still more than 1. :)
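
To make the arithmetic concrete, here is a minimal scalar model of the
two formulas from the patch description (an illustration only, not part
of the thread; VF = 8, n = 2 and the numeric values are assumptions):

/* Scalar model of the induction updates, assuming VF = 8 and an
   unroll factor n = 2, so the small step covers VF/n = 4 elements.  */
#include <assert.h>

int
main (void)
{
  float step = 0.25f;
  float vec_iv = 1.0f;               /* IV value entering the iteration.  */

  /* Small steps: each unrolled copy advances by VF/n * step.  */
  float vec_1 = vec_iv + 4 * step;   /* first copy  */
  float vec_n = vec_1 + 4 * step;    /* vec_prev + (VF/n * step)  */

  /* Large step: one update by VF * step for the next iteration.  */
  float vec_loop = vec_iv + 8 * step;

  /* The n small steps compose to the large step, so vec_n can feed the
     loop PHI and the separate vec_loop computation becomes dead.  */
  assert (vec_n == vec_loop);
  return 0;
}

With these power-of-two values the float arithmetic is exact; in general
the two forms only agree under reassociation, which is why the testcase
below uses -Ofast.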
The

/* (A +- CST1) +- CST2 -> A + CST3
   Use view_convert because it is safe for vectors and equivalent for
   scalars.  */
(for outer_op (plus minus)
 (for inner_op (plus minus)
      neg_inner_op (minus plus)

pattern should apply here, for example during forwprop.  It handles
vector constants just fine, so I wonder why it doesn't trigger.

Richard.
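
For reference, the kind of folding that pattern performs can be seen
with GCC's generic vector extension (my example, not from the thread;
whether it fires for the induction updates in the PR is exactly the
open question):

/* (A + CST1) + CST2 -> A + CST3, with vector constants.  For floats
   the reassociation needs -ffast-math (or -fassociative-math); try
   e.g. "gcc -O2 -ffast-math -fdump-tree-forwprop" and look for a
   single addition of { 12, ... } in the dump.  */
typedef float v4sf __attribute__ ((vector_size (16)));

v4sf
fold_me (v4sf a)
{
  v4sf t = a + (v4sf) { 4.0f, 4.0f, 4.0f, 4.0f };  /* A + CST1  */
  return t + (v4sf) { 8.0f, 8.0f, 8.0f, 8.0f };    /* (A + CST1) + CST2  */
}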
> Thanks,
> Richard
>
> >
> > Richard
> >
> >> Richard
> >>
> >>>
> >>> Thanks,
> >>> Richard.
> >>>
> >>>> gcc/ChangeLog:
> >>>>
> >>>>         PR tree-optimization/110449
> >>>>         * tree-vect-loop.cc (vectorizable_induction): Use vec_n to replace
> >>>>         vec_loop for the unrolled loop.
> >>>>
> >>>> gcc/testsuite/ChangeLog:
> >>>>
> >>>>         * gcc.target/aarch64/pr110449.c: New testcase.
> >>>> ---
> >>>>  gcc/testsuite/gcc.target/aarch64/pr110449.c | 40 +++++++++++++++++++++
> >>>>  gcc/tree-vect-loop.cc                       | 21 +++++++++--
> >>>>  2 files changed, 58 insertions(+), 3 deletions(-)
> >>>>  create mode 100644 gcc/testsuite/gcc.target/aarch64/pr110449.c
> >>>>
> >>>> diff --git a/gcc/testsuite/gcc.target/aarch64/pr110449.c b/gcc/testsuite/gcc.target/aarch64/pr110449.c
> >>>> new file mode 100644
> >>>> index 00000000000..bb3b6dcfe08
> >>>> --- /dev/null
> >>>> +++ b/gcc/testsuite/gcc.target/aarch64/pr110449.c
> >>>> @@ -0,0 +1,40 @@
> >>>> +/* { dg-do compile } */
> >>>> +/* { dg-options "-Ofast -mcpu=neoverse-n2 --param aarch64-vect-unroll-limit=2" } */
> >>>> +/* { dg-final { scan-assembler-not "8.0e\\+0" } } */
> >>>> +
> >>>> +/* Calculate the vectorized induction with a smaller step for an unrolled loop.
> >>>> +
> >>>> +   before (suggested_unroll_factor=2):
> >>>> +     fmov    s30, 8.0e+0
> >>>> +     fmov    s31, 4.0e+0
> >>>> +     dup     v27.4s, v30.s[0]
> >>>> +     dup     v28.4s, v31.s[0]
> >>>> +   .L6:
> >>>> +     mov     v30.16b, v31.16b
> >>>> +     fadd    v31.4s, v31.4s, v27.4s
> >>>> +     fadd    v29.4s, v30.4s, v28.4s
> >>>> +     stp     q30, q29, [x0]
> >>>> +     add     x0, x0, 32
> >>>> +     cmp     x1, x0
> >>>> +     bne     .L6
> >>>> +
> >>>> +   after:
> >>>> +     fmov    s31, 4.0e+0
> >>>> +     dup     v29.4s, v31.s[0]
> >>>> +   .L6:
> >>>> +     fadd    v30.4s, v31.4s, v29.4s
> >>>> +     stp     q31, q30, [x0]
> >>>> +     add     x0, x0, 32
> >>>> +     fadd    v31.4s, v29.4s, v30.4s
> >>>> +     cmp     x0, x1
> >>>> +     bne     .L6  */
> >>>> +
> >>>> +void
> >>>> +foo2 (float *arr, float freq, float step)
> >>>> +{
> >>>> +  for (int i = 0; i < 1024; i++)
> >>>> +    {
> >>>> +      arr[i] = freq;
> >>>> +      freq += step;
> >>>> +    }
> >>>> +}
> >>>> diff --git a/gcc/tree-vect-loop.cc b/gcc/tree-vect-loop.cc
> >>>> index 3b46c58a8d8..706ecbffd0c 100644
> >>>> --- a/gcc/tree-vect-loop.cc
> >>>> +++ b/gcc/tree-vect-loop.cc
> >>>> @@ -10114,7 +10114,7 @@ vectorizable_induction (loop_vec_info loop_vinfo,
> >>>>                                       new_vec, step_vectype, NULL);
> >>>>
> >>>>    vec_def = induc_def;
> >>>> -  for (i = 1; i < ncopies; i++)
> >>>> +  for (i = 1; i < ncopies + 1; i++)
> >>>>      {
> >>>>        /* vec_i = vec_prev + vec_step  */
> >>>>        gimple_seq stmts = NULL;
> >>>> @@ -10124,8 +10124,23 @@ vectorizable_induction (loop_vec_info loop_vinfo,
> >>>>        vec_def = gimple_convert (&stmts, vectype, vec_def);
> >>>>
> >>>>        gsi_insert_seq_before (&si, stmts, GSI_SAME_STMT);
> >>>> -      new_stmt = SSA_NAME_DEF_STMT (vec_def);
> >>>> -      STMT_VINFO_VEC_STMTS (stmt_info).safe_push (new_stmt);
> >>>> +      if (i < ncopies)
> >>>> +       {
> >>>> +         new_stmt = SSA_NAME_DEF_STMT (vec_def);
> >>>> +         STMT_VINFO_VEC_STMTS (stmt_info).safe_push (new_stmt);
> >>>> +       }
> >>>> +      else
> >>>> +       {
> >>>> +         /* vec_1 = vec_iv + (VF/n * S)
> >>>> +            vec_2 = vec_1 + (VF/n * S)
> >>>> +            ...
> >>>> +            vec_n = vec_prev + (VF/n * S) = vec_iv + VF * S = vec_loop
> >>>> +
> >>>> +            vec_n is used as vec_loop to save the large step register and
> >>>> +            related operations.  */
> >>>> +         add_phi_arg (induction_phi, vec_def, loop_latch_edge (iv_loop),
> >>>> +                      UNKNOWN_LOCATION);
> >>>> +       }
> >>>>      }
> >>>>  }
> >>>>
> >>>> --
> >>>> 2.34.1
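
As a rough picture of what the patch changes, here is a C model of the
"after" loop from the testcase comment (a sketch only, not part of the
patch; foo2_model and small_step are made-up names, and small_step
stands for VF/n * step):

typedef float v4sf __attribute__ ((vector_size (16)));

/* Model of the transformed loop: the IV is advanced by chaining two
   small steps instead of adding a separate large step.  */
void
foo2_model (v4sf *out, v4sf iv, v4sf small_step)
{
  for (int i = 0; i < 1024 / 8; i++)
    {
      v4sf v1 = iv + small_step;  /* vec_1 = vec_iv + (VF/n * S)  */
      out[2 * i] = iv;
      out[2 * i + 1] = v1;
      iv = v1 + small_step;       /* vec_n, the value on the latch edge  */
    }
}

The 8.0 large-step constant is gone, but iv now depends on a chain of
two fadds per iteration, which is the dependency-length concern raised
above.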