On Wed, 27 Apr 2022, Andre Vieira (lists) wrote:

> 
> On 27/04/2022 07:35, Richard Biener wrote:
> > On Tue, 26 Apr 2022, Richard Sandiford wrote:
> > 
> >> "Andre Vieira (lists)" <andre.simoesdiasvie...@arm.com> writes:
> >>> Hi,
> >>>
> >>> This patch disables epilogue vectorization when we are peeling for
> >>> alignment in the prologue and we can't guarantee the main vectorized
> >>> loop is entered.  This is to prevent executing vectorized code with an
> >>> unaligned access if the target has indicated it wants to peel for
> >>> alignment.  We take this conservative approach as we currently do not
> >>> distinguish between peeling for alignment for correctness or for
> >>> performance.
> >>>
> >>> A better codegen would be to make it skip to the scalar epilogue in case
> >>> the main loop isn't entered when alignment peeling is required.  However,
> >>> that would require a more aggressive change to the codebase which we
> >>> chose to avoid at this point of development.  We can revisit this option
> >>> during stage 1 if we choose to.
> >>>
> >>> Bootstrapped on aarch64-none-linux and regression tested on
> >>> aarch64-none-elf.
> >>>
> >>> gcc/ChangeLog:
> >>>
> >>>         PR tree-optimization/105219
> >>>         * tree-vect-loop.cc (vect_epilogue_when_peeling_for_alignment): New
> >>>         function.
> >>>         (vect_analyze_loop): Use vect_epilogue_when_peeling_for_alignment
> >>>         to determine whether to vectorize epilogue.
> >>>         * testsuite/gcc.target/aarch64/pr105219.c: New.
> >>>         * testsuite/gcc.target/aarch64/pr105219-2.c: New.
> >>>         * testsuite/gcc.target/aarch64/pr105219-3.c: New.
> >>>
> >>> diff --git a/gcc/testsuite/gcc.target/aarch64/pr105219-2.c b/gcc/testsuite/gcc.target/aarch64/pr105219-2.c
> >>> new file mode 100644
> >>> index 0000000000000000000000000000000000000000..c97d1dc100181b77af0766e08407e1e352f604fe
> >>> --- /dev/null
> >>> +++ b/gcc/testsuite/gcc.target/aarch64/pr105219-2.c
> >>> @@ -0,0 +1,29 @@
> >>> +/* { dg-do run } */
> >>> +/* { dg-options "-O3 -march=armv8.2-a -mtune=thunderx -fno-vect-cost-model" } */
> >>> +/* { dg-skip-if "incompatible options" { *-*-* } { "-march=*" } { "-march=armv8.2-a" } } */
> >>> +/* { dg-skip-if "incompatible options" { *-*-* } { "-mtune=*" } { "-mtune=thunderx" } } */
> >>> +/* { dg-skip-if "incompatible options" { *-*-* } { "-mcpu=*" } } */
> >>
> >> I think this should be in gcc.dg/vect, with the options forced
> >> for { target aarch64 }.
> >>
> >> Are the skips necessary?  It looks like the test should work correctly
> >> for all options/targets.
> >>
> >>> +/* PR 105219.  */
> >>> +int data[128];
> >>> +
> >>> +void __attribute((noipa))
> >>> +foo (int *data, int n)
> >>> +{
> >>> +  for (int i = 0; i < n; ++i)
> >>> +    data[i] = i;
> >>> +}
> >>> +
> >>> +int main()
> >>> +{
> >>> +  for (int start = 0; start < 16; ++start)
> >>> +    for (int n = 1; n < 3*16; ++n)
> >>> +      {
> >>> +        __builtin_memset (data, 0, sizeof (data));
> >>> +        foo (&data[start], n);
> >>> +        for (int j = 0; j < n; ++j)
> >>> +          if (data[start + j] != j)
> >>> +            __builtin_abort ();
> >>> +      }
> >>> +  return 0;
> >>> +}
> >>> +
> >>> diff --git a/gcc/testsuite/gcc.target/aarch64/pr105219-3.c b/gcc/testsuite/gcc.target/aarch64/pr105219-3.c
> >>> new file mode 100644
> >>> index 0000000000000000000000000000000000000000..444352fc051b787369f6f1be6236d1ff0fc2d392
> >>> --- /dev/null
> >>> +++ b/gcc/testsuite/gcc.target/aarch64/pr105219-3.c
> >>> @@ -0,0 +1,15 @@
> >>> +/* { dg-do compile } */
> >>> +/* { dg-skip-if "incompatible options" { *-*-* } { "-march=*" } { "-march=armv8.2-a" } } */
> >>> +/* { dg-skip-if "incompatible options" { *-*-* } { "-mtune=*" } { "-mtune=thunderx" } } */
> >>> +/* { dg-skip-if "incompatible options" { *-*-* } { "-mcpu=*" } } */
> >>> +/* { dg-options "-O3 -march=armv8.2-a -mtune=thunderx -fno-vect-cost-model -fdump-tree-vect-all" } */
> >>> +/* PR 105219.  */
> >>> +int data[128];
> >>> +
> >>> +void foo (void)
> >>> +{
> >>> +  for (int i = 0; i < 9; ++i)
> >>> +    data[i + 1] = i;
> >>> +}
> >>> +
> >>> +/* { dg-final { scan-tree-dump "EPILOGUE VECTORIZED" "vect" } } */
> >>> diff --git a/gcc/testsuite/gcc.target/aarch64/pr105219.c b/gcc/testsuite/gcc.target/aarch64/pr105219.c
> >>> new file mode 100644
> >>> index 0000000000000000000000000000000000000000..bbdefb549f6a4e803852f69d20ce1ef9152a526c
> >>> --- /dev/null
> >>> +++ b/gcc/testsuite/gcc.target/aarch64/pr105219.c
> >>> @@ -0,0 +1,28 @@
> >>> +/* { dg-do run { target aarch64_sve128_hw } } */
> >>> +/* { dg-skip-if "incompatible options" { *-*-* } { "-march=*" } { "-march=armv8.2-a+sve" } } */
> >>> +/* { dg-skip-if "incompatible options" { *-*-* } { "-mtune=*" } { "-mtune=thunderx" } } */
> >>> +/* { dg-skip-if "incompatible options" { *-*-* } { "-mcpu=*" } } */
> >>> +/* { dg-skip-if "incompatible options" { *-*-* } { "-msve-vector-bits=*" } { "-msve-vector-bits=128" } } */
> >>> +/* { dg-options "-O3 -march=armv8.2-a+sve -msve-vector-bits=128 -mtune=thunderx" } */
> >>
> >> Same here.
> >>
> >>> +/* PR 105219.  */
> >>> +int a;
> >>> +char b[60];
> >>> +short c[18];
> >>> +short d[4][19];
> >>> +long long f;
> >>> +void e(int g, int h, short k[][19]) {
> >>> +  for (signed i = 0; i < 3; i += 2)
> >>> +    for (signed j = 1; j < h + 14; j++) {
> >>> +      b[i * 14 + j] = 1;
> >>> +      c[i + j] = k[2][j];
> >>> +      a = g ? k[i][j] : 0;
> >>> +    }
> >>> +}
> >>> +int main() {
> >>> +  e(9, 1, d);
> >>> +  for (long l = 0; l < 6; ++l)
> >>> +    for (long m = 0; m < 4; ++m)
> >>> +      f ^= b[l + m * 4];
> >>> +  if (f)
> >>> +    __builtin_abort ();
> >>> +}
> >>> diff --git a/gcc/tree-vect-loop.cc b/gcc/tree-vect-loop.cc
> >>> index d7bc34636bd52b2f67cdecd3dc16fcff684dba07..a23e6181dec8126bcb691ea9474095bf65483863 100644
> >>> --- a/gcc/tree-vect-loop.cc
> >>> +++ b/gcc/tree-vect-loop.cc
> >>> @@ -2942,6 +2942,38 @@ vect_analyze_loop_1 (class loop *loop, vec_info_shared *shared,
> >>>     return opt_loop_vec_info::success (loop_vinfo);
> >>>   }
> >>>   
> >>> +/* Function vect_epilogue_when_peeling_for_alignment
> >>> +
> >>> +   PR 105219: If we are peeling for alignment in the prologue then we do not
> >>> +   vectorize the epilogue unless we are certain we will enter the main
> >>> +   vectorized loop.  This is to prevent entering the vectorized epilogue in
> >>> +   case there aren't enough iterations to enter the main loop.
> >>> +*/
> >>> +
> >>> +static bool
> >>> +vect_epilogue_when_peeling_for_alignment (loop_vec_info loop_vinfo)
> >>> +{
> >>> +  if (vect_use_loop_mask_for_alignment_p (loop_vinfo))
> >>> +    return true;
> >>> +
> >>> +  int prologue_peeling = LOOP_VINFO_PEELING_FOR_ALIGNMENT (loop_vinfo);
> >>> +  if (prologue_peeling > 0 && LOOP_VINFO_NITERS_KNOWN_P (loop_vinfo))
> >>> +    {
> >>> +      poly_uint64 niters_for_main
> >>> +        = upper_bound (LOOP_VINFO_VECT_FACTOR (loop_vinfo),
> >>> +                       LOOP_VINFO_COST_MODEL_THRESHOLD (loop_vinfo));
> >>> +      niters_for_main
> >>> +        = upper_bound (niters_for_main,
> >>> +                       LOOP_VINFO_VERSIONING_THRESHOLD (loop_vinfo));
> >>> +      niters_for_main += prologue_peeling;
> >>> +      if (maybe_le (LOOP_VINFO_INT_NITERS (loop_vinfo), niters_for_main))
> >>> +        return false;
> >>> +    }
> >>> +  else if (prologue_peeling < 0)
> >>
> >> I was surprised that this tests < 0 rather than != 0.  If that's the
> >> right behaviour, could you add a comment saying why?  I would have
> >> expected:
> >>
> >>   prologue_peeling > 0 && !LOOP_VINFO_NITERS_KNOWN_P (loop_vinfo)
> >>
> >> to be handled more conservatively than:
> >>
> >>   prologue_peeling > 0 && LOOP_VINFO_NITERS_KNOWN_P (loop_vinfo)
> >>
> >> LGTM otherwise, but Richard B should have the final say.
> > 
> > It works for me.  Note it strictly prefers peeling for alignment over
> > epilogue vectorization (which is probably OK in most cases), but the
> > number of targets affected is quite small.
> > 
> > For GCC 13 we should see to make skipping to the scalar epilogue
> > possible.
> > 
> > Thanks,
> > Richard.
> > 
> >> Thanks,
> >> Richard
> >>
> >>> +    return false;
> >>> +  return true;
> >>> +}
> >>> +
> >>>   /* Function vect_analyze_loop.
> >>>   
> >>>      Apply a set of analyses on LOOP, and create a loop_vec_info struct
> >>> @@ -3151,7 +3183,8 @@ vect_analyze_loop (class loop *loop, vec_info_shared *shared)
> >>>             }
> >>>         }
> >>>        /* For now only allow one epilogue loop.  */
> >>> -      if (first_loop_vinfo->epilogue_vinfos.is_empty ())
> >>> +      if (first_loop_vinfo->epilogue_vinfos.is_empty ()
> >>> +          && vect_epilogue_when_peeling_for_alignment (first_loop_vinfo))
> >>>         {
> >>>           first_loop_vinfo->epilogue_vinfos.safe_push (loop_vinfo);
> >>>           poly_uint64 th = LOOP_VINFO_VERSIONING_THRESHOLD (loop_vinfo);
> 
> IIUC after IRC and PR discussions we are dropping this approach in favour of
> changing the loop_iteration tightening as we now believe it's fine to enter
> the vectorized epilogue without peeling?
Yes, I think we are conservative there; at least I don't see too big an
alignment used, and I saw the DRs being offset by a correct amount in
update_epilogue_loop_vinfo.  So unless we have a contrary indication we
should be safe here.

You can look at a -gimple suffixed dump where for example with -mavx2 I see

  __MEM <vector(8) int> ((int *)vectp_data.13_72) = vect_vec_iv_.12_67;

in the main loop (alignment according to the type) and

  __MEM <vector(4) int, 32> ((int *)vectp_data.21_112) = vect_vec_iv_.20_106;

in the epilogue (32-bit alignment).

Richard.
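
For anyone wanting to repeat that check locally, a minimal sketch follows.
It reuses the loop shape from the pr105219 testcases in the patch above;
the x86_64 command line and the exact dump-flag spelling are assumptions
about how to obtain the -gimple suffixed dump, not something prescribed in
the thread.

/* Sketch only: a loop like the pr105219 testcases, for inspecting the
   __MEM alignments quoted in the reply above.  Assumed command line:

     gcc -O3 -mavx2 -fno-vect-cost-model \
         -fdump-tree-optimized-gimple -c align-check.c

   In the resulting dump the main-loop store is expected to use the vector
   type's alignment (__MEM <vector(8) int> ...) while the vectorized
   epilogue store carries an explicit 32-bit alignment
   (__MEM <vector(4) int, 32> ...), as quoted above.  */

int data[128];

void __attribute__ ((noipa))
foo (int *p, int n)
{
  /* Runtime trip count, so the vectorizer can generate both a main
     vectorized loop and a vectorized epilogue.  */
  for (int i = 0; i < n; ++i)
    p[i] = i;
}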