Hi Richard,

This patch causes a regression for arm-none-eabi (the log snippet is
from a more recent build of GCC):

Testing complex/fast-math-complex-add-pattern-half-float.c
doing compile
Executing on host: /build/r16-7849-g1f9879e17466f5/bin/arm-none-eabi-gcc /build/gcc_src/gcc/testsuite/gcc.dg/vect/complex/fast-math-complex-add-pattern-half-float.c -mthumb -march=armv6s-m -mtune=cortex-m0 -mfloat-abi=soft -mfpu=auto -fdiagnostics-plain-output -mfloat-abi=softfp -mcpu=unset -march=armv7-a+simd -mfpu=auto -ffast-math -ftree-vectorize -fno-tree-loop-distribute-patterns -fno-vect-cost-model -fno-common -O2 -fdump-tree-vect-details -ffast-math -mfloat-abi=softfp -mfpu=auto -mcpu=unset -march=armv8.3-a+fp16+simd -S -o fast-math-complex-add-pattern-half-float.s (timeout = 800)
spawn -ignore SIGHUP /build/r16-7849-g1f9879e17466f5/bin/arm-none-eabi-gcc /build/gcc_src/gcc/testsuite/gcc.dg/vect/complex/fast-math-complex-add-pattern-half-float.c -mthumb -march=armv6s-m -mtune=cortex-m0 -mfloat-abi=soft -mfpu=auto -fdiagnostics-plain-output -mfloat-abi=softfp -mcpu=unset -march=armv7-a+simd -mfpu=auto -ffast-math -ftree-vectorize -fno-tree-loop-distribute-patterns -fno-vect-cost-model -fno-common -O2 -fdump-tree-vect-details -ffast-math -mfloat-abi=softfp -mfpu=auto -mcpu=unset -march=armv8.3-a+fp16+simd -S -o fast-math-complex-add-pattern-half-float.s
pid is 2955587 -2955587
pid is -1
output is status 0
PASS: gcc.dg/vect/complex/fast-math-complex-add-pattern-half-float.c (test for excess errors)
gcc.dg/vect/complex/fast-math-complex-add-pattern-half-float.c: pattern found 4 times
FAIL: gcc.dg/vect/complex/fast-math-complex-add-pattern-half-float.c scan-tree-dump-times vect "add new stmt: [^\n\r]*COMPLEX_ADD_ROT90" 3
PASS: gcc.dg/vect/complex/fast-math-complex-add-pattern-half-float.c scan-tree-dump-times vect "add new stmt: [^\n\r]*COMPLEX_ADD_ROT270" 1
PASS: gcc.dg/vect/complex/fast-math-complex-add-pattern-half-float.c scan-tree-dump vect "Found COMPLEX_ADD_ROT270"
PASS: gcc.dg/vect/complex/fast-math-complex-add-pattern-half-float.c scan-tree-dump vect "Found COMPLEX_ADD_ROT90"

Should the test be updated to accept 4 pattern matches, or is it wrong
to have 4 matches instead of 3?

Kind regards,
Torbjörn

On 2026-01-14 12:52, Richard Biener wrote:
The following adjusts the condition where we reject vectorization
because the scalar loop runs only for a single iteration (or two,
in case we need to peel for gaps). This is over-eager in the case
of VF == 1, where the cost model should instead decide whether it
is worthwhile or not. I'm being conservative here and excluding the
case of two iterations as I do not have benchmark evidence.

This helps fix a regression observed with improved SLP handling,
not exactly for the options used in the PR though, but for a more
common -O3 -march=x86-64-v3 this speeds up 433.milc by 6%.

Bootstrapped and tested on x86_64-unknown-linux-gnu, will push later.

	PR tree-optimization/123190
	* tree-vect-loop.cc (vect_analyze_loop_costing): Allow
	vectorizing loops with a single scalar iteration iff the
	vectorization factor is 1.

	* gcc.dg/vect/costmodel/x86_64/costmodel-pr123190-1.c: New testcase.
	* gcc.dg/vect/slp-28.c: Avoid epilogue vectorization for simplicity.
---
 .../costmodel/x86_64/costmodel-pr123190-1.c   | 38 +++++++++++++++++++
 gcc/testsuite/gcc.dg/vect/slp-28.c            |  1 +
 gcc/tree-vect-loop.cc                         |  8 +++-
 3 files changed, 45 insertions(+), 2 deletions(-)
 create mode 100644 gcc/testsuite/gcc.dg/vect/costmodel/x86_64/costmodel-pr123190-1.c

diff --git a/gcc/testsuite/gcc.dg/vect/costmodel/x86_64/costmodel-pr123190-1.c b/gcc/testsuite/gcc.dg/vect/costmodel/x86_64/costmodel-pr123190-1.c
new file mode 100644
index 00000000000..4265ac80a43
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/vect/costmodel/x86_64/costmodel-pr123190-1.c
@@ -0,0 +1,38 @@
+/* { dg-do compile } */
+/* { dg-additional-options "-O3 -mavx2 -mno-avx512f -mtune=generic" } */
+
+typedef struct {
+  double real;
+  double imag;
+} complex;
+
+typedef struct { complex e[3][3]; } su3_matrix;
+
+void mult_su3_na( su3_matrix *a, su3_matrix *b, su3_matrix *c ){
+int i,j;
+register double t,ar,ai,br,bi,cr,ci;
+    for(i=0;i<3;i++)
+    for(j=0;j<3;j++){
+
+        ar=a->e[i][0].real; ai=a->e[i][0].imag;
+        br=b->e[j][0].real; bi=b->e[j][0].imag;
+        cr=ar*br; t=ai*bi; cr += t;
+        ci=ai*br; t=ar*bi; ci -= t;
+
+        ar=a->e[i][1].real; ai=a->e[i][1].imag;
+        br=b->e[j][1].real; bi=b->e[j][1].imag;
+        t=ar*br; cr += t; t=ai*bi; cr += t;
+        t=ar*bi; ci -= t; t=ai*br; ci += t;
+
+        ar=a->e[i][2].real; ai=a->e[i][2].imag;
+        br=b->e[j][2].real; bi=b->e[j][2].imag;
+        t=ar*br; cr += t; t=ai*bi; cr += t;
+        t=ar*bi; ci -= t; t=ai*br; ci += t;
+
+        c->e[i][j].real=cr;
+        c->e[i][j].imag=ci;
+    }
+}
+
+/* { dg-final { scan-tree-dump "optimized: loop vectorized using 32" "vect" } } */
+/* { dg-final { scan-tree-dump "optimized: epilogue loop vectorized using 16 byte vectors and unroll factor 1" "vect" } } */
diff --git a/gcc/testsuite/gcc.dg/vect/slp-28.c b/gcc/testsuite/gcc.dg/vect/slp-28.c
index 1f987874f0d..bf6271eed25 100644
--- a/gcc/testsuite/gcc.dg/vect/slp-28.c
+++ b/gcc/testsuite/gcc.dg/vect/slp-28.c
@@ -1,4 +1,5 @@
 /* { dg-require-effective-target vect_int } */
+/* { dg-additional-options "--param vect-epilogues-nomask=0" } */
 
 #include <stdarg.h>
 #include "tree-vect.h"
diff --git a/gcc/tree-vect-loop.cc b/gcc/tree-vect-loop.cc
index 74eecb832e6..fdf544fa47b 100644
--- a/gcc/tree-vect-loop.cc
+++ b/gcc/tree-vect-loop.cc
@@ -1792,9 +1792,13 @@ vect_analyze_loop_costing (loop_vec_info loop_vinfo,
         }
     }
   /* Reject vectorizing for a single scalar iteration, even if
-     we could in principle implement that using partial vectors. */
+     we could in principle implement that using partial vectors.
+     But allow such vectorization if VF == 1 in case we do not
+     need to peel for gaps (if we need, avoid vectorization for
+     reasons of code footprint). */
   unsigned peeling_gap = LOOP_VINFO_PEELING_FOR_GAPS (loop_vinfo);
-  if (scalar_niters <= peeling_gap + 1)
+  if (scalar_niters <= peeling_gap + 1
+      && (assumed_vf > 1 || peeling_gap != 0))
     {
       if (dump_enabled_p ())
         dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
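
For reference, the change to the rejection condition in the
tree-vect-loop.cc hunk can be sketched as two standalone predicates.
This is only an illustration, not GCC code: the function names
rejects_before and rejects_after are hypothetical, and the plain
unsigned parameters stand in for scalar_niters, the value of
LOOP_VINFO_PEELING_FOR_GAPS (loop_vinfo), and assumed_vf.

```c
#include <stdbool.h>

/* Hypothetical stand-in for the old condition: reject whenever the
   scalar loop runs for at most peeling_gap + 1 iterations,
   regardless of the vectorization factor.  */
static bool
rejects_before (unsigned scalar_niters, unsigned peeling_gap,
                unsigned assumed_vf)
{
  (void) assumed_vf; /* The old check did not look at the VF.  */
  return scalar_niters <= peeling_gap + 1;
}

/* Hypothetical stand-in for the new condition: same iteration-count
   test, but only reject if VF > 1 or peeling for gaps is needed.
   With VF == 1 and no peeling, the decision is left to the cost
   model.  */
static bool
rejects_after (unsigned scalar_niters, unsigned peeling_gap,
               unsigned assumed_vf)
{
  return scalar_niters <= peeling_gap + 1
         && (assumed_vf > 1 || peeling_gap != 0);
}
```

Under this sketch, the only inputs where the two predicates differ are
those with a single scalar iteration, no peeling for gaps, and VF == 1:
rejects_before (1, 0, 1) is true while rejects_after (1, 0, 1) is
false. The two-iteration case with peeling for gaps is still rejected.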
