On Tue, 3 Mar 2026, Torbjorn SVENSSON wrote:
>
>
> On 2026-03-03 13:19, Richard Biener wrote:
> > On Tue, 3 Mar 2026, Torbjorn SVENSSON wrote:
> >
> >> Hi Richard,
> >>
> >> This patch causes a regression for arm-none-eabi (the log snippet is from
> >> a more recent build of GCC):
> >>
> >> Testing complex/fast-math-complex-add-pattern-half-float.c
> >> doing compile
> >> Executing on host: /build/r16-7849-g1f9879e17466f5/bin/arm-none-eabi-gcc
> >> /build/gcc_src/gcc/testsuite/gcc.dg/vect/complex/fast-math-complex-add-pattern-half-float.c
> >> -mthumb -march=armv6s-m -mtune=cortex-m0 -mfloat-abi=soft -mfpu=auto
> >> -fdiagnostics-plain-output -mfloat-abi=softfp -mcpu=unset
> >> -march=armv7-a+simd -mfpu=auto -ffast-math -ftree-vectorize
> >> -fno-tree-loop-distribute-patterns -fno-vect-cost-model -fno-common -O2
> >> -fdump-tree-vect-details -ffast-math -mfloat-abi=softfp -mfpu=auto
> >> -mcpu=unset
> >> -march=armv8.3-a+fp16+simd -S -o
> >> fast-math-complex-add-pattern-half-float.s (timeout = 800)
> >> spawn -ignore SIGHUP /build/r16-7849-g1f9879e17466f5/bin/arm-none-eabi-gcc
> >> /build/gcc_src/gcc/testsuite/gcc.dg/vect/complex/fast-math-complex-add-pattern-half-float.c
> >> -mthumb -march=armv6s-m -mtune=cortex-m0 -mfloat-abi=soft -mfpu=auto
> >> -fdiagnostics-plain-output -mfloat-abi=softfp -mcpu=unset
> >> -march=armv7-a+simd
> >> -mfpu=auto -ffast-math -ftree-vectorize -fno-tree-loop-distribute-patterns
> >> -fno-vect-cost-model -fno-common -O2 -fdump-tree-vect-details -ffast-math
> >> -mfloat-abi=softfp -mfpu=auto -mcpu=unset -march=armv8.3-a+fp16+simd -S -o
> >> fast-math-complex-add-pattern-half-float.s
> >> pid is 2955587 -2955587
> >> pid is -1
> >> output is status 0
> >> PASS: gcc.dg/vect/complex/fast-math-complex-add-pattern-half-float.c (test
> >> for excess errors)
> >> gcc.dg/vect/complex/fast-math-complex-add-pattern-half-float.c: pattern
> >> found 4 times
> >> FAIL: gcc.dg/vect/complex/fast-math-complex-add-pattern-half-float.c
> >> scan-tree-dump-times vect "add new stmt: [^\n\r]*COMPLEX_ADD_ROT90" 3
> >> PASS: gcc.dg/vect/complex/fast-math-complex-add-pattern-half-float.c
> >> scan-tree-dump-times vect "add new stmt: [^\n\r]*COMPLEX_ADD_ROT270" 1
> >> PASS: gcc.dg/vect/complex/fast-math-complex-add-pattern-half-float.c
> >> scan-tree-dump vect "Found COMPLEX_ADD_ROT270"
> >> PASS: gcc.dg/vect/complex/fast-math-complex-add-pattern-half-float.c
> >> scan-tree-dump vect "Found COMPLEX_ADD_ROT90"
> >>
> >> Should the test be updated to accept 4 pattern matches, or is it wrong to
> >> have 4 matches instead of 3?
> >
> > We see epilogue vectorization here. The best way forward is probably
> > to add --param vect-epilogues-nomask=0 to dg-additional-options in that
> > testcase. I verified this fixes the issue on aarch64 with
> > -march=armv8.3-a.
> >
> > Can you test arm-none-eabi? arm and aarch64 are the only targets
> > enabled by vect_complex_add_half.
>
> Adding "--param vect-epilogues-nomask=0" to dg-additional-options works fine
> for the targets that I test on arm-none-eabi.
>
> Do you want me to send a patch with this or will you handle it?
I'll push it.
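
Roughly the following, assuming the directive is simply added next to the
testcase's existing options in the same way as the slp-28.c hunk quoted
below (the exact placement within
fast-math-complex-add-pattern-half-float.c is an assumption):

```c
/* { dg-additional-options "--param vect-epilogues-nomask=0" } */
```
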
Richard.
> Kind regards,
> Torbjörn
>
> >
> > Richard.
> >
> >> Kind regards,
> >> Torbjörn
> >>
> >> On 2026-01-14 12:52, Richard Biener wrote:
> >>> The following adjusts the condition where we reject vectorization
> >>> because the scalar loop runs only for a single iteration (or two,
> >>> in case we need to peel for gaps). The check is over-eager
> >>> for the case of VF == 1, where the cost model should instead
> >>> decide whether vectorizing is worthwhile. I'm playing
> >>> conservative here and exclude the case of two iterations as I
> >>> do not have benchmark evidence.
> >>>
> >>> This helps fix a regression observed with improved SLP handling,
> >>> though not exactly for the options used in the PR; with the more
> >>> common -O3 -march=x86-64-v3 it speeds up 433.milc by 6%.
> >>>
> >>> Bootstrapped and tested on x86_64-unknown-linux-gnu, will push later.
> >>>
> >>> PR tree-optimization/123190
> >>> * tree-vect-loop.cc (vect_analyze_loop_costing): Allow
> >>> vectorizing loops with a single scalar iteration iff the
> >>> vectorization factor is 1.
> >>>
> >>> * gcc.dg/vect/costmodel/x86_64/costmodel-pr123190-1.c: New testcase.
> >>> * gcc.dg/vect/slp-28.c: Avoid epilogue vectorization for
> >>> simplicity.
> >>> ---
> >>> .../costmodel/x86_64/costmodel-pr123190-1.c | 38 +++++++++++++++++++
> >>> gcc/testsuite/gcc.dg/vect/slp-28.c | 1 +
> >>> gcc/tree-vect-loop.cc | 8 +++-
> >>> 3 files changed, 45 insertions(+), 2 deletions(-)
> >>> create mode 100644
> >>> gcc/testsuite/gcc.dg/vect/costmodel/x86_64/costmodel-pr123190-1.c
> >>>
> >>> diff --git
> >>> a/gcc/testsuite/gcc.dg/vect/costmodel/x86_64/costmodel-pr123190-1.c
> >>> b/gcc/testsuite/gcc.dg/vect/costmodel/x86_64/costmodel-pr123190-1.c
> >>> new file mode 100644
> >>> index 00000000000..4265ac80a43
> >>> --- /dev/null
> >>> +++ b/gcc/testsuite/gcc.dg/vect/costmodel/x86_64/costmodel-pr123190-1.c
> >>> @@ -0,0 +1,38 @@
> >>> +/* { dg-do compile } */
> >>> +/* { dg-additional-options "-O3 -mavx2 -mno-avx512f -mtune=generic" } */
> >>> +
> >>> +typedef struct {
> >>> + double real;
> >>> + double imag;
> >>> +} complex;
> >>> +
> >>> +typedef struct { complex e[3][3]; } su3_matrix;
> >>> +
> >>> +void mult_su3_na( su3_matrix *a, su3_matrix *b, su3_matrix *c ){
> >>> +int i,j;
> >>> +register double t,ar,ai,br,bi,cr,ci;
> >>> + for(i=0;i<3;i++)
> >>> + for(j=0;j<3;j++){
> >>> +
> >>> + ar=a->e[i][0].real; ai=a->e[i][0].imag;
> >>> + br=b->e[j][0].real; bi=b->e[j][0].imag;
> >>> + cr=ar*br; t=ai*bi; cr += t;
> >>> + ci=ai*br; t=ar*bi; ci -= t;
> >>> +
> >>> + ar=a->e[i][1].real; ai=a->e[i][1].imag;
> >>> + br=b->e[j][1].real; bi=b->e[j][1].imag;
> >>> + t=ar*br; cr += t; t=ai*bi; cr += t;
> >>> + t=ar*bi; ci -= t; t=ai*br; ci += t;
> >>> +
> >>> + ar=a->e[i][2].real; ai=a->e[i][2].imag;
> >>> + br=b->e[j][2].real; bi=b->e[j][2].imag;
> >>> + t=ar*br; cr += t; t=ai*bi; cr += t;
> >>> + t=ar*bi; ci -= t; t=ai*br; ci += t;
> >>> +
> >>> + c->e[i][j].real=cr;
> >>> + c->e[i][j].imag=ci;
> >>> + }
> >>> +}
> >>> +
> >>> +/* { dg-final { scan-tree-dump "optimized: loop vectorized using 32" "vect" } } */
> >>> +/* { dg-final { scan-tree-dump "optimized: epilogue loop vectorized using 16 byte vectors and unroll factor 1" "vect" } } */
> >>> diff --git a/gcc/testsuite/gcc.dg/vect/slp-28.c
> >>> b/gcc/testsuite/gcc.dg/vect/slp-28.c
> >>> index 1f987874f0d..bf6271eed25 100644
> >>> --- a/gcc/testsuite/gcc.dg/vect/slp-28.c
> >>> +++ b/gcc/testsuite/gcc.dg/vect/slp-28.c
> >>> @@ -1,4 +1,5 @@
> >>> /* { dg-require-effective-target vect_int } */
> >>> +/* { dg-additional-options "--param vect-epilogues-nomask=0" } */
> >>>
> >>> #include <stdarg.h>
> >>> #include "tree-vect.h"
> >>> diff --git a/gcc/tree-vect-loop.cc b/gcc/tree-vect-loop.cc
> >>> index 74eecb832e6..fdf544fa47b 100644
> >>> --- a/gcc/tree-vect-loop.cc
> >>> +++ b/gcc/tree-vect-loop.cc
> >>> @@ -1792,9 +1792,13 @@ vect_analyze_loop_costing (loop_vec_info
> >>> loop_vinfo,
> >>> }
> >>> }
> >>> /* Reject vectorizing for a single scalar iteration, even if
> >>> - we could in principle implement that using partial vectors. */
> >>> + we could in principle implement that using partial vectors.
> >>> + But allow such vectorization if VF == 1 in case we do not
> >>> + need to peel for gaps (if we need, avoid vectorization for
> >>> + reasons of code footprint). */
> >>> unsigned peeling_gap = LOOP_VINFO_PEELING_FOR_GAPS (loop_vinfo);
> >>> - if (scalar_niters <= peeling_gap + 1)
> >>> + if (scalar_niters <= peeling_gap + 1
> >>> + && (assumed_vf > 1 || peeling_gap != 0))
> >>> {
> >>> if (dump_enabled_p ())
> >>> dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
> >>
> >>
> >>
> >
>
>
>
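
For anyone following the thread, the adjusted rejection condition from the
tree-vect-loop.cc hunk above can be modelled in isolation. This is only a
minimal sketch, not the code in vect_analyze_loop_costing: the function
name and the parameter types are made up for illustration, and only the
boolean expression is taken from the patch.

```c
#include <stdbool.h>

/* Hypothetical stand-alone model of the check in the patch.
   Returns true when vectorization would be rejected because the
   scalar loop runs too few iterations.  */
static bool
reject_for_few_iterations (unsigned long long scalar_niters,
                           unsigned peeling_gap,
                           unsigned assumed_vf)
{
  /* Before the patch only the first operand was tested.  The second
     operand lets a single-iteration loop through when VF == 1 and no
     peeling for gaps is required.  */
  return scalar_niters <= peeling_gap + 1
         && (assumed_vf > 1 || peeling_gap != 0);
}
```

With this model, a single scalar iteration with VF == 1 and no peeling for
gaps is no longer rejected, while the same loop with VF == 4, or with
peeling for gaps and two iterations, still is.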
--
Richard Biener <[email protected]>
SUSE Software Solutions Germany GmbH,
Frankenstrasse 146, 90461 Nuernberg, Germany;
GF: Jochen Jaser, Andrew McDonald, Werner Knoblich; (HRB 36809, AG Nuernberg)