GCC 12 enables vectorization at -O2 with the very cheap cost model, which is restricted to loops with a constant trip count. Its vectorization capability is deliberately very limited, out of consideration for the code-size impact.
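For context, the kind of loop the existing very cheap cost model already handles has a compile-time-constant trip count. A minimal sketch (foo0 is just an illustrative name and 1024 an arbitrary constant):

    void foo0 (int* __restrict a, int* b, int* c)
    {
        for (int i = 0; i != 1024; i++)
          a[i] = b[i] + c[i];
    }

GCC 12 vectorizes this at plain -O2; with a variable bound n instead, the loop stays scalar, which is the case the patch below targets.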
The patch extends the very cheap cost model a little to also support variable trip counts, but still disables peeling for gaps/alignment, runtime alias checking, and epilogue vectorization, again out of consideration for code size. So there are at most two versions of a loop under O2 vectorization: one vectorized main loop and one scalar/remainder loop. E.g.

    void foo1 (int* __restrict a, int* b, int* c, int n)
    {
        for (int i = 0; i != n; i++)
          a[i] = b[i] + c[i];
    }

with -O2 -march=x86-64-v3 will be vectorized to

    .L10:
            vmovdqu (%r8,%rax), %ymm0
            vpaddd  (%rsi,%rax), %ymm0, %ymm0
            vmovdqu %ymm0, (%rdi,%rax)
            addq    $32, %rax
            cmpq    %rdx, %rax
            jne     .L10
            movl    %ecx, %eax
            andl    $-8, %eax
            cmpl    %eax, %ecx
            je      .L21
            vzeroupper
    .L12:
            movl    (%r8,%rax,4), %edx
            addl    (%rsi,%rax,4), %edx
            movl    %edx, (%rdi,%rax,4)
            addq    $1, %rax
            cmpl    %eax, %ecx
            jne     .L12

As measured with SPEC2017 on EMR, the patch (N-Iter) improves performance by 4.11% with an extra 2.80% code size, while the cheap cost model improves performance by 5.74% with an extra 8.88% code size. The details are below.

Performance, measured with -march=x86-64-v3 -O2 on EMR:

                        N-Iter    cheap cost model
    500.perlbench_r     -0.12%    -0.12%
    502.gcc_r            0.44%    -0.11%
    505.mcf_r            0.17%     4.46%
    520.omnetpp_r        0.28%    -0.27%
    523.xalancbmk_r      0.00%     5.93%
    525.x264_r          -0.09%    23.53%
    531.deepsjeng_r      0.19%     0.00%
    541.leela_r          0.22%     0.00%
    548.exchange2_r    -11.54%   -22.34%
    557.xz_r             0.74%     0.49%
    GEOMEAN INT         -1.04%     0.60%
    503.bwaves_r         3.13%     4.72%
    507.cactuBSSN_r      1.17%     0.29%
    508.namd_r           0.39%     6.87%
    510.parest_r         3.14%     8.52%
    511.povray_r         0.10%    -0.20%
    519.lbm_r           -0.68%    10.14%
    521.wrf_r           68.20%    76.73%
    526.blender_r        0.12%     0.12%
    527.cam4_r          19.67%    23.21%
    538.imagick_r        0.12%     0.24%
    544.nab_r            0.63%     0.53%
    549.fotonik3d_r     14.44%     9.43%
    554.roms_r          12.39%     0.00%
    GEOMEAN FP           8.26%     9.41%
    GEOMEAN ALL          4.11%     5.74%

Code size impact:

                        N-Iter    cheap cost model
    500.perlbench_r      0.22%     1.03%
    502.gcc_r            0.25%     0.60%
    505.mcf_r            0.00%    32.07%
    520.omnetpp_r        0.09%     0.31%
    523.xalancbmk_r      0.08%     1.86%
    525.x264_r           0.75%     7.96%
    531.deepsjeng_r      0.72%     3.28%
    541.leela_r          0.18%     0.75%
    548.exchange2_r      8.29%    12.19%
    557.xz_r             0.40%     0.60%
    GEOMEAN INT          1.07%     5.71%
    503.bwaves_r        12.89%    21.59%
    507.cactuBSSN_r      0.90%    20.19%
    508.namd_r           0.77%    14.75%
    510.parest_r         0.91%     3.91%
    511.povray_r         0.45%     4.08%
    519.lbm_r            0.00%     0.00%
    521.wrf_r            5.97%    12.79%
    526.blender_r        0.49%     3.84%
    527.cam4_r           1.39%     3.28%
    538.imagick_r        1.86%     7.78%
    544.nab_r            0.41%     3.00%
    549.fotonik3d_r     25.50%    47.47%
    554.roms_r           5.17%    13.01%
    GEOMEAN FP           4.14%    11.38%
    GEOMEAN ALL          2.80%     8.88%

The only regression is in 548.exchange2_r: vectorizing the inner loop at each level of the 9-level loop nest increases register pressure and causes more spills.

    - block(rnext:9, 1, i1) = block(rnext:9, 1, i1) + 10
    -   block(rnext:9, 2, i2) = block(rnext:9, 2, i2) + 10
          .....
    -     block(rnext:9, 9, i9) = block(rnext:9, 9, i9) + 10
          ...
    -   block(rnext:9, 2, i2) = block(rnext:9, 2, i2) + 10
    - block(rnext:9, 1, i1) = block(rnext:9, 1, i1) + 10

It looks like aarch64 doesn't have this issue because aarch64 has 32 GPRs, while x86 only has 16. I have an extra patch for the x86 backend that prevents loop vectorization in deeply nested loops, which brings the performance back.

For 503.bwaves_r/505.mcf_r/507.cactuBSSN_r/508.namd_r, the cheap cost model increases code size a lot but doesn't improve performance at all; N-Iter is much better there in terms of code size.

Any comments?

gcc/ChangeLog:

	* tree-vect-loop.cc (vect_analyze_loop_costing): Enable
	vectorization for LOOP_VINFO_PEELING_FOR_NITER in very cheap
	cost model.
	(vect_analyze_loop): Disable epilogue vectorization in very
	cheap cost model.
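One note on the second hunk below: the loop_cost_model (loop) > VECT_COST_MODEL_VERY_CHEAP check relies on the ordering of enum vect_cost_model. As I read gcc/flag-types.h (worth double-checking against the tree this patch is based on), VECT_COST_MODEL_VERY_CHEAP is the smallest value, so the comparison re-enables epilogue vectorization for every model except very cheap:

    /* From gcc/flag-types.h, as I read it -- verify against the current
       tree.  VERY_CHEAP sorts below all other cost models, so the patch's
       ">" comparison excludes exactly the very cheap model.  */
    enum vect_cost_model {
      VECT_COST_MODEL_VERY_CHEAP = -3,
      VECT_COST_MODEL_CHEAP = -2,
      VECT_COST_MODEL_DYNAMIC = -1,
      VECT_COST_MODEL_UNLIMITED = 0,
      VECT_COST_MODEL_DEFAULT = 1
    };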
---
 gcc/tree-vect-loop.cc | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/gcc/tree-vect-loop.cc b/gcc/tree-vect-loop.cc
index 242d5e2d916..06afd8cae79 100644
--- a/gcc/tree-vect-loop.cc
+++ b/gcc/tree-vect-loop.cc
@@ -2356,8 +2356,7 @@ vect_analyze_loop_costing (loop_vec_info loop_vinfo,
      a copy of the scalar code (even if we might be able to vectorize it).  */
   if (loop_cost_model (loop) == VECT_COST_MODEL_VERY_CHEAP
       && (LOOP_VINFO_PEELING_FOR_ALIGNMENT (loop_vinfo)
-	  || LOOP_VINFO_PEELING_FOR_GAPS (loop_vinfo)
-	  || LOOP_VINFO_PEELING_FOR_NITER (loop_vinfo)))
+	  || LOOP_VINFO_PEELING_FOR_GAPS (loop_vinfo)))
     {
       if (dump_enabled_p ())
	dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
@@ -3638,7 +3637,8 @@ vect_analyze_loop (class loop *loop, vec_info_shared *shared)
	 /* No code motion support for multiple epilogues so for now
	    not supported when multiple exits.  */
	 && !LOOP_VINFO_EARLY_BREAKS (first_loop_vinfo)
-	 && !loop->simduid);
+	 && !loop->simduid
+	 && loop_cost_model (loop) > VECT_COST_MODEL_VERY_CHEAP);

   if (!vect_epilogues)
     return first_loop_vinfo;
--
2.31.1
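P.S. A testcase along these lines could accompany the patch; the file name, options, and dump string below are my guesses and would need adjusting to the vect testsuite conventions:

    /* Hypothetical testcase sketch, e.g. gcc.dg/vect/vect-niter-very-cheap.c.
       Checks that a loop with a variable trip count is vectorized under the
       default (very cheap) cost model at plain -O2.  */
    /* { dg-do compile { target { x86_64-*-* i?86-*-* } } } */
    /* { dg-options "-O2 -fdump-tree-vect-details" } */

    void
    foo1 (int *__restrict a, int *b, int *c, int n)
    {
      for (int i = 0; i != n; i++)
        a[i] = b[i] + c[i];
    }

    /* { dg-final { scan-tree-dump "vectorized 1 loops" "vect" } } */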