GCC 12 enables vectorization at O2 with the very cheap cost model, which is
restricted to loops with a constant tripcount. The vectorization capability is
deliberately limited in consideration of the codesize impact.

The patch extends the very cheap cost model a little bit to support variable
tripcounts, but still disables peeling for gaps/alignment, runtime alias
checking and epilogue vectorization in consideration of codesize.

So there are at most 2 versions of a loop for O2 vectorization: one vectorized
main loop and one scalar/remainder loop.

E.g.

void
foo1 (int* __restrict a, int* b, int* c, int n)
{
 for (int i = 0; i != n; i++)
  a[i] = b[i] + c[i];
}

with -O2 -march=x86-64-v3, this will be vectorized to

.L10:
        vmovdqu (%r8,%rax), %ymm0
        vpaddd  (%rsi,%rax), %ymm0, %ymm0
        vmovdqu %ymm0, (%rdi,%rax)
        addq    $32, %rax
        cmpq    %rdx, %rax
        jne     .L10
        movl    %ecx, %eax
        andl    $-8, %eax
        cmpl    %eax, %ecx
        je      .L21
        vzeroupper
.L12:
        movl    (%r8,%rax,4), %edx
        addl    (%rsi,%rax,4), %edx
        movl    %edx, (%rdi,%rax,4)
        addq    $1, %rax
        cmpl    %eax, %ecx
        jne     .L12

As measured with SPEC2017 on EMR, the patch (N-Iter) improves performance by
4.11% with an extra 2.8% codesize, while the cheap cost model improves
performance by 5.74% with an extra 8.88% codesize. The details are as below.

Performance measured with -march=x86-64-v3 -O2 on EMR

                    N-Iter      cheap cost model
500.perlbench_r     -0.12%      -0.12%
502.gcc_r           0.44%       -0.11%  
505.mcf_r           0.17%       4.46%
520.omnetpp_r       0.28%       -0.27%
523.xalancbmk_r     0.00%       5.93%
525.x264_r          -0.09%      23.53%
531.deepsjeng_r     0.19%       0.00%
541.leela_r         0.22%       0.00%
548.exchange2_r     -11.54%     -22.34%
557.xz_r            0.74%       0.49%
GEOMEAN INT         -1.04%      0.60%

503.bwaves_r        3.13%       4.72%
507.cactuBSSN_r     1.17%       0.29%
508.namd_r          0.39%       6.87%
510.parest_r        3.14%       8.52%
511.povray_r        0.10%       -0.20%
519.lbm_r           -0.68%      10.14%
521.wrf_r           68.20%      76.73%
526.blender_r       0.12%       0.12%
527.cam4_r          19.67%      23.21%
538.imagick_r       0.12%       0.24%
544.nab_r           0.63%       0.53%
549.fotonik3d_r     14.44%      9.43%
554.roms_r          12.39%      0.00%
GEOMEAN FP          8.26%       9.41%
GEOMEAN ALL         4.11%       5.74%

Codesize impact
                    N-Iter      cheap cost model
500.perlbench_r     0.22%       1.03%
502.gcc_r           0.25%       0.60%   
505.mcf_r           0.00%       32.07%
520.omnetpp_r       0.09%       0.31%
523.xalancbmk_r     0.08%       1.86%
525.x264_r          0.75%       7.96%
531.deepsjeng_r     0.72%       3.28%
541.leela_r         0.18%       0.75%
548.exchange2_r     8.29%       12.19%
557.xz_r            0.40%       0.60%
GEOMEAN INT         1.07%       5.71%

503.bwaves_r        12.89%      21.59%
507.cactuBSSN_r     0.90%       20.19%
508.namd_r          0.77%       14.75%
510.parest_r        0.91%       3.91%
511.povray_r        0.45%       4.08%
519.lbm_r           0.00%       0.00%
521.wrf_r           5.97%       12.79%
526.blender_r       0.49%       3.84%
527.cam4_r          1.39%       3.28%
538.imagick_r       1.86%       7.78%
544.nab_r           0.41%       3.00%
549.fotonik3d_r     25.50%      47.47%
554.roms_r          5.17%       13.01%
GEOMEAN FP          4.14%       11.38%
GEOMEAN ALL         2.80%       8.88%


The only regression is in 548.exchange2_r: vectorizing the inner loop at each
level of the 9-level loop nest increases register pressure and causes more
spills.
- block(rnext:9, 1, i1) = block(rnext:9, 1, i1) + 10
  - block(rnext:9, 2, i2) = block(rnext:9, 2, i2) + 10
    .....
        - block(rnext:9, 9, i9) = block(rnext:9, 9, i9) + 10
    ...
- block(rnext:9, 2, i2) = block(rnext:9, 2, i2) + 10
- block(rnext:9, 1, i1) = block(rnext:9, 1, i1) + 10

It looks like aarch64 doesn't have this issue because aarch64 has 32 GPRs
while x86 only has 16. I have an extra patch that prevents loop vectorization
in deeply nested loops for the x86 backend, which brings the performance back.

For 503.bwaves_r/505.mcf_r/507.cactuBSSN_r/508.namd_r, the cheap cost model
increases codesize a lot without a matching performance gain, and N-Iter is
much better there for codesize.


Any comments?


gcc/ChangeLog:

        * tree-vect-loop.cc (vect_analyze_loop_costing): Enable
        vectorization for LOOP_VINFO_PEELING_FOR_NITER in very cheap
        cost model.
        (vect_analyze_loop): Disable epilogue vectorization in very
        cheap cost model.
---
 gcc/tree-vect-loop.cc | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/gcc/tree-vect-loop.cc b/gcc/tree-vect-loop.cc
index 242d5e2d916..06afd8cae79 100644
--- a/gcc/tree-vect-loop.cc
+++ b/gcc/tree-vect-loop.cc
@@ -2356,8 +2356,7 @@ vect_analyze_loop_costing (loop_vec_info loop_vinfo,
      a copy of the scalar code (even if we might be able to vectorize it).  */
   if (loop_cost_model (loop) == VECT_COST_MODEL_VERY_CHEAP
       && (LOOP_VINFO_PEELING_FOR_ALIGNMENT (loop_vinfo)
-         || LOOP_VINFO_PEELING_FOR_GAPS (loop_vinfo)
-         || LOOP_VINFO_PEELING_FOR_NITER (loop_vinfo)))
+         || LOOP_VINFO_PEELING_FOR_GAPS (loop_vinfo)))
     {
       if (dump_enabled_p ())
        dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
@@ -3638,7 +3637,8 @@ vect_analyze_loop (class loop *loop, vec_info_shared *shared)
                           /* No code motion support for multiple epilogues so for now
                              not supported when multiple exits.  */
                          && !LOOP_VINFO_EARLY_BREAKS (first_loop_vinfo)
-                        && !loop->simduid);
+                        && !loop->simduid
+                        && loop_cost_model (loop) > VECT_COST_MODEL_VERY_CHEAP);
   if (!vect_epilogues)
     return first_loop_vinfo;
 
-- 
2.31.1
