On Wed, Sep 11, 2024 at 4:17 AM liuhongt <hongtao....@intel.com> wrote:
>
> GCC12 enables vectorization at O2 with the very cheap cost model, which is
> restricted to constant trip counts, so the vectorization capability is very
> limited with consideration of the codesize impact.
>
> The patch extends the very cheap cost model a little bit to support variable
> trip counts, but still disables peeling for gaps/alignment, runtime alias
> checking and epilogue vectorization with consideration of codesize.
>
> So there are at most 2 versions of a loop for O2 vectorization: one
> vectorized main loop and one scalar/remainder loop.
>
> For example:
>
> void
> foo1 (int* __restrict a, int* b, int* c, int n)
> {
>  for (int i = 0; i != n; i++)
>   a[i] = b[i] + c[i];
> }
>
> with -O2 -march=x86-64-v3, will be vectorized to
>
> .L10:
>         vmovdqu (%r8,%rax), %ymm0
>         vpaddd  (%rsi,%rax), %ymm0, %ymm0
>         vmovdqu %ymm0, (%rdi,%rax)
>         addq    $32, %rax
>         cmpq    %rdx, %rax
>         jne     .L10
>         movl    %ecx, %eax
>         andl    $-8, %eax
>         cmpl    %eax, %ecx
>         je      .L21
>         vzeroupper
> .L12:
>         movl    (%r8,%rax,4), %edx
>         addl    (%rsi,%rax,4), %edx
>         movl    %edx, (%rdi,%rax,4)
>         addq    $1, %rax
>         cmpl    %eax, %ecx
>         jne     .L12
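As a reader aid (not part of the patch), the two-version structure above corresponds roughly to the following hand-written scalar C sketch; the helper name and the assumption n >= 0 are mine:

```c
/* Hand-written sketch of the shape GCC gives foo1 at -O2 above: a main
   loop processing 8 ints per iteration (one 256-bit ymm vector) plus a
   scalar remainder loop.  Illustrative only; assumes n >= 0.  */
void foo1_versioned (int *restrict a, int *b, int *c, int n)
{
  int i = 0;
  int vec_niter = n & ~7;          /* n rounded down to a multiple of 8 */
  for (; i < vec_niter; i += 8)    /* vectorized main loop, 8 lanes */
    for (int lane = 0; lane < 8; lane++)
      a[i + lane] = b[i + lane] + c[i + lane];
  for (; i < n; i++)               /* scalar/remainder loop */
    a[i] = b[i] + c[i];
}
```

With n = 13, the main loop handles elements 0-7 and the remainder loop elements 8-12, matching the jump from .L10 to .L12 in the assembly.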
>
> As measured with SPEC2017 on EMR, the patch (N-Iter) improves performance
> by 4.11% with an extra 2.8% codesize, and the cheap cost model improves
> performance by 5.74% with an extra 8.88% codesize. The details are below.

I'm confused by this: are the N-Iter numbers on top of the cheap cost
model numbers?

> Performance measured with -march=x86-64-v3 -O2 on EMR
>
>                     N-Iter      cheap cost model
> 500.perlbench_r     -0.12%      -0.12%
> 502.gcc_r           0.44%       -0.11%
> 505.mcf_r           0.17%       4.46%
> 520.omnetpp_r       0.28%       -0.27%
> 523.xalancbmk_r     0.00%       5.93%
> 525.x264_r          -0.09%      23.53%
> 531.deepsjeng_r     0.19%       0.00%
> 541.leela_r         0.22%       0.00%
> 548.exchange2_r     -11.54%     -22.34%
> 557.xz_r            0.74%       0.49%
> GEOMEAN INT         -1.04%      0.60%
>
> 503.bwaves_r        3.13%       4.72%
> 507.cactuBSSN_r     1.17%       0.29%
> 508.namd_r          0.39%       6.87%
> 510.parest_r        3.14%       8.52%
> 511.povray_r        0.10%       -0.20%
> 519.lbm_r           -0.68%      10.14%
> 521.wrf_r           68.20%      76.73%

So this seems to regress as well?

> 526.blender_r       0.12%       0.12%
> 527.cam4_r          19.67%      23.21%
> 538.imagick_r       0.12%       0.24%
> 544.nab_r           0.63%       0.53%
> 549.fotonik3d_r     14.44%      9.43%
> 554.roms_r          12.39%      0.00%
> GEOMEAN FP          8.26%       9.41%
> GEOMEAN ALL         4.11%       5.74%
>
> Codesize impact
>                     N-Iter      cheap cost model
> 500.perlbench_r     0.22%       1.03%
> 502.gcc_r           0.25%       0.60%
> 505.mcf_r           0.00%       32.07%
> 520.omnetpp_r       0.09%       0.31%
> 523.xalancbmk_r     0.08%       1.86%
> 525.x264_r          0.75%       7.96%
> 531.deepsjeng_r     0.72%       3.28%
> 541.leela_r         0.18%       0.75%
> 548.exchange2_r     8.29%       12.19%
> 557.xz_r            0.40%       0.60%
> GEOMEAN INT         1.07%       5.71%
>
> 503.bwaves_r        12.89%      21.59%
> 507.cactuBSSN_r     0.90%       20.19%
> 508.namd_r          0.77%       14.75%
> 510.parest_r        0.91%       3.91%
> 511.povray_r        0.45%       4.08%
> 519.lbm_r           0.00%       0.00%
> 521.wrf_r           5.97%       12.79%
> 526.blender_r       0.49%       3.84%
> 527.cam4_r          1.39%       3.28%
> 538.imagick_r       1.86%       7.78%
> 544.nab_r           0.41%       3.00%
> 549.fotonik3d_r     25.50%      47.47%
> 554.roms_r          5.17%       13.01%
> GEOMEAN FP          4.14%       11.38%
> GEOMEAN ALL         2.80%       8.88%
>
>
> The only regression is from 548.exchange2_r: vectorization of the inner loop
> in each layer of the 9-layer loop nest increases register pressure and
> causes more spills.
> - block(rnext:9, 1, i1) = block(rnext:9, 1, i1) + 10
>   - block(rnext:9, 2, i2) = block(rnext:9, 2, i2) + 10
>     .....
>         - block(rnext:9, 9, i9) = block(rnext:9, 9, i9) + 10
>     ...
> - block(rnext:9, 2, i2) = block(rnext:9, 2, i2) + 10
> - block(rnext:9, 1, i1) = block(rnext:9, 1, i1) + 10
>
> It looks like aarch64 doesn't have this issue because aarch64 has 32 GPRs
> while x86 only has 16. I have an extra patch for the x86 backend that
> prevents loop vectorization in deeply nested loops, which brings the
> performance back.
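To make the register-pressure point concrete, here is a small hand-written C stand-in for the nest shape quoted above (3 levels instead of 9; all names illustrative): every level keeps its own induction variable and base address live across a vectorizable inner loop, and once each inner loop is vectorized those live values compete for x86-64's 16 GPRs.

```c
#define N 9

/* Simplified 3-level stand-in for the 9-level 548.exchange2_r nest:
   each level runs the same small, vectorizable inner loop over block,
   while the induction variables of all enclosing levels stay live.  */
void update3 (int block[N][N][N])
{
  for (int i1 = 0; i1 < N; i1++)
    {
      for (int r = 0; r < N; r++)        /* vectorizable inner loop */
        block[r][0][i1] += 10;
      for (int i2 = 0; i2 < N; i2++)
        {
          for (int r = 0; r < N; r++)    /* vectorizable inner loop */
            block[r][1][i2] += 10;
          for (int i3 = 0; i3 < N; i3++)
            for (int r = 0; r < N; r++)  /* vectorizable inner loop */
              block[r][2][i3] += 10;
        }
    }
}
```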
>
> For 503.bwaves_r/505.mcf_r/507.cactuBSSN_r/508.namd_r, the cheap cost model
> increases codesize a lot but doesn't improve performance at all; N-Iter is
> much better there for codesize.
>
>
> Any comments?
>
>
> gcc/ChangeLog:
>
>         * tree-vect-loop.cc (vect_analyze_loop_costing): Enable
>         vectorization for LOOP_VINFO_PEELING_FOR_NITER in very cheap
>         cost model.
>         (vect_analyze_loop): Disable epilogue vectorization in very
>         cheap cost model.
> ---
>  gcc/tree-vect-loop.cc | 6 +++---
>  1 file changed, 3 insertions(+), 3 deletions(-)
>
> diff --git a/gcc/tree-vect-loop.cc b/gcc/tree-vect-loop.cc
> index 242d5e2d916..06afd8cae79 100644
> --- a/gcc/tree-vect-loop.cc
> +++ b/gcc/tree-vect-loop.cc
> @@ -2356,8 +2356,7 @@ vect_analyze_loop_costing (loop_vec_info loop_vinfo,
>       a copy of the scalar code (even if we might be able to vectorize it).  */
>    if (loop_cost_model (loop) == VECT_COST_MODEL_VERY_CHEAP
>        && (LOOP_VINFO_PEELING_FOR_ALIGNMENT (loop_vinfo)
> -         || LOOP_VINFO_PEELING_FOR_GAPS (loop_vinfo)
> -         || LOOP_VINFO_PEELING_FOR_NITER (loop_vinfo)))
> +         || LOOP_VINFO_PEELING_FOR_GAPS (loop_vinfo)))

I notice that we should probably not call
vect_enhance_data_refs_alignment, because when alignment peeling is
optional we should avoid it rather than disabling the vectorization
completely.

Also, if you allow peeling for niter then there's no good reason not to
allow peeling for gaps (or any other epilogue peeling).

The extra cost for niter peeling is a runtime check before the loop,
which would also happen (plus keeping the scalar copy) when there's a
runtime cost check.  That also means versioning for alias/alignment
could be allowed if it shares the scalar loop with the epilogue (I don't
remember the constraints we set in place for the sharing).

Richard.
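For illustration, the shared scalar loop described above might look like this hand-written sketch (a hypothetical shape, not GCC's actual IL; the pointer overlap checks are deliberately simplified):

```c
/* Hypothetical sketch: the runtime alias check and the vector trip
   count both guard the vector loop, and a single scalar loop serves as
   both the unvectorized fallback and the epilogue of the vector loop.
   Overlap test simplified for illustration; assumes n >= 0.  */
void foo1_versioned_alias (int *a, int *b, int *c, int n)
{
  int i = 0;
  /* Runtime alias check: only enter the vector loop when the stores to
     a[] cannot overlap the loads from b[] and c[].  */
  if ((a + n <= b || b + n <= a) && (a + n <= c || c + n <= a))
    for (int vn = n & ~7; i < vn; i += 8)   /* vectorized main loop */
      for (int lane = 0; lane < 8; lane++)
        a[i + lane] = b[i + lane] + c[i + lane];
  for (; i < n; i++)                        /* shared scalar loop */
    a[i] = b[i] + c[i];
}
```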

>      {
>        if (dump_enabled_p ())
>         dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
> @@ -3638,7 +3637,8 @@ vect_analyze_loop (class loop *loop, vec_info_shared *shared)
>                            /* No code motion support for multiple epilogues so for now
>                               not supported when multiple exits.  */
>                          && !LOOP_VINFO_EARLY_BREAKS (first_loop_vinfo)
> -                        && !loop->simduid);
> +                        && !loop->simduid
> +                        && loop_cost_model (loop) > VECT_COST_MODEL_VERY_CHEAP);
>    if (!vect_epilogues)
>      return first_loop_vinfo;
>
> --
> 2.31.1
>