On Wed, Sep 11, 2024 at 4:17 AM liuhongt <hongtao....@intel.com> wrote:
>
> GCC 12 enables vectorization at O2 with the very cheap cost model, which is
> restricted to constant trip counts. The vectorization capacity is very
> limited, with consideration of the codesize impact.
>
> The patch extends the very cheap cost model a little bit to support variable
> trip counts, but still disables peeling for gaps/alignment, runtime aliasing
> checks and epilogue vectorization, with codesize in mind.
>
> So there are at most 2 versions of a loop for O2 vectorization: one
> vectorized main loop and one scalar/remainder loop.
>
> I.e.
>
> void
> foo1 (int* __restrict a, int* b, int* c, int n)
> {
>   for (int i = 0; i != n; i++)
>     a[i] = b[i] + c[i];
> }
>
> with -O2 -march=x86-64-v3 will be vectorized to
>
> .L10:
>         vmovdqu (%r8,%rax), %ymm0
>         vpaddd  (%rsi,%rax), %ymm0, %ymm0
>         vmovdqu %ymm0, (%rdi,%rax)
>         addq    $32, %rax
>         cmpq    %rdx, %rax
>         jne     .L10
>         movl    %ecx, %eax
>         andl    $-8, %eax
>         cmpl    %eax, %ecx
>         je      .L21
>         vzeroupper
> .L12:
>         movl    (%r8,%rax,4), %edx
>         addl    (%rsi,%rax,4), %edx
>         movl    %edx, (%rdi,%rax,4)
>         addq    $1, %rax
>         cmpl    %eax, %ecx
>         jne     .L12
>
> As measured with SPEC2017 on EMR, the patch (N-Iter) improves performance by
> 4.11% with an extra 2.8% codesize, and the cheap cost model improves
> performance by 5.74% with an extra 8.88% codesize. The details are below.
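(For readers not fluent in x86 assembly, here is a plain-C sketch of the two loop versions the quoted assembly implements — purely illustrative; `VF` and `foo1_model` are made-up names, and the inner loop stands in for a single SIMD instruction, not GCC's actual output:)

```c
#include <assert.h>

/* Hypothetical model of the generated code: a main loop stepping by VF
   elements (one AVX2 vpaddd adds 8 ints at once, matching .L10 above)
   and a scalar remainder loop (matching .L12 above).  */
enum { VF = 8 };

static void
foo1_model (int *restrict a, const int *b, const int *c, int n)
{
  int i = 0;
  /* "Vectorized" main loop: runs while at least VF elements remain.  */
  for (; i + VF <= n; i += VF)
    for (int j = 0; j < VF; j++)   /* stands in for one SIMD add */
      a[i + j] = b[i + j] + c[i + j];
  /* Scalar remainder loop for the trailing n % VF elements.  */
  for (; i < n; i++)
    a[i] = b[i] + c[i];
}
```

With n = 13, the main loop covers elements 0..7 and the remainder loop elements 8..12 — the same trip-count split the `andl $-8` computes in the assembly.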
I'm confused by this — are the N-Iter numbers on top of the cheap cost
model numbers?

> Performance measured with -march=x86-64-v3 -O2 on EMR
>
>                         N-Iter    cheap cost model
> 500.perlbench_r        -0.12%    -0.12%
> 502.gcc_r               0.44%    -0.11%
> 505.mcf_r               0.17%     4.46%
> 520.omnetpp_r           0.28%    -0.27%
> 523.xalancbmk_r         0.00%     5.93%
> 525.x264_r             -0.09%    23.53%
> 531.deepsjeng_r         0.19%     0.00%
> 541.leela_r             0.22%     0.00%
> 548.exchange2_r       -11.54%   -22.34%
> 557.xz_r                0.74%     0.49%
> GEOMEAN INT            -1.04%     0.60%
>
> 503.bwaves_r            3.13%     4.72%
> 507.cactuBSSN_r         1.17%     0.29%
> 508.namd_r              0.39%     6.87%
> 510.parest_r            3.14%     8.52%
> 511.povray_r            0.10%    -0.20%
> 519.lbm_r              -0.68%    10.14%
> 521.wrf_r              68.20%    76.73%

So this seems to regress as well?

> 526.blender_r           0.12%     0.12%
> 527.cam4_r             19.67%    23.21%
> 538.imagick_r           0.12%     0.24%
> 544.nab_r               0.63%     0.53%
> 549.fotonik3d_r        14.44%     9.43%
> 554.roms_r             12.39%     0.00%
> GEOMEAN FP              8.26%     9.41%
> GEOMEAN ALL             4.11%     5.74%
>
> Codesize impact
>                         N-Iter    cheap cost model
> 500.perlbench_r         0.22%     1.03%
> 502.gcc_r               0.25%     0.60%
> 505.mcf_r               0.00%    32.07%
> 520.omnetpp_r           0.09%     0.31%
> 523.xalancbmk_r         0.08%     1.86%
> 525.x264_r              0.75%     7.96%
> 531.deepsjeng_r         0.72%     3.28%
> 541.leela_r             0.18%     0.75%
> 548.exchange2_r         8.29%    12.19%
> 557.xz_r                0.40%     0.60%
> GEOMEAN INT             1.07%     5.71%
>
> 503.bwaves_r           12.89%    21.59%
> 507.cactuBSSN_r         0.90%    20.19%
> 508.namd_r              0.77%    14.75%
> 510.parest_r            0.91%     3.91%
> 511.povray_r            0.45%     4.08%
> 519.lbm_r               0.00%     0.00%
> 521.wrf_r               5.97%    12.79%
> 526.blender_r           0.49%     3.84%
> 527.cam4_r              1.39%     3.28%
> 538.imagick_r           1.86%     7.78%
> 544.nab_r               0.41%     3.00%
> 549.fotonik3d_r        25.50%    47.47%
> 554.roms_r              5.17%    13.01%
> GEOMEAN FP              4.14%    11.38%
> GEOMEAN ALL             2.80%     8.88%
>
> The only regression is from 548.exchange2_r: vectorization of the inner
> loop in each layer of the 9-layer loop nest increases register pressure
> and causes more spills.
>
> -        block(rnext:9, 1, i1) = block(rnext:9, 1, i1) + 10
> -        block(rnext:9, 2, i2) = block(rnext:9, 2, i2) + 10
>   .....
> -        block(rnext:9, 9, i9) = block(rnext:9, 9, i9) + 10
>   ...
> -        block(rnext:9, 2, i2) = block(rnext:9, 2, i2) + 10
> -        block(rnext:9, 1, i1) = block(rnext:9, 1, i1) + 10
>
> It looks like aarch64 doesn't have the issue because aarch64 has 32 GPRs,
> while x86 only has 16. I have an extra patch for the x86 backend that
> prevents loop vectorization in deeply nested loops, which brings the
> performance back.
>
> For 503.bwaves_r/505.mcf_r/507.cactuBSSN_r/508.namd_r, the cheap cost
> model increases codesize a lot but doesn't improve performance at all,
> and N-Iter is much better there for codesize.
>
> Any comments?
>
> gcc/ChangeLog:
>
>         * tree-vect-loop.cc (vect_analyze_loop_costing): Enable
>         vectorization for LOOP_VINFO_PEELING_FOR_NITER in very cheap
>         cost model.
>         (vect_analyze_loop): Disable epilogue vectorization in very
>         cheap cost model.
> ---
>  gcc/tree-vect-loop.cc | 6 +++---
>  1 file changed, 3 insertions(+), 3 deletions(-)
>
> diff --git a/gcc/tree-vect-loop.cc b/gcc/tree-vect-loop.cc
> index 242d5e2d916..06afd8cae79 100644
> --- a/gcc/tree-vect-loop.cc
> +++ b/gcc/tree-vect-loop.cc
> @@ -2356,8 +2356,7 @@ vect_analyze_loop_costing (loop_vec_info loop_vinfo,
>       a copy of the scalar code (even if we might be able to vectorize it).  */
>    if (loop_cost_model (loop) == VECT_COST_MODEL_VERY_CHEAP
>        && (LOOP_VINFO_PEELING_FOR_ALIGNMENT (loop_vinfo)
> -         || LOOP_VINFO_PEELING_FOR_GAPS (loop_vinfo)
> -         || LOOP_VINFO_PEELING_FOR_NITER (loop_vinfo)))
> +         || LOOP_VINFO_PEELING_FOR_GAPS (loop_vinfo)))

I notice that we should probably not call vect_enhance_data_refs_alignment:
when alignment peeling is optional we should avoid it rather than disabling
vectorization completely.

Also, if you allow peeling for niter then there's no good reason not to
allow peeling for gaps (or any other epilogue peeling). The extra cost for
niter peeling is a runtime check before the loop, which would also happen
(plus keeping the scalar copy) when there's a runtime cost check.
That also means versioning for alias/alignment could be allowed if it shares
the scalar loop with the epilogue (I don't remember the constraints we set
in place for the sharing).

Richard.

>      {
>        if (dump_enabled_p ())
>          dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
> @@ -3638,7 +3637,8 @@ vect_analyze_loop (class loop *loop, vec_info_shared *shared)
>              /* No code motion support for multiple epilogues so for now
>                 not supported when multiple exits.  */
>              && !LOOP_VINFO_EARLY_BREAKS (first_loop_vinfo)
> -            && !loop->simduid);
> +            && !loop->simduid
> +            && loop_cost_model (loop) > VECT_COST_MODEL_VERY_CHEAP);
>      if (!vect_epilogues)
>        return first_loop_vinfo;
>
> --
> 2.31.1
>
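(The sharing idea discussed above — one scalar loop serving both as the fallback when a runtime check fails and as the epilogue when it passes — can be sketched in plain C. This is an illustration only, not GCC source: `VF`, `add_versioned` and the simplified overlap test are all assumptions.)

```c
#include <assert.h>

enum { VF = 8 };

/* Illustrative sketch of loop versioning with a runtime alias check.
   A single scalar loop covers both the "check failed" path (it runs
   all n iterations) and the epilogue path (it runs the n % VF
   remainder after the vectorized main loop).  */
static void
add_versioned (int *a, const int *b, const int *c, int n)
{
  int i = 0;
  /* Runtime alias check: take the vector path only if a[] overlaps
     neither b[] nor c[] over the n elements accessed.  */
  if ((a + n <= b || b + n <= a) && (a + n <= c || c + n <= a))
    for (; i + VF <= n; i += VF)
      for (int j = 0; j < VF; j++)   /* models one SIMD add */
        a[i + j] = b[i + j] + c[i + j];
  /* Shared scalar loop: full loop if the check failed, epilogue
     (remainder iterations) if it passed.  */
  for (; i < n; i++)
    a[i] = b[i] + c[i];
}
```

The codesize point above follows from this shape: the scalar copy must be kept anyway, so the marginal cost of the runtime check is one branch before the loop.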