Richard Biener <richard.guent...@gmail.com> writes:
> On Thu, Sep 12, 2024 at 4:50 PM Hongtao Liu <crazy...@gmail.com> wrote:
>>
>> On Wed, Sep 11, 2024 at 4:21 PM Hongtao Liu <crazy...@gmail.com> wrote:
>> >
>> > On Wed, Sep 11, 2024 at 4:04 PM Richard Biener
>> > <richard.guent...@gmail.com> wrote:
>> > >
>> > > On Wed, Sep 11, 2024 at 4:17 AM liuhongt <hongtao....@intel.com> wrote:
>> > > >
>> > > > GCC12 enables vectorization for O2 with the very cheap cost model,
>> > > > which is restricted to a constant trip count.  The vectorization
>> > > > capacity is very limited, in consideration of the codesize impact.
>> > > >
>> > > > The patch extends the very cheap cost model a little bit to support a
>> > > > variable trip count, but still disables peeling for gaps/alignment,
>> > > > runtime alias checking and epilogue vectorization, again with
>> > > > codesize in mind.
>> > > >
>> > > > So there are at most 2 versions of a loop for O2 vectorization: one
>> > > > vectorized main loop and one scalar/remainder loop.
>> > > >
>> > > > I.e.
>> > > >
>> > > > void
>> > > > foo1 (int* __restrict a, int* b, int* c, int n)
>> > > > {
>> > > >   for (int i = 0; i != n; i++)
>> > > >     a[i] = b[i] + c[i];
>> > > > }
>> > > >
>> > > > with -O2 -march=x86-64-v3 will be vectorized to
>> > > >
>> > > > .L10:
>> > > >         vmovdqu (%r8,%rax), %ymm0
>> > > >         vpaddd  (%rsi,%rax), %ymm0, %ymm0
>> > > >         vmovdqu %ymm0, (%rdi,%rax)
>> > > >         addq    $32, %rax
>> > > >         cmpq    %rdx, %rax
>> > > >         jne     .L10
>> > > >         movl    %ecx, %eax
>> > > >         andl    $-8, %eax
>> > > >         cmpl    %eax, %ecx
>> > > >         je      .L21
>> > > >         vzeroupper
>> > > > .L12:
>> > > >         movl    (%r8,%rax,4), %edx
>> > > >         addl    (%rsi,%rax,4), %edx
>> > > >         movl    %edx, (%rdi,%rax,4)
>> > > >         addq    $1, %rax
>> > > >         cmpl    %eax, %ecx
>> > > >         jne     .L12
>> > > >
>> > > > As measured with SPEC2017 on EMR, the patch (N-Iter) improves
>> > > > performance by 4.11% with an extra 2.8% codesize, and the cheap cost
>> > > > model improves performance by 5.74% with an extra 8.88% codesize.
>> > > > The details are below.
>> > >
>> > > I'm confused by this, are the N-Iter numbers on top of the cheap cost
>> > > model numbers?
>> > No, it's N-Iter vs. base (very cheap cost model), and cheap vs. base.
>> > >
>> > > > Performance measured with -march=x86-64-v3 -O2 on EMR
>> > > >
>> > > >                       N-Iter    cheap cost model
>> > > > 500.perlbench_r       -0.12%    -0.12%
>> > > > 502.gcc_r              0.44%    -0.11%
>> > > > 505.mcf_r              0.17%     4.46%
>> > > > 520.omnetpp_r          0.28%    -0.27%
>> > > > 523.xalancbmk_r        0.00%     5.93%
>> > > > 525.x264_r            -0.09%    23.53%
>> > > > 531.deepsjeng_r        0.19%     0.00%
>> > > > 541.leela_r            0.22%     0.00%
>> > > > 548.exchange2_r      -11.54%   -22.34%
>> > > > 557.xz_r               0.74%     0.49%
>> > > > GEOMEAN INT           -1.04%     0.60%
>> > > >
>> > > > 503.bwaves_r           3.13%     4.72%
>> > > > 507.cactuBSSN_r        1.17%     0.29%
>> > > > 508.namd_r             0.39%     6.87%
>> > > > 510.parest_r           3.14%     8.52%
>> > > > 511.povray_r           0.10%    -0.20%
>> > > > 519.lbm_r             -0.68%    10.14%
>> > > > 521.wrf_r             68.20%    76.73%
>> > >
>> > > So this seems to regress as well?
>> > N-Iter increases performance less than the cheap cost model; that's
>> > expected, it is not a regression.
>> > >
>> > > > 526.blender_r          0.12%     0.12%
>> > > > 527.cam4_r            19.67%    23.21%
>> > > > 538.imagick_r          0.12%     0.24%
>> > > > 544.nab_r              0.63%     0.53%
>> > > > 549.fotonik3d_r       14.44%     9.43%
>> > > > 554.roms_r            12.39%     0.00%
>> > > > GEOMEAN FP             8.26%     9.41%
>> > > > GEOMEAN ALL            4.11%     5.74%
>>
>> I've tested the patch on aarch64, and it shows a similar improvement
>> with little codesize increase.
>> I haven't tested it on other backends, but I think it would have
>> similarly good improvements.
>
> I think overall this is expected, since a constant niter divisible by
> the VF isn't a common situation.  So the question is mostly whether
> we want to pay the size penalty or not.
>
> Looking only at the docs, the proposed change would make the very-cheap
> cost model nearly(?) equivalent to the cheap one, so maybe the answer
> is to default to cheap rather than very-cheap?  One difference seems to
> be that cheap allows alias versioning.
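
For concreteness, a hypothetical variant of foo1 with the __restrict
qualifier dropped (foo2 below is not from the patch or its testsuite) is
the kind of loop where alias versioning matters: the stores through a
might alias the loads through b or c, so vectorising it requires a
runtime alias check, which the cheap model allows and the very cheap
model (with or without the patch) does not:

void
foo2 (int* a, int* b, int* c, int n)
{
  for (int i = 0; i != n; i++)
    a[i] = b[i] + c[i];
}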
I remember seeing cases in the past where we could generate an excessive
number of alias checks.  The cost model didn't account for them very well,
since the checks often became a fixed overhead for all paths (both scalar
and vector), especially if the checks were fully if-converted, with one
branch at the end.  The relevant comparison is then between the original
pre-vectorisation scalar code and the code with alias checks, rather than
between post-vectorisation scalar code and post-vectorisation vector code.
Things might be better now though.

FTR, I don't object to relaxing the -O2 model.  It was deliberately
conservative, for a time when enabling vectorisation at -O2 was somewhat
controversial.  It was also heavily influenced by SVE, where variable
trip counts are not an issue.

The proposal would also make GCC's behaviour more similar to Clang's.

Thanks,
Richard
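
As an aside, a rough hand-written sketch of the versioning structure
described above might look like the following.  It is not actual
vectoriser output, and the names (foo2_versioned, a_b_disjoint) are only
illustrative, but it shows why an if-converted overlap test with a single
branch at the end is a fixed cost that is also paid when the scalar
fallback is taken:

#include <stdint.h>

/* Hand-written sketch (not vectoriser output) of loop versioning with a
   runtime alias check for a loop like foo2 above.  */
void
foo2_versioned (int* a, int* b, int* c, int n)
{
  uintptr_t ap = (uintptr_t) a;
  uintptr_t bp = (uintptr_t) b;
  uintptr_t cp = (uintptr_t) c;
  uintptr_t len = (uintptr_t) n * sizeof (int);

  /* If-converted alias checks: straight-line arithmetic, no branches.  */
  int a_b_disjoint = (ap + len <= bp) | (bp + len <= ap);
  int a_c_disjoint = (ap + len <= cp) | (cp + len <= ap);

  /* Single branch at the end of the checks.  */
  if (a_b_disjoint & a_c_disjoint)
    {
      /* Vectorised version; the vectoriser would emit SIMD code here.  */
      for (int i = 0; i != n; i++)
        a[i] = b[i] + c[i];
    }
  else
    {
      /* Original scalar loop, reached only after paying for the checks.  */
      for (int i = 0; i != n; i++)
        a[i] = b[i] + c[i];
    }
}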