On Thu, Sep 19, 2024 at 2:08 PM Richard Biener <richard.guent...@gmail.com> wrote:
>
> On Wed, Sep 18, 2024 at 7:55 PM Richard Sandiford
> <richard.sandif...@arm.com> wrote:
> >
> > Richard Biener <richard.guent...@gmail.com> writes:
> > > On Thu, Sep 12, 2024 at 4:50 PM Hongtao Liu <crazy...@gmail.com> wrote:
> > >>
> > >> On Wed, Sep 11, 2024 at 4:21 PM Hongtao Liu <crazy...@gmail.com> wrote:
> > >> >
> > >> > On Wed, Sep 11, 2024 at 4:04 PM Richard Biener
> > >> > <richard.guent...@gmail.com> wrote:
> > >> > >
> > >> > > On Wed, Sep 11, 2024 at 4:17 AM liuhongt <hongtao....@intel.com>
> > >> > > wrote:
> > >> > > >
> > >> > > > GCC12 enables vectorization for O2 with very cheap cost model
> > >> > > > which is restricted to constant tripcount. The vectorization
> > >> > > > capacity is very limited w/ consideration of codesize impact.
> > >> > > >
> > >> > > > The patch extends the very cheap cost model a little bit to
> > >> > > > support variable tripcount. But still disable peeling for
> > >> > > > gaps/alignment, runtime aliasing checking and epilogue
> > >> > > > vectorization with the consideration of codesize.
> > >> > > >
> > >> > > > So there're at most 2 versions of loop for O2 vectorization, one
> > >> > > > vectorized main loop, one scalar/remainder loop.
> > >> > > >
> > >> > > > i.e.
> > >> > > >
> > >> > > > void
> > >> > > > foo1 (int* __restrict a, int* b, int* c, int n)
> > >> > > > {
> > >> > > >   for (int i = 0; i != n; i++)
> > >> > > >     a[i] = b[i] + c[i];
> > >> > > > }
> > >> > > >
> > >> > > > with -O2 -march=x86-64-v3, will be vectorized to
> > >> > > >
> > >> > > > .L10:
> > >> > > >         vmovdqu (%r8,%rax), %ymm0
> > >> > > >         vpaddd  (%rsi,%rax), %ymm0, %ymm0
> > >> > > >         vmovdqu %ymm0, (%rdi,%rax)
> > >> > > >         addq    $32, %rax
> > >> > > >         cmpq    %rdx, %rax
> > >> > > >         jne     .L10
> > >> > > >         movl    %ecx, %eax
> > >> > > >         andl    $-8, %eax
> > >> > > >         cmpl    %eax, %ecx
> > >> > > >         je      .L21
> > >> > > >         vzeroupper
> > >> > > > .L12:
> > >> > > >         movl    (%r8,%rax,4), %edx
> > >> > > >         addl    (%rsi,%rax,4), %edx
> > >> > > >         movl    %edx, (%rdi,%rax,4)
> > >> > > >         addq    $1, %rax
> > >> > > >         cmpl    %eax, %ecx
> > >> > > >         jne     .L12
> > >> > > >
> > >> > > > As measured with SPEC2017 on EMR, the patch (N-Iter) improves
> > >> > > > performance by 4.11% with extra 2.8% codesize, and cheap cost
> > >> > > > model improves performance by 5.74% with extra 8.88% codesize.
> > >> > > > The details are as below
> > >> > >
> > >> > > I'm confused by this, is the N-Iter numbers on top of the cheap cost
> > >> > > model numbers?
> > >> > No, it's N-iter vs base (very cheap cost model), and cheap vs base.
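For anyone reading the quoted assembly without the patch at hand, the two-loop shape can be sketched in plain C. This is only an illustration, not GCC output: foo1_sketch and VF are hypothetical names, VF = 8 is assumed from eight 32-bit ints per ymm register, and n is assumed non-negative. The inner "lane" loop stands in for one vector operation.

```c
/* Hypothetical hand-written equivalent of the quoted code shape.
   VF = 8 is an assumption: eight 32-bit ints fit one 256-bit register. */
enum { VF = 8 };

void foo1_sketch(int *restrict a, const int *b, const int *c, int n)
{
    int i = 0;
    /* Same rounding as "andl $-8": the largest multiple of VF that is
       not above n (assumes n >= 0). */
    int main_end = n & -VF;

    /* Main loop: each outer iteration covers one full vector of VF
       elements; the lane loop models a single vector add. */
    for (; i < main_end; i += VF)
        for (int lane = 0; lane < VF; lane++)
            a[i + lane] = b[i + lane] + c[i + lane];

    /* Scalar remainder loop: runs at most VF - 1 times. */
    for (; i != n; i++)
        a[i] = b[i] + c[i];
}
```

With n = 11, for example, the main loop handles elements 0..7 and the remainder loop handles 8..10. No alias check, alignment peeling or vectorized epilogue appears, which is the codesize trade-off described in the patch summary.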
> > >> > >
> > >> > > > Performance measured with -march=x86-64-v3 -O2 on EMR
> > >> > > >
> > >> > > >                     N-Iter    cheap cost model
> > >> > > > 500.perlbench_r     -0.12%    -0.12%
> > >> > > > 502.gcc_r            0.44%    -0.11%
> > >> > > > 505.mcf_r            0.17%     4.46%
> > >> > > > 520.omnetpp_r        0.28%    -0.27%
> > >> > > > 523.xalancbmk_r      0.00%     5.93%
> > >> > > > 525.x264_r          -0.09%    23.53%
> > >> > > > 531.deepsjeng_r      0.19%     0.00%
> > >> > > > 541.leela_r          0.22%     0.00%
> > >> > > > 548.exchange2_r    -11.54%   -22.34%
> > >> > > > 557.xz_r             0.74%     0.49%
> > >> > > > GEOMEAN INT         -1.04%     0.60%
> > >> > > >
> > >> > > > 503.bwaves_r         3.13%     4.72%
> > >> > > > 507.cactuBSSN_r      1.17%     0.29%
> > >> > > > 508.namd_r           0.39%     6.87%
> > >> > > > 510.parest_r         3.14%     8.52%
> > >> > > > 511.povray_r         0.10%    -0.20%
> > >> > > > 519.lbm_r           -0.68%    10.14%
> > >> > > > 521.wrf_r           68.20%    76.73%
> > >> > >
> > >> > > So this seems to regress as well?
> > >> > Niter increases performance less than the cheap cost model, that's
> > >> > expected, it is not a regression.
> > >> > >
> > >> > > > 526.blender_r        0.12%     0.12%
> > >> > > > 527.cam4_r          19.67%    23.21%
> > >> > > > 538.imagick_r        0.12%     0.24%
> > >> > > > 544.nab_r            0.63%     0.53%
> > >> > > > 549.fotonik3d_r     14.44%     9.43%
> > >> > > > 554.roms_r          12.39%     0.00%
> > >> > > > GEOMEAN FP           8.26%     9.41%
> > >> > > > GEOMEAN ALL          4.11%     5.74%
> > >>
> > >> I've tested the patch on aarch64, it shows similar improvement with
> > >> little codesize increase.
> > >> I haven't tested it on other backends, but I think it would have
> > >> similar good improvements
> > >
> > > I think overall this is expected since a constant niter divisible by
> > > the VF isn't a common situation.  So the question is mostly whether
> > > we want to pay the size penalty or not.
> > >
> > > Looking only at docs the proposed change would make the very-cheap
> > > cost model nearly(?) equivalent to the cheap one so maybe the answer
> > > is to default to cheap rather than very-cheap?
> > > One difference seems to be that cheap allows alias versioning.
> >
> > I remember seeing cases in the past where we could generate an
> > excessive number of alias checks.  The cost model didn't account
> > for them very well, since the checks often became a fixed overhead
> > for all paths (both scalar and vector), especially if the checks
> > were fully if-converted, with one branch at the end.  The relevant
> > comparison is then between the original pre-vectorisation scalar code
> > and the code with alias checks, rather than between post-vectorisation
> > scalar code and post-vectorisation vector code.  Things might be better
> > now though.
>
> Yes, the cost model (aka niter) check should now be before the alias check,
> not if-converted, but of course the alias-checking overhead can still be
> quite big.
>
> > FTR, I don't object to relaxing the -O2 model.  It was deliberately
> > conservative, for a time when enabling vectorisation at -O2 was
> > somewhat controversial.  It was also heavily influenced by SVE,
> > where variable trip counts are not an issue.
>
> I agree - I think we can try for GCC 15.  Note since we disallow epilogue
> vectorization with cheap we might want to prefer smaller vector sizes
> which means the target might want to adjust its vector_modes hook.
>
> > The proposal would also make GCC's behaviour more similar to Clang's.
>
> So should we adjust very-cheap to allow niter peeling as proposed or
> should we switch the default at -O2 to cheap?
Any thoughts from other backend maintainers?
>
> Richard.
>
> > Thanks,
> > Richard

--
BR,
Hongtao
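
P.S. A note on reading the GEOMEAN rows in the quoted table. They look consistent with a geometric mean of per-benchmark ratios, where an entry of p% is taken as a ratio of 1 + p/100 against the baseline. That convention is an assumption here, since the thread does not spell it out. Under it, the ten INT values in the N-Iter column come out at about -1.04%, matching the quoted row. The sketch below is only an illustration: geomean_percent and spec_int_niter_pct are hypothetical names, and the n-th root uses Newton's method rather than the more usual exp/log so that it builds without libm.

```c
#include <stddef.h>

/* N-Iter column, SPECint entries, copied from the quoted table (percent). */
static const double spec_int_niter_pct[10] = {
    -0.12, 0.44, 0.17, 0.28, 0.00, -0.09, 0.19, 0.22, -11.54, 0.74
};

/* Geometric mean of percentage changes, returned as a percentage.
   Assumes each entry p means a ratio of 1 + p/100, and n >= 1. */
double geomean_percent(const double *pct, size_t n)
{
    double product = 1.0;
    for (size_t i = 0; i < n; i++)
        product *= 1.0 + pct[i] / 100.0;

    /* n-th root by Newton's method on x^n = product, starting at 1.0.
       A fixed iteration count is plenty for ratios this close to 1. */
    double x = 1.0;
    for (int iter = 0; iter < 60; iter++) {
        double xpow = 1.0;              /* x^(n-1) */
        for (size_t k = 1; k < n; k++)
            xpow *= x;
        x = ((double)(n - 1) * x + product / xpow) / (double)n;
    }
    return (x - 1.0) * 100.0;
}
```

Calling geomean_percent(spec_int_niter_pct, 10) shows how much the single -11.54% entry for 548.exchange2_r pulls the INT mean down, since every other entry in that column is within 1% of the baseline.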