On Thu, Sep 19, 2024 at 2:08 PM Richard Biener
<richard.guent...@gmail.com> wrote:
>
> On Wed, Sep 18, 2024 at 7:55 PM Richard Sandiford
> <richard.sandif...@arm.com> wrote:
> >
> > Richard Biener <richard.guent...@gmail.com> writes:
> > > On Thu, Sep 12, 2024 at 4:50 PM Hongtao Liu <crazy...@gmail.com> wrote:
> > >>
> > >> On Wed, Sep 11, 2024 at 4:21 PM Hongtao Liu <crazy...@gmail.com> wrote:
> > >> >
> > >> > On Wed, Sep 11, 2024 at 4:04 PM Richard Biener
> > >> > <richard.guent...@gmail.com> wrote:
> > >> > >
> > >> > > On Wed, Sep 11, 2024 at 4:17 AM liuhongt <hongtao....@intel.com> 
> > >> > > wrote:
> > >> > > >
> > >> > > > GCC 12 enables vectorization at -O2 with the very-cheap cost model,
> > >> > > > which is restricted to constant trip counts.  The vectorization
> > >> > > > capability is very limited, with code-size impact in mind.
> > >> > > >
> > >> > > > The patch extends the very-cheap cost model a little bit to support
> > >> > > > variable trip counts, but it still disables peeling for gaps/alignment,
> > >> > > > runtime alias checking and epilogue vectorization, with code size in
> > >> > > > mind.
> > >> > > >
> > >> > > > So there are at most two versions of the loop for O2 vectorization:
> > >> > > > one vectorized main loop and one scalar/remainder loop.
> > >> > > >
> > >> > > > E.g.
> > >> > > >
> > >> > > > void
> > >> > > > foo1 (int* __restrict a, int* b, int* c, int n)
> > >> > > > {
> > >> > > >  for (int i = 0; i != n; i++)
> > >> > > >   a[i] = b[i] + c[i];
> > >> > > > }
> > >> > > >
> > >> > > > With -O2 -march=x86-64-v3, this will be vectorized to:
> > >> > > >
> > >> > > > .L10:
> > >> > > >         vmovdqu (%r8,%rax), %ymm0
> > >> > > >         vpaddd  (%rsi,%rax), %ymm0, %ymm0
> > >> > > >         vmovdqu %ymm0, (%rdi,%rax)
> > >> > > >         addq    $32, %rax
> > >> > > >         cmpq    %rdx, %rax
> > >> > > >         jne     .L10
> > >> > > >         movl    %ecx, %eax
> > >> > > >         andl    $-8, %eax
> > >> > > >         cmpl    %eax, %ecx
> > >> > > >         je      .L21
> > >> > > >         vzeroupper
> > >> > > > .L12:
> > >> > > >         movl    (%r8,%rax,4), %edx
> > >> > > >         addl    (%rsi,%rax,4), %edx
> > >> > > >         movl    %edx, (%rdi,%rax,4)
> > >> > > >         addq    $1, %rax
> > >> > > >         cmpl    %eax, %ecx
> > >> > > >         jne     .L12
> > >> > > >
> > >> > > > As measured with SPEC2017 on EMR, the patch (N-Iter) improves performance
> > >> > > > by 4.11% with an extra 2.8% code size, while the cheap cost model improves
> > >> > > > performance by 5.74% with an extra 8.88% code size.  The details are below.
> > >> > >
> > >> > > I'm confused by this - are the N-Iter numbers on top of the cheap cost
> > >> > > model numbers?
> > >> > No, it's N-Iter vs. base (very-cheap cost model), and cheap vs. base.
> > >> > >
> > >> > > > Performance measured with -march=x86-64-v3 -O2 on EMR
> > >> > > >
> > >> > > >                     N-Iter      cheap cost model
> > >> > > > 500.perlbench_r     -0.12%      -0.12%
> > >> > > > 502.gcc_r           0.44%       -0.11%
> > >> > > > 505.mcf_r           0.17%       4.46%
> > >> > > > 520.omnetpp_r       0.28%       -0.27%
> > >> > > > 523.xalancbmk_r     0.00%       5.93%
> > >> > > > 525.x264_r          -0.09%      23.53%
> > >> > > > 531.deepsjeng_r     0.19%       0.00%
> > >> > > > 541.leela_r         0.22%       0.00%
> > >> > > > 548.exchange2_r     -11.54%     -22.34%
> > >> > > > 557.xz_r            0.74%       0.49%
> > >> > > > GEOMEAN INT         -1.04%      0.60%
> > >> > > >
> > >> > > > 503.bwaves_r        3.13%       4.72%
> > >> > > > 507.cactuBSSN_r     1.17%       0.29%
> > >> > > > 508.namd_r          0.39%       6.87%
> > >> > > > 510.parest_r        3.14%       8.52%
> > >> > > > 511.povray_r        0.10%       -0.20%
> > >> > > > 519.lbm_r           -0.68%      10.14%
> > >> > > > 521.wrf_r           68.20%      76.73%
> > >> > >
> > >> > > So this seems to regress as well?
> > >> > N-Iter improves performance less than the cheap cost model; that's
> > >> > expected, and it is not a regression.
> > >> > >
> > >> > > > 526.blender_r       0.12%       0.12%
> > >> > > > 527.cam4_r          19.67%      23.21%
> > >> > > > 538.imagick_r       0.12%       0.24%
> > >> > > > 544.nab_r           0.63%       0.53%
> > >> > > > 549.fotonik3d_r     14.44%      9.43%
> > >> > > > 554.roms_r          12.39%      0.00%
> > >> > > > GEOMEAN FP          8.26%       9.41%
> > >> > > > GEOMEAN ALL         4.11%       5.74%
> > >>
> > >> I've tested the patch on aarch64; it shows a similar improvement with
> > >> little code-size increase.  I haven't tested it on other backends, but I
> > >> think it would show similarly good improvements.
> > >
> > > I think overall this is expected, since a constant niter divisible by
> > > the VF isn't a common situation.  So the question is mostly whether
> > > we want to pay the size penalty or not.
> > >
> > > Looking only at the docs, the proposed change would make the very-cheap
> > > cost model nearly(?) equivalent to the cheap one, so maybe the answer is
> > > to default to cheap rather than very-cheap?  One difference seems to be
> > > that cheap allows alias versioning.
> >
> > I remember seeing cases in the past where we could generate an
> > excessive number of alias checks.  The cost model didn't account
> > for them very well, since the checks often became a fixed overhead
> > for all paths (both scalar and vector), especially if the checks
> > were fully if-converted, with one branch at the end.  The relevant
> > comparison is then between the original pre-vectorisation scalar code
> > and the code with alias checks, rather than between post-vectorisation
> > scalar code and post-vectorisation vector code.  Things might be better
> > now though.
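As a concrete sketch of the kind of loop Richard mentions (hypothetical,
not from the patch): dropping __restrict from foo1 forces runtime overlap
checks on the pointers before the vector loop can be entered, and the cost
of those checks is paid regardless of which version is taken.

void
foo2 (int* a, int* b, int* c, int n)
{
 /* Without __restrict the vectorizer must check at runtime that a does
    not overlap b or c; the cheap model allows this alias versioning,
    the very-cheap model (and the patch) does not.  */
 for (int i = 0; i != n; i++)
  a[i] = b[i] + c[i];
}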
>
> Yes, the cost model (aka niter) check should now be before the alias check,
> not if-converted, but of course the alias-checking overhead can still be
> quite big.
>
> > FTR, I don't object to relaxing the -O2 model.  It was deliberately
> > conservative, for a time when enabling vectorisation at -O2 was
> > somewhat controversial.  It was also heavily influenced by SVE,
> > where variable trip counts are not an issue.
>
> I agree - I think we can try for GCC 15.  Note that since we disallow epilogue
> vectorization with cheap, we might want to prefer smaller vector sizes,
> which means the target might want to adjust its vector_modes hook.
>
> > The proposal would also make GCC's behaviour more similar to Clang's.
>
> So should we adjust very-cheap to allow niter peeling as proposed, or
> should we switch the default at -O2 to cheap?

Any thoughts from other backend maintainers?

>
> Richard.
>
> > Thanks,
> > Richard



--
BR,
Hongtao
