Richard Biener <richard.guent...@gmail.com> writes:
> On Thu, Sep 12, 2024 at 4:50 PM Hongtao Liu <crazy...@gmail.com> wrote:
>>
>> On Wed, Sep 11, 2024 at 4:21 PM Hongtao Liu <crazy...@gmail.com> wrote:
>> >
>> > On Wed, Sep 11, 2024 at 4:04 PM Richard Biener
>> > <richard.guent...@gmail.com> wrote:
>> > >
>> > > On Wed, Sep 11, 2024 at 4:17 AM liuhongt <hongtao....@intel.com> wrote:
>> > > >
>> > > > GCC12 enables vectorization for O2 with the very cheap cost model,
>> > > > which is restricted to a constant trip count.  The vectorization
>> > > > capacity is very limited, in consideration of the codesize impact.
>> > > >
>> > > > The patch extends the very cheap cost model a little bit to support a
>> > > > variable trip count, but still disables peeling for gaps/alignment,
>> > > > runtime alias checking and epilogue vectorization, again with
>> > > > codesize in mind.
>> > > >
>> > > > So there are at most 2 versions of a loop for O2 vectorization: one
>> > > > vectorized main loop and one scalar/remainder loop.
>> > > >
>> > > > I.e.
>> > > >
>> > > > void
>> > > > foo1 (int* __restrict a, int* b, int* c, int n)
>> > > > {
>> > > >   for (int i = 0; i != n; i++)
>> > > >     a[i] = b[i] + c[i];
>> > > > }
>> > > >
>> > > > with -O2 -march=x86-64-v3 will be vectorized to
>> > > >
>> > > > .L10:
>> > > >         vmovdqu (%r8,%rax), %ymm0
>> > > >         vpaddd  (%rsi,%rax), %ymm0, %ymm0
>> > > >         vmovdqu %ymm0, (%rdi,%rax)
>> > > >         addq    $32, %rax
>> > > >         cmpq    %rdx, %rax
>> > > >         jne     .L10
>> > > >         movl    %ecx, %eax
>> > > >         andl    $-8, %eax
>> > > >         cmpl    %eax, %ecx
>> > > >         je      .L21
>> > > >         vzeroupper
>> > > > .L12:
>> > > >         movl    (%r8,%rax,4), %edx
>> > > >         addl    (%rsi,%rax,4), %edx
>> > > >         movl    %edx, (%rdi,%rax,4)
>> > > >         addq    $1, %rax
>> > > >         cmpl    %eax, %ecx
>> > > >         jne     .L12
>> > > >
>> > > > As measured with SPEC2017 on EMR, the patch (N-Iter) improves
>> > > > performance by 4.11% with an extra 2.8% codesize, and the cheap cost
>> > > > model improves performance by 5.74% with an extra 8.88% codesize.
>> > > > The details are below.
>> > >
>> > > I'm confused by this, are the N-Iter numbers on top of the cheap cost
>> > > model numbers?
>> > No, it's N-Iter vs. base (very cheap cost model), and cheap vs. base.
>> > >
>> > > > Performance measured with -march=x86-64-v3 -O2 on EMR
>> > > >
>> > > >                       N-Iter    cheap cost model
>> > > > 500.perlbench_r       -0.12%    -0.12%
>> > > > 502.gcc_r              0.44%    -0.11%
>> > > > 505.mcf_r              0.17%     4.46%
>> > > > 520.omnetpp_r          0.28%    -0.27%
>> > > > 523.xalancbmk_r        0.00%     5.93%
>> > > > 525.x264_r            -0.09%    23.53%
>> > > > 531.deepsjeng_r        0.19%     0.00%
>> > > > 541.leela_r            0.22%     0.00%
>> > > > 548.exchange2_r      -11.54%   -22.34%
>> > > > 557.xz_r               0.74%     0.49%
>> > > > GEOMEAN INT           -1.04%     0.60%
>> > > >
>> > > > 503.bwaves_r           3.13%     4.72%
>> > > > 507.cactuBSSN_r        1.17%     0.29%
>> > > > 508.namd_r             0.39%     6.87%
>> > > > 510.parest_r           3.14%     8.52%
>> > > > 511.povray_r           0.10%    -0.20%
>> > > > 519.lbm_r             -0.68%    10.14%
>> > > > 521.wrf_r             68.20%    76.73%
>> > >
>> > > So this seems to regress as well?
>> > N-Iter increases performance less than the cheap cost model; that's
>> > expected, it is not a regression.
>> > >
>> > > > 526.blender_r          0.12%     0.12%
>> > > > 527.cam4_r            19.67%    23.21%
>> > > > 538.imagick_r          0.12%     0.24%
>> > > > 544.nab_r              0.63%     0.53%
>> > > > 549.fotonik3d_r       14.44%     9.43%
>> > > > 554.roms_r            12.39%     0.00%
>> > > > GEOMEAN FP             8.26%     9.41%
>> > > > GEOMEAN ALL            4.11%     5.74%
>>
>> I've tested the patch on aarch64, and it shows a similar improvement
>> with little codesize increase.
>> I haven't tested it on other backends, but I think it would have
>> similarly good improvements.
>
> I think overall this is expected, since a constant niter divisible by
> the VF isn't a common situation.  So the question is mostly whether
> we want to pay the size penalty or not.
>
> Looking only at the docs, the proposed change would make the very-cheap
> cost model nearly(?) equivalent to the cheap one, so maybe the answer
> is to default to cheap rather than very-cheap?  One difference seems to
> be that cheap allows alias versioning.
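
For concreteness, a hypothetical variant of foo1 with the __restrict
qualifier dropped (foo2 below is not from the patch or its testsuite) is
the kind of loop where alias versioning matters: the stores through a
might alias the loads through b or c, so vectorising it requires a
runtime alias check, which the cheap model allows and the very cheap
model (with or without the patch) does not:

void
foo2 (int* a, int* b, int* c, int n)
{
  for (int i = 0; i != n; i++)
    a[i] = b[i] + c[i];
}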
I remember seeing cases in the past where we could generate an excessive
number of alias checks.  The cost model didn't account for them very well,
since the checks often became a fixed overhead for all paths (both scalar
and vector), especially if the checks were fully if-converted, with one
branch at the end.  The relevant comparison is then between the original
pre-vectorisation scalar code and the code with alias checks, rather than
between post-vectorisation scalar code and post-vectorisation vector code.
Things might be better now though.

FTR, I don't object to relaxing the -O2 model.  It was deliberately
conservative, for a time when enabling vectorisation at -O2 was somewhat
controversial.  It was also heavily influenced by SVE, where variable
trip counts are not an issue.

The proposal would also make GCC's behaviour more similar to Clang's.

Thanks,
Richard
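
As an aside, a rough hand-written sketch of the versioning structure
described above might look like the following.  It is not actual
vectoriser output, and the names (foo2_versioned, a_b_disjoint) are only
illustrative, but it shows why an if-converted overlap test with a single
branch at the end is a fixed cost that is also paid when the scalar
fallback is taken:

#include <stdint.h>

/* Hand-written sketch (not vectoriser output) of loop versioning with a
   runtime alias check for a loop like foo2 above.  */
void
foo2_versioned (int* a, int* b, int* c, int n)
{
  uintptr_t ap = (uintptr_t) a;
  uintptr_t bp = (uintptr_t) b;
  uintptr_t cp = (uintptr_t) c;
  uintptr_t len = (uintptr_t) n * sizeof (int);

  /* If-converted alias checks: straight-line arithmetic, no branches.  */
  int a_b_disjoint = (ap + len <= bp) | (bp + len <= ap);
  int a_c_disjoint = (ap + len <= cp) | (cp + len <= ap);

  /* Single branch at the end of the checks.  */
  if (a_b_disjoint & a_c_disjoint)
    {
      /* Vectorised version; the vectoriser would emit SIMD code here.  */
      for (int i = 0; i != n; i++)
        a[i] = b[i] + c[i];
    }
  else
    {
      /* Original scalar loop, reached only after paying for the checks.  */
      for (int i = 0; i != n; i++)
        a[i] = b[i] + c[i];
    }
}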