On Tue, Nov 12, 2019 at 2:48 AM Hongtao Liu <crazy...@gmail.com> wrote:
>
> On Tue, Nov 12, 2019 at 4:41 PM Richard Biener
> <richard.guent...@gmail.com> wrote:
> >
> > On Tue, Nov 12, 2019 at 9:29 AM Hongtao Liu <crazy...@gmail.com> wrote:
> > >
> > > On Tue, Nov 12, 2019 at 4:19 PM Richard Biener
> > > <richard.guent...@gmail.com> wrote:
> > > >
> > > > On Tue, Nov 12, 2019 at 8:36 AM Hongtao Liu <crazy...@gmail.com> wrote:
> > > > >
> > > > > Hi:
> > > > >   This patch sets X86_TUNE_AVX128_OPTIMAL as the default for all
> > > > > AVX targets, because we found there is still a performance gap
> > > > > between 128-bit and 256-bit auto-vectorization even with the
> > > > > epilogue vectorized.
> > > > >   The performance impact of making avx128_optimal the default on
> > > > > SPEC2017, with the options "-march=native -funroll-loops -Ofast
> > > > > -flto" on CLX, is as below:
> > > > >
> > > > >     INT rate
> > > > >     500.perlbench_r       -0.32%
> > > > >     502.gcc_r             -1.32%
> > > > >     505.mcf_r             -0.12%
> > > > >     520.omnetpp_r         -0.34%
> > > > >     523.xalancbmk_r       -0.65%
> > > > >     525.x264_r             2.23%
> > > > >     531.deepsjeng_r        0.81%
> > > > >     541.leela_r           -0.02%
> > > > >     548.exchange2_r       10.89%  ----------> big improvement
> > > > >     557.xz_r               0.38%
> > > > >     geomean for intrate    1.10%
> > > > >
> > > > >     FP rate
> > > > >     503.bwaves_r           1.41%
> > > > >     507.cactuBSSN_r       -0.14%
> > > > >     508.namd_r             1.54%
> > > > >     510.parest_r          -0.87%
> > > > >     511.povray_r           0.28%
> > > > >     519.lbm_r              0.32%
> > > > >     521.wrf_r             -0.54%
> > > > >     526.blender_r          0.59%
> > > > >     527.cam4_r            -2.70%
> > > > >     538.imagick_r          3.92%
> > > > >     544.nab_r              0.59%
> > > > >     549.fotonik3d_r       -5.44%  -------------> regression
> > > > >     554.roms_r            -2.34%
> > > > >     geomean for fprate    -0.28%
> > > > >
> > > > > The 10% improvement in 548.exchange2_r comes from a 9-layer nested
> > > > > loop whose innermost trip count is small (large enough for 128-bit
> > > > > vectorization, but not for 256-bit vectorization).  Since the trip
> > > > > count cannot be determined statically, the vectorizer chooses the
> > > > > 256-bit loop body, which is never actually entered.  Vectorizing the
> > > > > epilogue introduces some extra instructions; normally that wins back
> > > > > some performance, but because the loop is nested 9 levels deep, the
> > > > > cost of the extra instructions outweighs the gain.
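> > > > >
> > > > > For illustration only (this is not the exchange2 source, just a
> > > > > hypothetical C sketch of the shape of the problem: an innermost
> > > > > trip count that fits one 128-bit vector but never fills a 256-bit
> > > > > one):
> > > > >
> > > > >     /* N is only known at run time and is typically 4, so a
> > > > >        256-bit integer loop body (8 ints per iteration) is never
> > > > >        entered, while a 128-bit body (4 ints) is.  */
> > > > >     void
> > > > >     accumulate (int *a, const int *b, int n, int m, int N)
> > > > >     {
> > > > >       for (int i = 0; i < n; i++)
> > > > >         for (int j = 0; j < m; j++)
> > > > >           for (int k = 0; k < N; k++)   /* small trip count */
> > > > >             a[(i * m + j) * N + k] += b[k];
> > > > >     }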
> > > > >
> > > > > The 5.44% regression in 549.fotonik3d_r is because 256-bit
> > > > > vectorization is simply better than 128-bit vectorization there.
> > > > > Generally, enabling 256-bit or 512-bit vectorization reduces the
> > > > > cycle count but also lowers the clock frequency.  When the frequency
> > > > > loss is smaller than the cycle-count reduction, the wider vector
> > > > > width wins; otherwise the narrower one does.  The regression in
> > > > > 549.fotonik3d_r is due to this, and similarly for 554.roms_r and
> > > > > 527.cam4_r; for those three benchmarks, 512-bit vectorization is
> > > > > best.
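> > > > >
> > > > > As a purely hypothetical worked example of that trade-off (runtime
> > > > > is roughly cycles / frequency): if the 256-bit code needs 20% fewer
> > > > > cycles but the core frequency drops by 5%, runtime is about
> > > > > 0.80 / 0.95 ~= 0.84 of the 128-bit runtime, a win; if it only saves
> > > > > 3% of the cycles with the same 5% frequency drop, runtime is about
> > > > > 0.97 / 0.95 ~= 1.02, a loss.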
> > > > >
> > > > > Bootstrap and regression tests on i386 are OK.
> > > > > Ok for trunk?
> > > >
> > > > I don't think 128_optimal does what you think it does.  If you want to
> > > > prefer 128-bit AVX, adjust the preference; 128_optimal describes
> > > > a microarchitectural detail (AVX256 ops are split into two AVX128 ops).
> > > But it will set target_prefer_avx128 by default.
> > > ------------------------
> > > 2694  /* Enable 128-bit AVX instruction generation
> > > 2695     for the auto-vectorizer.  */
> > > 2696  if (TARGET_AVX128_OPTIMAL
> > > 2697      && (opts_set->x_prefer_vector_width_type == PVW_NONE))
> > > 2698    opts->x_prefer_vector_width_type = PVW_AVX128;
> > > -------------------------
> > > And it may be too confusing to add another tuning flag.
> >
> > Well, it's confusing to mix two things: defaulting the vector width
> > preference and the architectural detail of Bulldozer and early Zen parts.
> > So please split the tuning, and then re-benchmark with _just_ changing
> > the preference.
> Actually, the result is similar; I've tested both (the patch using
> avx128_optimal, and trunk gcc with -mprefer-vector-width=128 added).
> And I will run a test to see the effect of FDO.

It is hard to tell whether a 128-bit or a 256-bit vector size works better
in general.  For SPEC CPU 2017, the 128-bit vector size gives better overall
scores.  One can always change the vector size, even to 512 bits, since some
workloads are faster with 512-bit vectors.
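
For reference, the way to pick a width per build (with the same options
Hongtao used, and "..." standing for whatever sources or benchmarks one is
measuring) would be e.g.:

  gcc -Ofast -march=native -funroll-loops -flto -mprefer-vector-width=128 ...
  gcc -Ofast -march=native -funroll-loops -flto -mprefer-vector-width=512 ...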

-- 
H.J.
