On Fri, Oct 19, 2018 at 1:44 AM H.J. Lu <hjl.to...@gmail.com> wrote:
>
> On 10/18/18, Jan Hubicka <hubi...@ucw.cz> wrote:
> >> we need to generate
> >>
> >>      vxorp[ds]       %xmmN, %xmmN, %xmmN
> >>      ...
> >>      vcvtss2sd       f(%rip), %xmmN, %xmmX
> >>      ...
> >>      vcvtsi2ss       i(%rip), %xmmN, %xmmY
> >>
> >> to avoid partial XMM register stall.  This patch adds a pass to generate
> >> a single
> >>
> >>      vxorps          %xmmN, %xmmN, %xmmN
> >>
> >> at function entry, which is shared by all SF and DF conversions, instead
> >> of generating one
> >>
> >>      vxorp[ds]       %xmmN, %xmmN, %xmmN
> >>
> >> for each SF/DF conversion.
> >>
> >> Performance impacts on SPEC CPU 2017 rate with 1 copy using
> >>
> >> -Ofast -march=native -mfpmath=sse -fno-associative-math -funroll-loops
> >>
> >> are
> >>
> >> 1. On Broadwell server:
> >>
> >> 500.perlbench_r (-0.82%)
> >> 502.gcc_r (0.73%)
> >> 505.mcf_r (-0.24%)
> >> 520.omnetpp_r (-2.22%)
> >> 523.xalancbmk_r (-1.47%)
> >> 525.x264_r (0.31%)
> >> 531.deepsjeng_r (0.27%)
> >> 541.leela_r (0.85%)
> >> 548.exchange2_r (-0.11%)
> >> 557.xz_r (-0.34%)
> >> Geomean: (-0.23%)
> >>
> >> 503.bwaves_r (0.00%)
> >> 507.cactuBSSN_r (-1.88%)
> >> 508.namd_r (0.00%)
> >> 510.parest_r (-0.56%)
> >> 511.povray_r (0.49%)
> >> 519.lbm_r (-1.28%)
> >> 521.wrf_r (-0.28%)
> >> 526.blender_r (0.55%)
> >> 527.cam4_r (-0.20%)
> >> 538.imagick_r (2.52%)
> >> 544.nab_r (-0.18%)
> >> 549.fotonik3d_r (-0.51%)
> >> 554.roms_r (-0.22%)
> >> Geomean: (0.00%)
> >
> > I wonder why the patch seems to have more effect on specint that should not
> > care much
> > about float<->double conversions?
>
> These are within noise range.
>
> >> number of vxorp[ds]:
> >>
> >> before               after           difference
> >> 14570                4515            -69%
> >>
> >> OK for trunk?
> >
> > This looks very nice though.
> >
>
> > +  if (v4sf_const0)
> > +    {
> > +      /* Generate a single vxorps at function entry and perform df
> > +      rescan. */
> > +      bb = ENTRY_BLOCK_PTR_FOR_FN (cfun)->next_bb;
> > +      insn = BB_HEAD (bb);
> > +      set = gen_rtx_SET (v4sf_const0, CONST0_RTX (V4SFmode));
> > +      set_insn = emit_insn_after (set, insn);
> > +      df_insn_rescan (set_insn);
> > +      df_process_deferred_rescans ();
> > +    }
> >
> > It seems suboptimal to place the const0 at the entry of the function - if
> > the conversion happens in a cold region of the function this will just
> > increase register
> > pressure.  I guess the right answer would be to look for the postdominance
> > frontier
>
> Did you mean "the nearest common dominator"?
>
> > of the set of all uses of the zero register?
> >
>
> Here is the updated patch, which adds a pass to generate a single
>
>         vxorps          %xmmN, %xmmN, %xmmN
>
> at entry of the nearest common dominator for basic blocks with SF/DF
> conversions.  OK for trunk?
>
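
For readers following the thread, the "nearest common dominator" placement
can be illustrated with a small standalone sketch.  This is not the code
from the patch and it does not use GCC's dominance API; the toy CFG
representation (blocks numbered from 0, an `idom` array of immediate
dominators with the entry block as its own parent) and all function names
are assumptions made for illustration only.

```cpp
#include <cassert>
#include <vector>

// Hypothetical toy CFG: blocks are numbered 0..n-1, block 0 is the entry,
// and idom[b] is the immediate dominator of block b (idom[0] == 0).

// Depth of block b in the dominator tree (the entry block has depth 0).
static int dom_depth(const std::vector<int>& idom, int b) {
    int depth = 0;
    while (idom[b] != b) {
        b = idom[b];
        ++depth;
    }
    return depth;
}

// Nearest common dominator of two blocks: lift the deeper block until both
// are at the same depth, then walk both up the tree until they meet.
int nearest_common_dominator(const std::vector<int>& idom, int a, int b) {
    int da = dom_depth(idom, a);
    int db = dom_depth(idom, b);
    while (da > db) { a = idom[a]; --da; }
    while (db > da) { b = idom[b]; --db; }
    while (a != b) {
        a = idom[a];
        b = idom[b];
    }
    return a;
}

// Nearest common dominator of a non-empty set of blocks, e.g. every block
// that contains an SF/DF conversion needing the zeroed register.
int nearest_common_dominator_of_set(const std::vector<int>& idom,
                                    const std::vector<int>& blocks) {
    assert(!blocks.empty());
    int ncd = blocks[0];
    for (int b : blocks)
        ncd = nearest_common_dominator(idom, ncd, b);
    return ncd;
}
```

With a dominator tree 0 -> 1 -> {2, 3} and 0 -> 4 -> 5, conversions that
occur only in blocks 2 and 3 would put the single vxorps in block 1 rather
than at function entry, so paths that go through block 4 never pay for the
extra live register.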

PING:

https://gcc.gnu.org/ml/gcc-patches/2018-10/msg01175.html


-- 
H.J.
