On Fri, Oct 19, 2018 at 1:44 AM H.J. Lu <hjl.to...@gmail.com> wrote:
>
> On 10/18/18, Jan Hubicka <hubi...@ucw.cz> wrote:
> >> we need to generate
> >>
> >>      vxorp[ds]       %xmmN, %xmmN, %xmmN
> >>      ...
> >>      vcvtss2sd       f(%rip), %xmmN, %xmmX
> >>      ...
> >>      vcvtsi2ss       i(%rip), %xmmN, %xmmY
> >>
> >> to avoid partial XMM register stall.  This patch adds a pass to generate
> >> a single
> >>
> >>      vxorps          %xmmN, %xmmN, %xmmN
> >>
> >> at function entry, which is shared by all SF and DF conversions, instead
> >> of generating one
> >>
> >>      vxorp[ds]       %xmmN, %xmmN, %xmmN
> >>
> >> for each SF/DF conversion.
> >>
> >> Performance impacts on SPEC CPU 2017 rate with 1 copy using
> >>
> >>   -Ofast -march=native -mfpmath=sse -fno-associative-math -funroll-loops
> >>
> >> are
> >>
> >> 1. On Broadwell server:
> >>
> >> 500.perlbench_r (-0.82%)
> >> 502.gcc_r (0.73%)
> >> 505.mcf_r (-0.24%)
> >> 520.omnetpp_r (-2.22%)
> >> 523.xalancbmk_r (-1.47%)
> >> 525.x264_r (0.31%)
> >> 531.deepsjeng_r (0.27%)
> >> 541.leela_r (0.85%)
> >> 548.exchange2_r (-0.11%)
> >> 557.xz_r (-0.34%)
> >> Geomean: (-0.23%)
> >>
> >> 503.bwaves_r (0.00%)
> >> 507.cactuBSSN_r (-1.88%)
> >> 508.namd_r (0.00%)
> >> 510.parest_r (-0.56%)
> >> 511.povray_r (0.49%)
> >> 519.lbm_r (-1.28%)
> >> 521.wrf_r (-0.28%)
> >> 526.blender_r (0.55%)
> >> 527.cam4_r (-0.20%)
> >> 538.imagick_r (2.52%)
> >> 544.nab_r (-0.18%)
> >> 549.fotonik3d_r (-0.51%)
> >> 554.roms_r (-0.22%)
> >> Geomean: (0.00%)
> >
> > I wonder why the patch seems to have more effect on specint that should not
> > care much about float<->double conversions?
>
> These are within noise range.
>
> >> number of vxorp[ds]:
> >>
> >> before          after           difference
> >> 14570           4515            -69%
> >>
> >> OK for trunk?
> >
> > This looks very nice though.
> >
> >
> > +  if (v4sf_const0)
> > +    {
> > +      /* Generate a single vxorps at function entry and perform df
> > +         rescan.  */
> > +      bb = ENTRY_BLOCK_PTR_FOR_FN (cfun)->next_bb;
> > +      insn = BB_HEAD (bb);
> > +      set = gen_rtx_SET (v4sf_const0, CONST0_RTX (V4SFmode));
> > +      set_insn = emit_insn_after (set, insn);
> > +      df_insn_rescan (set_insn);
> > +      df_process_deferred_rescans ();
> > +    }
> >
> > It seems suboptimal to place the const0 at the entry of function - if the
> > conversion happens in cold region of function this will just increase
> > register pressure.  I guess right answer would be to look for the
> > postdominance frontier
>
> Did you mean "the nearest common dominator"?
>
> > of the set of all uses of the zero register?
> >
>
> Here is the updated patch that adds a pass to generate a single
>
>      vxorps          %xmmN, %xmmN, %xmmN
>
> at entry of the nearest common dominator for basic blocks with SF/DF
> conversions.  OK for trunk?
>
PING:

https://gcc.gnu.org/ml/gcc-patches/2018-10/msg01175.html

-- 
H.J.