On Mon, Jan 28, 2019 at 9:08 AM H.J. Lu <hjl.to...@gmail.com> wrote:
>
> On Tue, Jan 22, 2019 at 5:28 AM H.J. Lu <hjl.to...@gmail.com> wrote:
> >
> > On Tue, Jan 22, 2019 at 4:08 AM Richard Biener
> > <richard.guent...@gmail.com> wrote:
> > >
> > > On Mon, Jan 21, 2019 at 10:27 PM H.J. Lu <hjl.to...@gmail.com> wrote:
> > > >
> > > > On Mon, Jan 21, 2019 at 10:54 AM Jeff Law <l...@redhat.com> wrote:
> > > > >
> > > > > On 1/7/19 6:55 AM, H.J. Lu wrote:
> > > > > > On Sun, Dec 30, 2018 at 8:40 AM H.J. Lu <hjl.to...@gmail.com> wrote:
> > > > > >> On Wed, Nov 28, 2018 at 12:17 PM Jeff Law <l...@redhat.com> wrote:
> > > > > >>> On 11/28/18 12:48 PM, H.J. Lu wrote:
> > > > > >>>> On Mon, Nov 5, 2018 at 7:29 AM Jan Hubicka <hubi...@ucw.cz>
> > > > > >>>> wrote:
> > > > > >>>>>> On 11/5/18 7:21 AM, Jan Hubicka wrote:
> > > > > >>>>>>>> Did you mean "the nearest common dominator"?
> > > > > >>>>>>> If the nearest common dominator appears in the loop while
> > > > > >>>>>>> all uses are out of loops, this will result in suboptimal
> > > > > >>>>>>> xor placement.  In this case you want to split edges out
> > > > > >>>>>>> of the loop.
> > > > > >>>>>>>
> > > > > >>>>>>> In general this is what the LCM framework will do for you
> > > > > >>>>>>> if the problem is modelled in a similar way as in
> > > > > >>>>>>> mode_switching.  At function entry the mode is "no zero
> > > > > >>>>>>> register needed" and all conversions need mode "zero
> > > > > >>>>>>> register needed".  Mode switching should then make the
> > > > > >>>>>>> correct placement decisions (reaching the minimal number
> > > > > >>>>>>> of executions of xor).
> > > > > >>>>>>>
> > > > > >>>>>>> Jeff, what is your opinion on the approach taken by the
> > > > > >>>>>>> patch?  It seems like a special case of a more general
> > > > > >>>>>>> issue, but I do not see a very elegant way to solve it,
> > > > > >>>>>>> at least in the GCC 9 horizon, so if the placement is
> > > > > >>>>>>> correct we can probably go either with a new pass or with
> > > > > >>>>>>> making this part of mode switching (which is anyway run
> > > > > >>>>>>> by the x86 backend).
> > > > > >>>>>> So I haven't followed this discussion at all, but I did
> > > > > >>>>>> touch on this issue a month or two ago with a target patch
> > > > > >>>>>> that was trying to avoid the partial stalls.
> > > > > >>>>>>
> > > > > >>>>>> My assumption is that we're trying to find one or more
> > > > > >>>>>> places to initialize the upper half of an avx register so
> > > > > >>>>>> as to avoid a partial register stall at existing sites
> > > > > >>>>>> that set the upper half.
> > > > > >>>>>>
> > > > > >>>>>> This sounds like a classic PRE/LCM style problem (of which
> > > > > >>>>>> mode switching is just another variant).  A common-dominator
> > > > > >>>>>> approach is closer to a classic GCSE and is going to result
> > > > > >>>>>> in more initializations at sub-optimal points than a
> > > > > >>>>>> PRE/LCM style.
> > > > > >>>>> yes, it is the usual code placement problem.  It is a special
> > > > > >>>>> case because the zero register is not modified by the
> > > > > >>>>> conversion (we just need to have zero somewhere).  So
> > > > > >>>>> basically we do not have kills to the zero except for the
> > > > > >>>>> entry block.
> > > > > >>>>>
> > > > > >>>> Do you have a testcase to show that the nearest common
> > > > > >>>> dominator in the loop, while all uses are out of loops, leads
> > > > > >>>> to suboptimal xor placement?
> > > > > >>> I don't have a testcase, but it's all but certain the nearest
> > > > > >>> common dominator is going to be a suboptimal placement.  That's
> > > > > >>> going to create paths where you're going to emit the xor when
> > > > > >>> it's not used.
> > > > > >>>
> > > > > >>> The whole point of the LCM algorithms is they are optimal in
> > > > > >>> terms of expression evaluations.
> > > > > >> We tried LCM and it didn't work well for this case.  LCM places
> > > > > >> a single VXOR close to the location where it is needed, which
> > > > > >> can be inside a loop.  There is nothing wrong with the LCM
> > > > > >> algorithms.  But this doesn't solve
> > > > > >>
> > > > > >> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=87007
> > > > > >>
> > > > > >> where VXOR is executed multiple times inside a function, instead
> > > > > >> of just once.  We are investigating generating a single VXOR at
> > > > > >> the entry of the nearest dominator for basic blocks with SF/DF
> > > > > >> conversions, which is in the fake loop that contains the whole
> > > > > >> function:
> > > > > >>
> > > > > >>   bb = nearest_common_dominator_for_set (CDI_DOMINATORS,
> > > > > >>                                          convert_bbs);
> > > > > >>   while (bb->loop_father->latch
> > > > > >>          != EXIT_BLOCK_PTR_FOR_FN (cfun))
> > > > > >>     bb = get_immediate_dominator (CDI_DOMINATORS,
> > > > > >>                                   bb->loop_father->header);
> > > > > >>
> > > > > >>   insn = BB_HEAD (bb);
> > > > > >>   if (!NONDEBUG_INSN_P (insn))
> > > > > >>     insn = next_nonnote_nondebug_insn (insn);
> > > > > >>   set = gen_rtx_SET (v4sf_const0, CONST0_RTX (V4SFmode));
> > > > > >>   set_insn = emit_insn_before (set, insn);
> > > > > >>
> > > > > > Here is the updated patch.  OK for trunk?
> > > > > >
> > > > > > Thanks.
> > > > > >
> > > > > > -- H.J.
> > > > > >
> > > > > > 0001-i386-Add-pass_remove_partial_avx_dependency.patch
> > > > > >
> > > > > > From 6eca7dbf282d7e2a5cde41bffeca66195d72d48e Mon Sep 17 00:00:00 2001
> > > > > > From: "H.J. Lu" <hjl.to...@gmail.com>
> > > > > > Date: Mon, 7 Jan 2019 05:44:59 -0800
> > > > > > Subject: [PATCH] i386: Add pass_remove_partial_avx_dependency
> > > > > >
> > > > > > With -mavx, for
> > > > > >
> > > > > > $ cat foo.i
> > > > > > extern float f;
> > > > > > extern double d;
> > > > > > extern int i;
> > > > > >
> > > > > > void
> > > > > > foo (void)
> > > > > > {
> > > > > >   d = f;
> > > > > >   f = i;
> > > > > > }
> > > > > >
> > > > > > we need to generate
> > > > > >
> > > > > >   vxorp[ds]  %xmmN, %xmmN, %xmmN
> > > > > >   ...
> > > > > >   vcvtss2sd  f(%rip), %xmmN, %xmmX
> > > > > >   ...
> > > > > >   vcvtsi2ss  i(%rip), %xmmN, %xmmY
> > > > > >
> > > > > > to avoid a partial XMM register stall.  This patch adds a pass to
> > > > > > generate a single
> > > > > >
> > > > > >   vxorps  %xmmN, %xmmN, %xmmN
> > > > > >
> > > > > > at the entry of the nearest dominator for basic blocks with SF/DF
> > > > > > conversions, which is in the fake loop that contains the whole
> > > > > > function, instead of generating one
> > > > > >
> > > > > >   vxorp[ds]  %xmmN, %xmmN, %xmmN
> > > > > >
> > > > > > for each SF/DF conversion.
> > > > > >
> > > > > > NB: The LCM algorithm isn't appropriate here since it may place
> > > > > > a vxorps inside the loop.  A simple testcase shows this:
> > > > > >
> > > > > > $ cat badcase.c
> > > > > >
> > > > > > extern float f;
> > > > > > extern double d;
> > > > > >
> > > > > > void
> > > > > > foo (int n, int k)
> > > > > > {
> > > > > >   for (int j = 0; j != n; j++)
> > > > > >     if (j < k)
> > > > > >       d = f;
> > > > > > }
> > > > > >
> > > > > > It generates
> > > > > >
> > > > > >   ...
> > > > > >   loop:
> > > > > >     if (j < k)
> > > > > >       vxorps     %xmm0, %xmm0, %xmm0
> > > > > >       vcvtss2sd  %xmm1, %xmm0, %xmm0
> > > > > >   ...
> > > > > >   loopend
> > > > > >   ...
> > > > > >
> > > > > > This is because LCM only works when there is a certain benefit.
> > > > > > But for a conditional branch, LCM wouldn't move
> > > > > >
> > > > > >   vxorps  %xmm0, %xmm0, %xmm0
> > > > > It works this way for a reason.  There are obviously paths through
> > > > > the loop where the conversion does not happen and thus the vxor is
> > > > > not needed or desirable on those paths.
> > > > >
> > > > > That's a fundamental property of the LCM algorithm -- it never
> > > > > inserts an evaluation on a path through the CFG where it will not
> > > > > be used.
> > > > >
> > > > > Your algorithm of inserting into the dominator block will introduce
> > > > > runtime executions of the vxor on paths where it is not needed.
> > > > >
> > > > > It's well known that relaxing that property of LCM can result in
> > > > > better code generation in some circumstances.  Block copying and
> > > > > loop restructuring are the gold standard for dealing with this kind
> > > > > of problem.
> > > > >
> > > > > In this case you could split the iteration space so that you have
> > > > > two loops, one for 0..k and the other for k..n.  Note that GCC has
> > > > > support for this kind of loop restructuring.  This has the
> > > > > advantage that the j < k test does not happen on each iteration of
> > > > > the loop and the vxor stuff via LCM would be optimal.
> > > > >
> > > > > There are many other cases where copying and restructuring result
> > > > > in better common subexpression elimination (which is what you're
> > > > > doing).  Probably the best work I've seen in this space is Bodik's
> > > > > thesis.  Click's work from '95 touches on some of this as well,
> > > > > but isn't as relevant to this specific instance.
> > > > >
> > > > > Anyway, whether or not the patch should move forward is really up
> > > > > to Jan (and Uros if he wants to be involved) I think.  I'm not
> > > > > fundamentally opposed to HJ's approach as I'm aware of the
> > > > > different tradeoffs.
> > > > >
> > > > > HJ's approach of pulling into the dominator block can result in
> > > > > unnecessary evaluations.  But it can also reduce the number of
> > > > > evaluations in other cases.  It really depends on the runtime
> > > > > behavior of the code.  One could argue that the vxor stuff we're
> > > > > talking about is most likely happening in loops, and probably not
> > > > > in deeply nested control structures within those loops.  Thus
> > > > > pulling them out more aggressively ala LICM may be better than LCM.
> > > > True, there is a trade-off.  My approach inserts a vxorps at the
> > > > last possible position.  Yes, vxorps will always be executed even
> > > > if it may not be required.  Since it is executed only once in all
> > > > cases, it is a win overall.
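For reference, the index-set splitting Jeff describes would restructure
badcase.c roughly as follows.  This is a hand-written sketch (the split
point, the clamping and the function name are mine), not compiler output,
and it assumes n >= 0 while ignoring overflow corner cases:

extern float f;
extern double d;

void
foo_split (int n, int k)
{
  /* Split the iteration space at min (k, n): the conversion happens
     unconditionally in the first chunk only, so LCM can place a single
     vxorps in front of that loop instead of inside it.  */
  int m = k < n ? k : n;
  if (m < 0)
    m = 0;
  int j = 0;
  for (; j < m; j++)   /* 0 .. min (k, n): j < k is always true here.  */
    d = f;
  for (; j != n; j++)  /* min (k, n) .. n: no conversion at all.  */
    ;
}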
> > > Hopefully a simple vpxor won't end up powering up the other AVX512
> > > unit if it lay dormant ...
> >
> > A 128-bit AVX vpxor won't touch AVX512.
> >
> > > And if we ever get to the state of having two separate ISAs in the
> > > same function then you'd need to make sure you can execute vpxor in
> > > the place you are inserting since it may now be executed when it
> > > wasn't before (and I assume you already check that you do not zero
> > > the reg if there's a value live in it if the conditional def you are
> > > instrumenting is not executed).
> >
> > A dedicated pseudo register is allocated and zeroed for INT->FP and
> > FP->FP conversions.  IRA/LRA take care of the rest.
> >
>
> PING:
>
> https://gcc.gnu.org/ml/gcc-patches/2019-01/msg00298.html
>
Here is the updated patch, adjusted after the PR target/89071 fix.  OK for
trunk?

Thanks.

-- H.J.
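To summarize what the new pass does to each flagged conversion before the
patch itself: every SF<->DF or integer->FP conversion marked with the new
partial_xmm_update attribute is rewritten so the untouched vector elements
come from a single zeroed V4SF pseudo.  Roughly, in RTL terms (DF
destination shown; the register names d, s, tmp and zero are only
illustrative of the shape the vec_duplicate/vec_merge/subreg code below
builds):

Before:

  (set (reg:DF d) (float_extend:DF (reg:SF s)))

After:

  (set (reg:V2DF tmp)
       (vec_merge:V2DF
         (vec_duplicate:V2DF (float_extend:DF (reg:SF s)))
         (subreg:V2DF (reg:V4SF zero) 0)
         (const_int 1)))
  (set (reg:DF d) (subreg:DF (reg:V2DF tmp) 0))

where "zero" is the one V4SF pseudo that gets the single vxorps emitted at
the nearest common dominator.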
From 1c35abb368f26cc601e8badf22c8729156429251 Mon Sep 17 00:00:00 2001
From: "H.J. Lu" <hjl.tools@gmail.com>
Date: Mon, 7 Jan 2019 05:44:59 -0800
Subject: [PATCH] [8/9 Regression] i386: Add pass_remove_partial_avx_dependency

With -mavx, for

$ cat foo.i
extern float f;
extern double d;
extern int i;

void
foo (void)
{
  d = f;
  f = i;
}

we need to generate

  vxorp[ds]  %xmmN, %xmmN, %xmmN
  ...
  vcvtss2sd  f(%rip), %xmmN, %xmmX
  ...
  vcvtsi2ss  i(%rip), %xmmN, %xmmY

to avoid a partial XMM register stall.  This patch adds a pass to
generate a single

  vxorps  %xmmN, %xmmN, %xmmN

at the entry of the nearest dominator for basic blocks with SF/DF
conversions, which is in the fake loop that contains the whole function,
instead of generating one

  vxorp[ds]  %xmmN, %xmmN, %xmmN

for each SF/DF conversion.

NB: The LCM algorithm isn't appropriate here since it may place a vxorps
inside the loop.  A simple testcase shows this:

$ cat badcase.c

extern float f;
extern double d;

void
foo (int n, int k)
{
  for (int j = 0; j != n; j++)
    if (j < k)
      d = f;
}

It generates

  ...
  loop:
    if (j < k)
      vxorps     %xmm0, %xmm0, %xmm0
      vcvtss2sd  f(%rip), %xmm0, %xmm0
    ...
  loopend
  ...

This is because LCM only works when there is a certain benefit.  But for
a conditional branch, LCM wouldn't move

  vxorps  %xmm0, %xmm0, %xmm0

out of the loop.  SPEC CPU 2017 on Intel Xeon with AVX512 shows:

1. The nearest dominator

|RATE            |Improvement|
|500.perlbench_r |  0.55%    |
|538.imagick_r   |  8.43%    |
|544.nab_r       |  0.71%    |

2. LCM

|RATE            |Improvement|
|500.perlbench_r | -0.76%    |
|538.imagick_r   |  7.96%    |
|544.nab_r       | -0.13%    |

Performance impacts of SPEC CPU 2017 rate on Intel Xeon with AVX512
using

  -Ofast -flto -march=skylake-avx512 -funroll-loops

before commit e739972ad6ad05e32a1dd5c29c0b950a4c4bd576

Author: uros <uros@138bc75d-0d04-0410-961f-82ee72b054a4>
Date:   Thu Jan 31 20:06:42 2019 +0000

    PR target/89071
    * config/i386/i386.md (*extendsfdf2): Split out reg->reg
    alternative to avoid partial SSE register stall for TARGET_AVX.
    (truncdfsf2): Ditto.
    (sse4_1_round<mode>2): Ditto.

    git-svn-id: svn+ssh://gcc.gnu.org/svn/gcc/trunk@268427 138bc75d-0d04-0410-961f-82ee72b054a4

are:

|INT RATE        |Improvement|
|500.perlbench_r |  0.55%    |
|502.gcc_r       |  0.14%    |
|505.mcf_r       |  0.08%    |
|523.xalancbmk_r |  0.18%    |
|525.x264_r      | -0.49%    |
|531.deepsjeng_r | -0.04%    |
|541.leela_r     | -0.26%    |
|548.exchange2_r | -0.3%     |
|557.xz_r        |BuildSame  |

|FP RATE         |Improvement|
|503.bwaves_r    | -0.29%    |
|507.cactuBSSN_r |  0.04%    |
|508.namd_r      | -0.74%    |
|510.parest_r    | -0.01%    |
|511.povray_r    |  2.23%    |
|519.lbm_r       |  0.1%     |
|521.wrf_r       |  0.49%    |
|526.blender_r   |  0.13%    |
|527.cam4_r      |  0.65%    |
|538.imagick_r   |  8.43%    |
|544.nab_r       |  0.71%    |
|549.fotonik3d_r |  0.15%    |
|554.roms_r      |  0.08%    |

After commit e739972ad6ad05e32a1dd5c29c0b950a4c4bd576, on Skylake client,
impacts on 538.imagick_r with

  -fno-unsafe-math-optimizations -march=native -Ofast -funroll-loops -flto

1. Size comparison:

before:

   text    data    bss     dec      hex     filename
2465633    8352    4528    2478513  25d1b1  imagick_r

after:

   text    data    bss     dec      hex     filename
2447145    8352    4528    2460025  258979  imagick_r

2. Number of vxorps:

   before   after   difference
   6890     5311    -29.73%

3. Performance improvement:

|RATE            |Improvement|
|538.imagick_r   |  4.87%    |

gcc/

2019-02-01  H.J. Lu  <hongjiu.lu@intel.com>
	    Hongtao Liu  <hongtao.liu@intel.com>
	    Sunil K Pandey  <sunil.k.pandey@intel.com>

	PR target/87007
	* config/i386/i386-passes.def: Add
	pass_remove_partial_avx_dependency.
	* config/i386/i386-protos.h
	(make_pass_remove_partial_avx_dependency): New.
	* config/i386/i386.c (remove_partial_avx_dependency): New
	function.
	(pass_data_remove_partial_avx_dependency): New.
	(pass_remove_partial_avx_dependency): Likewise.
	(make_pass_remove_partial_avx_dependency): Likewise.
	* config/i386/i386.md (partial_xmm_update): New attribute.
	(*extendsfdf2): Add partial_xmm_update.
	(truncdfsf2): Likewise.
	(*float<SWI48:mode><MODEF:mode>2): Likewise.
	(SF/DF conversion splitters): Disabled for TARGET_AVX.

gcc/testsuite/

2019-02-01  H.J. Lu  <hongjiu.lu@intel.com>
	    Hongtao Liu  <hongtao.liu@intel.com>
	    Sunil K Pandey  <sunil.k.pandey@intel.com>

	PR target/87007
	* gcc.target/i386/pr87007-1.c: New test.
	* gcc.target/i386/pr87007-2.c: Likewise.
---
 gcc/config/i386/i386-passes.def           |   2 +
 gcc/config/i386/i386-protos.h             |   2 +
 gcc/config/i386/i386.c                    | 174 ++++++++++++++++++++++
 gcc/config/i386/i386.md                   |  16 +-
 gcc/testsuite/gcc.target/i386/pr87007-1.c |  15 ++
 gcc/testsuite/gcc.target/i386/pr87007-2.c |  18 +++
 6 files changed, 224 insertions(+), 3 deletions(-)
 create mode 100644 gcc/testsuite/gcc.target/i386/pr87007-1.c
 create mode 100644 gcc/testsuite/gcc.target/i386/pr87007-2.c

diff --git a/gcc/config/i386/i386-passes.def b/gcc/config/i386/i386-passes.def
index 87cfd94b8f6..f4facdc65d4 100644
--- a/gcc/config/i386/i386-passes.def
+++ b/gcc/config/i386/i386-passes.def
@@ -31,3 +31,5 @@ along with GCC; see the file COPYING3.  If not see
   INSERT_PASS_BEFORE (pass_cse2, 1, pass_stv, true /* timode_p */);
 
   INSERT_PASS_BEFORE (pass_shorten_branches, 1, pass_insert_endbranch);
+
+  INSERT_PASS_AFTER (pass_combine, 1, pass_remove_partial_avx_dependency);
diff --git a/gcc/config/i386/i386-protos.h b/gcc/config/i386/i386-protos.h
index 2d600173917..83645e89a81 100644
--- a/gcc/config/i386/i386-protos.h
+++ b/gcc/config/i386/i386-protos.h
@@ -369,3 +369,5 @@ class rtl_opt_pass;
 extern rtl_opt_pass *make_pass_insert_vzeroupper (gcc::context *);
 extern rtl_opt_pass *make_pass_stv (gcc::context *);
 extern rtl_opt_pass *make_pass_insert_endbranch (gcc::context *);
+extern rtl_opt_pass *make_pass_remove_partial_avx_dependency
+  (gcc::context *);
diff --git a/gcc/config/i386/i386.c b/gcc/config/i386/i386.c
index 4e67abe8764..b8e39176c6a 100644
--- a/gcc/config/i386/i386.c
+++ b/gcc/config/i386/i386.c
@@ -2793,6 +2793,180 @@ make_pass_insert_endbranch (gcc::context *ctxt)
   return new pass_insert_endbranch (ctxt);
 }
 
+/* At entry of the nearest common dominator for basic blocks with
+   conversions, generate a single
+	vxorps %xmmN, %xmmN, %xmmN
+   for all
+	vcvtss2sd  op, %xmmN, %xmmX
+	vcvtsd2ss  op, %xmmN, %xmmX
+	vcvtsi2ss  op, %xmmN, %xmmX
+	vcvtsi2sd  op, %xmmN, %xmmX
+
+   NB: We want to generate only a single vxorps to cover the whole
+   function.  The LCM algorithm isn't appropriate here since it may
+   place a vxorps inside the loop.
+   */
+
+static unsigned int
+remove_partial_avx_dependency (void)
+{
+  timevar_push (TV_MACH_DEP);
+
+  calculate_dominance_info (CDI_DOMINATORS);
+  df_set_flags (DF_DEFER_INSN_RESCAN);
+  df_chain_add_problem (DF_DU_CHAIN | DF_UD_CHAIN);
+  df_md_add_problem ();
+  df_analyze ();
+
+  bitmap_obstack_initialize (NULL);
+  bitmap convert_bbs = BITMAP_ALLOC (NULL);
+
+  basic_block bb;
+  rtx_insn *insn, *set_insn;
+  rtx set;
+  rtx v4sf_const0 = NULL_RTX;
+
+  FOR_EACH_BB_FN (bb, cfun)
+    {
+      FOR_BB_INSNS (bb, insn)
+	{
+	  if (!NONDEBUG_INSN_P (insn))
+	    continue;
+
+	  set = single_set (insn);
+	  if (!set)
+	    continue;
+
+	  if (get_attr_partial_xmm_update (insn)
+	      != PARTIAL_XMM_UPDATE_TRUE)
+	    continue;
+
+	  if (!v4sf_const0)
+	    v4sf_const0 = gen_reg_rtx (V4SFmode);
+
+	  /* Convert PARTIAL_XMM_UPDATE_TRUE insns, DF -> SF, SF -> DF,
+	     SI -> SF, SI -> DF, DI -> SF, DI -> DF, to vec_dup and
+	     vec_merge with subreg.  */
+	  rtx src = SET_SRC (set);
+	  rtx dest = SET_DEST (set);
+	  machine_mode dest_mode = GET_MODE (dest);
+
+	  rtx zero;
+	  machine_mode dest_vecmode;
+	  if (dest_mode == E_SFmode)
+	    {
+	      dest_vecmode = V4SFmode;
+	      zero = v4sf_const0;
+	    }
+	  else
+	    {
+	      dest_vecmode = V2DFmode;
+	      zero = gen_rtx_SUBREG (V2DFmode, v4sf_const0, 0);
+	    }
+
+	  /* Change source to vector mode.  */
+	  src = gen_rtx_VEC_DUPLICATE (dest_vecmode, src);
+	  src = gen_rtx_VEC_MERGE (dest_vecmode, src, zero,
+				   GEN_INT (HOST_WIDE_INT_1U));
+	  /* Change destination to vector mode.  */
+	  rtx vec = gen_reg_rtx (dest_vecmode);
+	  /* Generate an XMM vector SET.  */
+	  set = gen_rtx_SET (vec, src);
+	  set_insn = emit_insn_before (set, insn);
+	  df_insn_rescan (set_insn);
+
+	  src = gen_rtx_SUBREG (dest_mode, vec, 0);
+	  set = gen_rtx_SET (dest, src);
+
+	  /* Drop possible dead definitions.  */
+	  PATTERN (insn) = set;
+
+	  INSN_CODE (insn) = -1;
+	  recog_memoized (insn);
+	  df_insn_rescan (insn);
+	  bitmap_set_bit (convert_bbs, bb->index);
+	}
+    }
+
+  if (v4sf_const0)
+    {
+      /* (Re-)discover loops so that bb->loop_father can be used in the
+	 analysis below.  */
+      loop_optimizer_init (AVOID_CFG_MODIFICATIONS);
+
+      /* Generate a vxorps at entry of the nearest dominator for basic
+	 blocks with conversions, which is in the fake loop that
+	 contains the whole function, so that there is only a single
+	 vxorps in the whole function.
+	 */
+      bb = nearest_common_dominator_for_set (CDI_DOMINATORS,
+					     convert_bbs);
+      while (bb->loop_father->latch
+	     != EXIT_BLOCK_PTR_FOR_FN (cfun))
+	bb = get_immediate_dominator (CDI_DOMINATORS,
+				      bb->loop_father->header);
+
+      insn = BB_HEAD (bb);
+      if (!NONDEBUG_INSN_P (insn))
+	insn = next_nonnote_nondebug_insn (insn);
+      set = gen_rtx_SET (v4sf_const0, CONST0_RTX (V4SFmode));
+      set_insn = emit_insn_before (set, insn);
+      df_insn_rescan (set_insn);
+      df_process_deferred_rescans ();
+      loop_optimizer_finalize ();
+    }
+
+  bitmap_obstack_release (NULL);
+  BITMAP_FREE (convert_bbs);
+
+  timevar_pop (TV_MACH_DEP);
+  return 0;
+}
+
+namespace {
+
+const pass_data pass_data_remove_partial_avx_dependency =
+{
+  RTL_PASS, /* type */
+  "rpad", /* name */
+  OPTGROUP_NONE, /* optinfo_flags */
+  TV_MACH_DEP, /* tv_id */
+  0, /* properties_required */
+  0, /* properties_provided */
+  0, /* properties_destroyed */
+  0, /* todo_flags_start */
+  TODO_df_finish, /* todo_flags_finish */
+};
+
+class pass_remove_partial_avx_dependency : public rtl_opt_pass
+{
+public:
+  pass_remove_partial_avx_dependency (gcc::context *ctxt)
+    : rtl_opt_pass (pass_data_remove_partial_avx_dependency, ctxt)
+  {}
+
+  /* opt_pass methods: */
+  virtual bool gate (function *)
+    {
+      return (TARGET_AVX
+	      && TARGET_SSE_PARTIAL_REG_DEPENDENCY
+	      && TARGET_SSE_MATH
+	      && optimize
+	      && optimize_function_for_speed_p (cfun));
+    }
+
+  virtual unsigned int execute (function *)
+    {
+      return remove_partial_avx_dependency ();
+    }
+}; // class pass_rpad
+
+} // anon namespace
+
+rtl_opt_pass *
+make_pass_remove_partial_avx_dependency (gcc::context *ctxt)
+{
+  return new pass_remove_partial_avx_dependency (ctxt);
+}
+
 /* Return true if a red-zone is in use.  We can't use red-zone when
    there are local indirect jumps, like "indirect_jump" or "tablejump",
    which jumps to another place in the function, since "call" in the
diff --git a/gcc/config/i386/i386.md b/gcc/config/i386/i386.md
index 744f155fca6..f589bbe6e68 100644
--- a/gcc/config/i386/i386.md
+++ b/gcc/config/i386/i386.md
@@ -778,6 +778,10 @@
 (define_attr "i387_cw" "trunc,floor,ceil,uninitialized,any"
   (const_string "any"))
 
+;; Define attribute to indicate insns with partial XMM register update.
+(define_attr "partial_xmm_update" "false,true"
+  (const_string "false"))
+
 ;; Define attribute to classify add/sub insns that consumes carry flag (CF)
 (define_attr "use_carry" "0,1" (const_string "0"))
 
@@ -4391,6 +4395,7 @@
     }
 }
   [(set_attr "type" "fmov,fmov,ssecvt,ssecvt")
+   (set_attr "partial_xmm_update" "false,false,false,true")
    (set_attr "prefix" "orig,orig,maybe_vex,maybe_vex")
    (set_attr "mode" "SF,XF,DF,DF")
    (set (attr "enabled")
@@ -4480,7 +4485,8 @@
   [(set (match_operand:DF 0 "sse_reg_operand")
	(float_extend:DF
	  (match_operand:SF 1 "nonimmediate_operand")))]
-  "TARGET_SSE_PARTIAL_REG_DEPENDENCY && epilogue_completed
+  "!TARGET_AVX
+   && TARGET_SSE_PARTIAL_REG_DEPENDENCY && epilogue_completed
    && optimize_function_for_speed_p (cfun)
    && (!REG_P (operands[1])
        || (!TARGET_AVX && REGNO (operands[0]) != REGNO (operands[1])))
@@ -4557,6 +4563,7 @@
     }
 }
   [(set_attr "type" "fmov,fmov,ssecvt,ssecvt")
+   (set_attr "partial_xmm_update" "false,false,false,true")
    (set_attr "mode" "SF")
    (set (attr "enabled")
      (if_then_else
@@ -4640,7 +4647,8 @@
   [(set (match_operand:SF 0 "sse_reg_operand")
	(float_truncate:SF
	  (match_operand:DF 1 "nonimmediate_operand")))]
-  "TARGET_SSE_PARTIAL_REG_DEPENDENCY && epilogue_completed
+  "!TARGET_AVX
+   && TARGET_SSE_PARTIAL_REG_DEPENDENCY && epilogue_completed
    && optimize_function_for_speed_p (cfun)
    && (!REG_P (operands[1])
        || (!TARGET_AVX && REGNO (operands[0]) != REGNO (operands[1])))
@@ -5016,6 +5024,7 @@
    %vcvtsi2<MODEF:ssemodesuffix><SWI48:rex64suffix>\t{%1, %d0|%d0, %1}
    %vcvtsi2<MODEF:ssemodesuffix><SWI48:rex64suffix>\t{%1, %d0|%d0, %1}"
   [(set_attr "type" "fmov,sseicvt,sseicvt")
+   (set_attr "partial_xmm_update" "false,true,true")
    (set_attr "prefix" "orig,maybe_vex,maybe_vex")
    (set_attr "mode" "<MODEF:MODE>")
    (set (attr "prefix_rex")
@@ -5144,7 +5153,8 @@
 (define_split
   [(set (match_operand:MODEF 0 "sse_reg_operand")
	(float:MODEF (match_operand:SWI48 1 "nonimmediate_operand")))]
-  "TARGET_SSE_PARTIAL_REG_DEPENDENCY && epilogue_completed
+  "!TARGET_AVX
+   && TARGET_SSE_PARTIAL_REG_DEPENDENCY && epilogue_completed
    && optimize_function_for_speed_p (cfun)
    && (!EXT_REX_SSE_REG_P (operands[0])
	|| TARGET_AVX512VL)"
diff --git a/gcc/testsuite/gcc.target/i386/pr87007-1.c b/gcc/testsuite/gcc.target/i386/pr87007-1.c
new file mode 100644
index 00000000000..93cf1dcdfa5
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/pr87007-1.c
@@ -0,0 +1,15 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -march=skylake" } */
+
+extern float f;
+extern double d;
+extern int i;
+
+void
+foo (void)
+{
+  d = f;
+  f = i;
+}
+
+/* { dg-final { scan-assembler-times "vxorps\[^\n\r\]*xmm\[0-9\]" 1 } } */
diff --git a/gcc/testsuite/gcc.target/i386/pr87007-2.c b/gcc/testsuite/gcc.target/i386/pr87007-2.c
new file mode 100644
index 00000000000..cca7ae7afbc
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/pr87007-2.c
@@ -0,0 +1,18 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -march=skylake" } */
+
+extern float f;
+extern double d;
+extern int i;
+
+void
+foo (int n, int k)
+{
+  for (int i = 0; i != n; i++)
+    if (i < k)
+      d = f;
+    else
+      f = i;
+}
+
+/* { dg-final { scan-assembler-times "vxorps\[^\n\r\]*xmm\[0-9\]" 1 } } */
-- 
2.20.1