On Mon, Jan 28, 2019 at 9:08 AM H.J. Lu <hjl.to...@gmail.com> wrote:
>
> On Tue, Jan 22, 2019 at 5:28 AM H.J. Lu <hjl.to...@gmail.com> wrote:
> >
> > On Tue, Jan 22, 2019 at 4:08 AM Richard Biener
> > <richard.guent...@gmail.com> wrote:
> > >
> > > On Mon, Jan 21, 2019 at 10:27 PM H.J. Lu <hjl.to...@gmail.com> wrote:
> > > >
> > > > On Mon, Jan 21, 2019 at 10:54 AM Jeff Law <l...@redhat.com> wrote:
> > > > >
> > > > > On 1/7/19 6:55 AM, H.J. Lu wrote:
> > > > > > On Sun, Dec 30, 2018 at 8:40 AM H.J. Lu <hjl.to...@gmail.com> wrote:
> > > > > >> On Wed, Nov 28, 2018 at 12:17 PM Jeff Law <l...@redhat.com> wrote:
> > > > > >>> On 11/28/18 12:48 PM, H.J. Lu wrote:
> > > > > >>>> On Mon, Nov 5, 2018 at 7:29 AM Jan Hubicka <hubi...@ucw.cz> 
> > > > > >>>> wrote:
> > > > > >>>>>> On 11/5/18 7:21 AM, Jan Hubicka wrote:
> > > > > >>>>>>>> Did you mean "the nearest common dominator"?
> > > > > >>>>>>> If the nearest common dominator appears in the loop while all 
> > > > > >>>>>>> uses are
> > > > > >>>>>>> out of loops, this will result in suboptimal xor placement.
> > > > > >>>>>>> In this case you want to split edges out of the loop.
> > > > > >>>>>>>
> > > > > >>>>>>> In general this is what the LCM framework will do for you if
> > > > > >>>>>>> the problem is modelled in a similar way as in
> > > > > >>>>>>> mode_switching.  At function entry the mode is "no zero
> > > > > >>>>>>> register needed" and all conversions need the mode "zero
> > > > > >>>>>>> register needed".  Mode switching should then make the
> > > > > >>>>>>> correct placement decisions (reaching the minimal number of
> > > > > >>>>>>> xor executions).
> > > > > >>>>>>>
> > > > > >>>>>>> Jeff, what is your opinion on the approach taken by the
> > > > > >>>>>>> patch?  It seems like a special case of a more general
> > > > > >>>>>>> issue, but I do not see a very elegant way to solve it, at
> > > > > >>>>>>> least in the GCC 9 horizon, so if the placement is correct
> > > > > >>>>>>> we can probably go either with a new pass or with making
> > > > > >>>>>>> this part of mode switching (which is run by the x86
> > > > > >>>>>>> backend anyway).
> > > > > >>>>>> So I haven't followed this discussion at all, but I did
> > > > > >>>>>> touch on this issue a month or two ago with a target patch
> > > > > >>>>>> that was trying to avoid the partial stalls.
> > > > > >>>>>>
> > > > > >>>>>> My assumption is that we're trying to find one or more
> > > > > >>>>>> places to initialize the upper half of an AVX register so as
> > > > > >>>>>> to avoid a partial register stall at existing sites that set
> > > > > >>>>>> the upper half.
> > > > > >>>>>>
> > > > > >>>>>> This sounds like a classic PRE/LCM style problem (of which
> > > > > >>>>>> mode switching is just another variant).  A common-dominator
> > > > > >>>>>> approach is closer to a classic GCSE and is going to result
> > > > > >>>>>> in more initializations at sub-optimal points than a PRE/LCM
> > > > > >>>>>> style.
> > > > > >>>>> Yes, it is the usual code placement problem.  It is a
> > > > > >>>>> special case because the zero register is not modified by the
> > > > > >>>>> conversion (we just need to have a zero somewhere).  So
> > > > > >>>>> basically we have no kills of the zero except for the entry
> > > > > >>>>> block.
> > > > > >>>>>
> > > > > >>>> Do you have a testcase to show that the nearest common
> > > > > >>>> dominator in the loop, while all uses are out of loops, leads
> > > > > >>>> to suboptimal xor placement?
> > > > > >>> I don't have a testcase, but it's all but certain the nearest
> > > > > >>> common dominator is going to be a suboptimal placement.  That's
> > > > > >>> going to create paths where you emit the xor when it's not
> > > > > >>> used.
> > > > > >>>
> > > > > >>> The whole point of the LCM algorithms is they are optimal in 
> > > > > >>> terms of
> > > > > >>> expression evaluations.
> > > > > >> We tried LCM and it didn't work well for this case.  LCM places a 
> > > > > >> single
> > > > > >> VXOR close to the location where it is needed, which can be inside 
> > > > > >> a
> > > > > >> loop.  There is nothing wrong with the LCM algorithms.   But this 
> > > > > >> doesn't
> > > > > >> solve
> > > > > >>
> > > > > >> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=87007
> > > > > >>
> > > > > >> where VXOR is executed multiple times inside a function instead
> > > > > >> of just once.  We are investigating generating a single VXOR at
> > > > > >> the entry of the nearest dominator for basic blocks with SF/DF
> > > > > >> conversions, which is in the fake loop that contains the whole
> > > > > >> function:
> > > > > >>
> > > > > >>       bb = nearest_common_dominator_for_set (CDI_DOMINATORS,
> > > > > >>                                              convert_bbs);
> > > > > >>       while (bb->loop_father->latch
> > > > > >>              != EXIT_BLOCK_PTR_FOR_FN (cfun))
> > > > > >>         bb = get_immediate_dominator (CDI_DOMINATORS,
> > > > > >>                                       bb->loop_father->header);
> > > > > >>
> > > > > >>       insn = BB_HEAD (bb);
> > > > > >>       if (!NONDEBUG_INSN_P (insn))
> > > > > >>         insn = next_nonnote_nondebug_insn (insn);
> > > > > >>       set = gen_rtx_SET (v4sf_const0, CONST0_RTX (V4SFmode));
> > > > > >>       set_insn = emit_insn_before (set, insn);
> > > > > >>
> > > > > > Here is the updated patch.  OK for trunk?
> > > > > >
> > > > > > Thanks.
> > > > > >
> > > > > > -- H.J.
> > > > > >
> > > > > >
> > > > > > 0001-i386-Add-pass_remove_partial_avx_dependency.patch
> > > > > >
> > > > > > From 6eca7dbf282d7e2a5cde41bffeca66195d72d48e Mon Sep 17 00:00:00 
> > > > > > 2001
> > > > > > From: "H.J. Lu" <hjl.to...@gmail.com>
> > > > > > Date: Mon, 7 Jan 2019 05:44:59 -0800
> > > > > > Subject: [PATCH] i386: Add pass_remove_partial_avx_dependency
> > > > > >
> > > > > > With -mavx, for
> > > > > >
> > > > > > $ cat foo.i
> > > > > > extern float f;
> > > > > > extern double d;
> > > > > > extern int i;
> > > > > >
> > > > > > void
> > > > > > foo (void)
> > > > > > {
> > > > > >   d = f;
> > > > > >   f = i;
> > > > > > }
> > > > > >
> > > > > > we need to generate
> > > > > >
> > > > > >       vxorp[ds]       %xmmN, %xmmN, %xmmN
> > > > > >       ...
> > > > > >       vcvtss2sd       f(%rip), %xmmN, %xmmX
> > > > > >       ...
> > > > > >       vcvtsi2ss       i(%rip), %xmmN, %xmmY
> > > > > >
> > > > > > to avoid a partial XMM register stall.  This patch adds a pass
> > > > > > to generate a single
> > > > > >
> > > > > >       vxorps          %xmmN, %xmmN, %xmmN
> > > > > >
> > > > > > at entry of the nearest dominator for basic blocks with SF/DF 
> > > > > > conversions,
> > > > > > which is in the fake loop that contains the whole function, instead 
> > > > > > of
> > > > > > generating one
> > > > > >
> > > > > >       vxorp[ds]       %xmmN, %xmmN, %xmmN
> > > > > >
> > > > > > for each SF/DF conversion.
> > > > > >
> > > > > > NB: The LCM algorithm isn't appropriate here since it may place
> > > > > > a vxorps inside the loop.  A simple testcase shows this:
> > > > > >
> > > > > > $ cat badcase.c
> > > > > >
> > > > > > extern float f;
> > > > > > extern double d;
> > > > > >
> > > > > > void
> > > > > > foo (int n, int k)
> > > > > > {
> > > > > >   for (int j = 0; j != n; j++)
> > > > > >     if (j < k)
> > > > > >       d = f;
> > > > > > }
> > > > > >
> > > > > > It generates
> > > > > >
> > > > > >     ...
> > > > > >     loop:
> > > > > >       if(j < k)
> > > > > >         vxorps  %xmm0, %xmm0, %xmm0
> > > > > >         vcvtss2sd  %xmm1, %xmm0, %xmm0
> > > > > >       ...
> > > > > >     loopend
> > > > > >     ...
> > > > > >
> > > > > > This is because LCM only works when there is a certain benefit.
> > > > > > But for a conditional branch, LCM wouldn't move
> > > > > >
> > > > > >    vxorps  %xmm0, %xmm0, %xmm0
> > > > > It works this way for a reason.  There are obviously paths through the
> > > > > loop where the conversion does not happen and thus the vxor is not
> > > > > needed or desirable on those paths.
> > > > >
> > > > > That's a fundamental property of the LCM algorithm -- it never inserts
> > > > > an evaluation on a path through the CFG where it will not be used.
> > > > >
> > > > > Your algorithm of inserting into the dominator block will introduce
> > > > > runtime executions of the vxor on paths where it is not needed.
> > > > >
> > > > > It's well known that relaxing that property of LCM can result in 
> > > > > better
> > > > > code generation in some circumstances.  Block copying and loop
> > > > > restructuring are the gold standard for dealing with this kind of 
> > > > > problem.
> > > > >
> > > > > In this case you could split the iteration space so that you have
> > > > > two loops, one for 0..k and the other for k..n.  Note that GCC has
> > > > > support for this kind of loop restructuring.  This has the
> > > > > advantage that the j < k test does not happen on each iteration of
> > > > > the loop, and the vxor stuff via LCM would be optimal.
> > > > >
> > > > > There are many other cases where copying and restructuring result
> > > > > in better common subexpression elimination (which is what you're
> > > > > doing).  Probably the best work I've seen in this space is Bodik's
> > > > > thesis.  Click's work from '95 touches on some of this as well, but
> > > > > isn't as relevant to this specific instance.
> > > > >
> > > > > Anyway, whether or not the patch should move forward is really up to 
> > > > > Jan
> > > > > (and Uros if he wants to be involved) I think.  I'm not fundamentally
> > > > > opposed to HJ's approach as I'm aware of the different tradeoffs.
> > > > >
> > > > > HJ's approach of pulling into the dominator block can result in
> > > > > unnecessary evaluations.  But it can also reduce the number of
> > > > > evaluations in other cases.  It really depends on the runtime
> > > > > behavior of the code.  One could argue that the vxor stuff we're
> > > > > talking about is most likely happening in loops, and probably not
> > > > > in deeply nested control structures within those loops.  Thus
> > > > > pulling them out more aggressively a la LICM may be better than
> > > > > LCM.
> > > >
> > > > True, there is a trade-off.  My approach inserts a vxorps at the
> > > > last possible position.  Yes, the vxorps will always be executed
> > > > even when it may not be required.  But since it is executed only
> > > > once in all cases, it is a win overall.
> > >
> > > Hopefully a simple vpxor won't end up powering up the other AVX512
> > > unit if it lies dormant ...
> >
> > A 128-bit AVX vpxor won't touch AVX512.
> >
> > > And if we ever get to the state of having two separate ISAs in the
> > > same function then you'd need to make sure you can execute vpxor in
> > > the place you are inserting, since it may now be executed when it
> > > wasn't before (and I assume you already check that you do not zero
> > > the reg if there's a value live in it when the conditional def you
> > > are instrumenting is not executed).
> >
> > A dedicated pseudo register is allocated and zeroed for INT->FP and
> > FP->FP conversions.  IRA/LRA take care of the rest.
> >
>
> PING:
>
> https://gcc.gnu.org/ml/gcc-patches/2019-01/msg00298.html
>

Here is the updated patch, adjusted after the PR target/89071 fix.

OK for trunk?

Thanks.

-- 
H.J.
From 1c35abb368f26cc601e8badf22c8729156429251 Mon Sep 17 00:00:00 2001
From: "H.J. Lu" <hjl.tools@gmail.com>
Date: Mon, 7 Jan 2019 05:44:59 -0800
Subject: [PATCH] [8/9 Regression] i386: Add pass_remove_partial_avx_dependency

With -mavx, for

$ cat foo.i
extern float f;
extern double d;
extern int i;

void
foo (void)
{
  d = f;
  f = i;
}

we need to generate

	vxorp[ds]	%xmmN, %xmmN, %xmmN
	...
	vcvtss2sd	f(%rip), %xmmN, %xmmX
	...
	vcvtsi2ss	i(%rip), %xmmN, %xmmY

to avoid a partial XMM register stall.  This patch adds a pass to generate
a single

	vxorps		%xmmN, %xmmN, %xmmN

at entry of the nearest dominator for basic blocks with SF/DF conversions,
which is in the fake loop that contains the whole function, instead of
generating one

	vxorp[ds]	%xmmN, %xmmN, %xmmN

for each SF/DF conversion.

NB: The LCM algorithm isn't appropriate here since it may place a vxorps
inside the loop.  A simple testcase shows this:

$ cat badcase.c

extern float f;
extern double d;

void
foo (int n, int k)
{
  for (int j = 0; j != n; j++)
    if (j < k)
      d = f;
}

It generates

    ...
    loop:
      if(j < k)
        vxorps    %xmm0, %xmm0, %xmm0
        vcvtss2sd f(%rip), %xmm0, %xmm0
      ...
    loopend
    ...

This is because LCM only works when there is a certain benefit.  But for
a conditional branch, LCM wouldn't move

   vxorps  %xmm0, %xmm0, %xmm0

out of loop.  SPEC CPU 2017 on Intel Xeon with AVX512 shows:

1. The nearest dominator

|RATE			|Improvement|
|500.perlbench_r	| 0.55%	|
|538.imagick_r		| 8.43%	|
|544.nab_r		| 0.71%	|

2. LCM

|RATE			|Improvement|
|500.perlbench_r	| -0.76% |
|538.imagick_r		| 7.96%  |
|544.nab_r		| -0.13% |

Performance impacts of SPEC CPU 2017 rate on Intel Xeon with AVX512
using

-Ofast -flto -march=skylake-avx512 -funroll-loops

before

commit e739972ad6ad05e32a1dd5c29c0b950a4c4bd576
Author: uros <uros@138bc75d-0d04-0410-961f-82ee72b054a4>
Date:   Thu Jan 31 20:06:42 2019 +0000

            PR target/89071
            * config/i386/i386.md (*extendsfdf2): Split out reg->reg
            alternative to avoid partial SSE register stall for TARGET_AVX.
            (truncdfsf2): Ditto.
            (sse4_1_round<mode>2): Ditto.

    git-svn-id: svn+ssh://gcc.gnu.org/svn/gcc/trunk@268427 138bc75d-0d04-0410-961f-82ee72b054a4

are:

|INT RATE		|Improvement|
|500.perlbench_r	| 0.55%	|
|502.gcc_r		| 0.14%	|
|505.mcf_r		| 0.08%	|
|523.xalancbmk_r	| 0.18%	|
|525.x264_r		|-0.49%	|
|531.deepsjeng_r	|-0.04%	|
|541.leela_r		|-0.26%	|
|548.exchange2_r	|-0.3%	|
|557.xz_r		|BuildSame|

|FP RATE		|Improvement|
|503.bwaves_r	        |-0.29% |
|507.cactuBSSN_r	| 0.04%	|
|508.namd_r		|-0.74%	|
|510.parest_r		|-0.01%	|
|511.povray_r		| 2.23%	|
|519.lbm_r		| 0.1%	|
|521.wrf_r		| 0.49%	|
|526.blender_r		| 0.13%	|
|527.cam4_r		| 0.65%	|
|538.imagick_r		| 8.43%	|
|544.nab_r		| 0.71%	|
|549.fotonik3d_r	| 0.15%	|
|554.roms_r		| 0.08%	|

After commit e739972ad6ad05e32a1dd5c29c0b950a4c4bd576, on Skylake client,
impacts on 538.imagick_r with

-fno-unsafe-math-optimizations -march=native -Ofast -funroll-loops -flto

1. Size comparison:

before:

   text	   data	    bss	    dec	    hex	filename
2465633	   8352	   4528	2478513	 25d1b1 imagick_r

after:

   text	   data	    bss	    dec	    hex	filename
2447145	   8352	   4528	2460025	 258979 imagick_r

2. Number of vxorps:

before		after		difference
6890            5311            -29.73%

3. Performance improvement:

|RATE			|Improvement|
|538.imagick_r		| 4.87%  |

gcc/

2019-02-01  H.J. Lu  <hongjiu.lu@intel.com>
	    Hongtao Liu  <hongtao.liu@intel.com>
	    Sunil K Pandey  <sunil.k.pandey@intel.com>

	PR target/87007
	* config/i386/i386-passes.def: Add
	pass_remove_partial_avx_dependency.
	* config/i386/i386-protos.h
	(make_pass_remove_partial_avx_dependency): New.
	* config/i386/i386.c
	(pass_data_remove_partial_avx_dependency): New.
	(pass_remove_partial_avx_dependency): Likewise.
	(make_pass_remove_partial_avx_dependency): Likewise.
	* config/i386/i386.md (partial_xmm_update): New attribute.
	(*extendsfdf2): Add partial_xmm_update.
	(truncdfsf2): Likewise.
	(*float<SWI48:mode><MODEF:mode>2): Likewise.
	(SF/DF conversion splitters): Disable for TARGET_AVX.

gcc/testsuite/

2019-02-01  H.J. Lu  <hongjiu.lu@intel.com>
	    Hongtao Liu  <hongtao.liu@intel.com>
	    Sunil K Pandey  <sunil.k.pandey@intel.com>

	PR target/87007
	* gcc.target/i386/pr87007-1.c: New test.
	* gcc.target/i386/pr87007-2.c: Likewise.
---
 gcc/config/i386/i386-passes.def           |   2 +
 gcc/config/i386/i386-protos.h             |   2 +
 gcc/config/i386/i386.c                    | 174 ++++++++++++++++++++++
 gcc/config/i386/i386.md                   |  16 +-
 gcc/testsuite/gcc.target/i386/pr87007-1.c |  15 ++
 gcc/testsuite/gcc.target/i386/pr87007-2.c |  18 +++
 6 files changed, 224 insertions(+), 3 deletions(-)
 create mode 100644 gcc/testsuite/gcc.target/i386/pr87007-1.c
 create mode 100644 gcc/testsuite/gcc.target/i386/pr87007-2.c

diff --git a/gcc/config/i386/i386-passes.def b/gcc/config/i386/i386-passes.def
index 87cfd94b8f6..f4facdc65d4 100644
--- a/gcc/config/i386/i386-passes.def
+++ b/gcc/config/i386/i386-passes.def
@@ -31,3 +31,5 @@ along with GCC; see the file COPYING3.  If not see
   INSERT_PASS_BEFORE (pass_cse2, 1, pass_stv, true /* timode_p */);
 
   INSERT_PASS_BEFORE (pass_shorten_branches, 1, pass_insert_endbranch);
+
+  INSERT_PASS_AFTER (pass_combine, 1, pass_remove_partial_avx_dependency);
diff --git a/gcc/config/i386/i386-protos.h b/gcc/config/i386/i386-protos.h
index 2d600173917..83645e89a81 100644
--- a/gcc/config/i386/i386-protos.h
+++ b/gcc/config/i386/i386-protos.h
@@ -369,3 +369,5 @@ class rtl_opt_pass;
 extern rtl_opt_pass *make_pass_insert_vzeroupper (gcc::context *);
 extern rtl_opt_pass *make_pass_stv (gcc::context *);
 extern rtl_opt_pass *make_pass_insert_endbranch (gcc::context *);
+extern rtl_opt_pass *make_pass_remove_partial_avx_dependency
+  (gcc::context *);
diff --git a/gcc/config/i386/i386.c b/gcc/config/i386/i386.c
index 4e67abe8764..b8e39176c6a 100644
--- a/gcc/config/i386/i386.c
+++ b/gcc/config/i386/i386.c
@@ -2793,6 +2793,180 @@ make_pass_insert_endbranch (gcc::context *ctxt)
   return new pass_insert_endbranch (ctxt);
 }
 
+/* At entry of the nearest common dominator for basic blocks with
+   conversions, generate a single
+	vxorps %xmmN, %xmmN, %xmmN
+   for all
+	vcvtss2sd  op, %xmmN, %xmmX
+	vcvtsd2ss  op, %xmmN, %xmmX
+	vcvtsi2ss  op, %xmmN, %xmmX
+	vcvtsi2sd  op, %xmmN, %xmmX
+
+   NB: We want to generate only a single vxorps to cover the whole
+   function.  The LCM algorithm isn't appropriate here since it may
+   place a vxorps inside the loop.  */
+
+static unsigned int
+remove_partial_avx_dependency (void)
+{
+  timevar_push (TV_MACH_DEP);
+
+  calculate_dominance_info (CDI_DOMINATORS);
+  df_set_flags (DF_DEFER_INSN_RESCAN);
+  df_chain_add_problem (DF_DU_CHAIN | DF_UD_CHAIN);
+  df_md_add_problem ();
+  df_analyze ();
+
+  bitmap_obstack_initialize (NULL);
+  bitmap convert_bbs = BITMAP_ALLOC (NULL);
+
+  basic_block bb;
+  rtx_insn *insn, *set_insn;
+  rtx set;
+  rtx v4sf_const0 = NULL_RTX;
+
+  FOR_EACH_BB_FN (bb, cfun)
+    {
+      FOR_BB_INSNS (bb, insn)
+	{
+	  if (!NONDEBUG_INSN_P (insn))
+	    continue;
+
+	  set = single_set (insn);
+	  if (!set)
+	    continue;
+
+	  if (get_attr_partial_xmm_update (insn)
+	      != PARTIAL_XMM_UPDATE_TRUE)
+	    continue;
+
+	  if (!v4sf_const0)
+	    v4sf_const0 = gen_reg_rtx (V4SFmode);
+
+	  /* Convert PARTIAL_XMM_UPDATE_TRUE insns, DF -> SF, SF -> DF,
+	     SI -> SF, SI -> DF, DI -> SF, DI -> DF, to vec_dup and
+	     vec_merge with subreg.  */
+	  rtx src = SET_SRC (set);
+	  rtx dest = SET_DEST (set);
+	  machine_mode dest_mode = GET_MODE (dest);
+
+	  rtx zero;
+	  machine_mode dest_vecmode;
+	  if (dest_mode == E_SFmode)
+	    {
+	      dest_vecmode = V4SFmode;
+	      zero = v4sf_const0;
+	    }
+	  else
+	    {
+	      dest_vecmode = V2DFmode;
+	      zero = gen_rtx_SUBREG (V2DFmode, v4sf_const0, 0);
+	    }
+
+	  /* Change source to vector mode.  */
+	  src = gen_rtx_VEC_DUPLICATE (dest_vecmode, src);
+	  src = gen_rtx_VEC_MERGE (dest_vecmode, src, zero,
+				   GEN_INT (HOST_WIDE_INT_1U));
+	  /* Change destination to vector mode.  */
+	  rtx vec = gen_reg_rtx (dest_vecmode);
+	  /* Generate an XMM vector SET.  */
+	  set = gen_rtx_SET (vec, src);
+	  set_insn = emit_insn_before (set, insn);
+	  df_insn_rescan (set_insn);
+
+	  src = gen_rtx_SUBREG (dest_mode, vec, 0);
+	  set = gen_rtx_SET (dest, src);
+
+	  /* Drop possible dead definitions.  */
+	  PATTERN (insn) = set;
+
+	  INSN_CODE (insn) = -1;
+	  recog_memoized (insn);
+	  df_insn_rescan (insn);
+	  bitmap_set_bit (convert_bbs, bb->index);
+	}
+    }
+
+  if (v4sf_const0)
+    {
+      /* (Re-)discover loops so that bb->loop_father can be used in the
+	 analysis below.  */
+      loop_optimizer_init (AVOID_CFG_MODIFICATIONS);
+
+      /* Generate a vxorps at entry of the nearest dominator for basic
+	 blocks with conversions, which is in the fake loop that
+	 contains the whole function, so that there is only a single
+	 vxorps in the whole function.  */
+      bb = nearest_common_dominator_for_set (CDI_DOMINATORS,
+					     convert_bbs);
+      while (bb->loop_father->latch
+	     != EXIT_BLOCK_PTR_FOR_FN (cfun))
+	bb = get_immediate_dominator (CDI_DOMINATORS,
+				      bb->loop_father->header);
+
+      insn = BB_HEAD (bb);
+      if (!NONDEBUG_INSN_P (insn))
+	insn = next_nonnote_nondebug_insn (insn);
+      set = gen_rtx_SET (v4sf_const0, CONST0_RTX (V4SFmode));
+      set_insn = emit_insn_before (set, insn);
+      df_insn_rescan (set_insn);
+      df_process_deferred_rescans ();
+      loop_optimizer_finalize ();
+    }
+
+  bitmap_obstack_release (NULL);
+  BITMAP_FREE (convert_bbs);
+
+  timevar_pop (TV_MACH_DEP);
+  return 0;
+}
+
+namespace {
+
+const pass_data pass_data_remove_partial_avx_dependency =
+{
+  RTL_PASS, /* type */
+  "rpad", /* name */
+  OPTGROUP_NONE, /* optinfo_flags */
+  TV_MACH_DEP, /* tv_id */
+  0, /* properties_required */
+  0, /* properties_provided */
+  0, /* properties_destroyed */
+  0, /* todo_flags_start */
+  TODO_df_finish, /* todo_flags_finish */
+};
+
+class pass_remove_partial_avx_dependency : public rtl_opt_pass
+{
+public:
+  pass_remove_partial_avx_dependency (gcc::context *ctxt)
+    : rtl_opt_pass (pass_data_remove_partial_avx_dependency, ctxt)
+  {}
+
+  /* opt_pass methods: */
+  virtual bool gate (function *)
+    {
+      return (TARGET_AVX
+	      && TARGET_SSE_PARTIAL_REG_DEPENDENCY
+	      && TARGET_SSE_MATH
+	      && optimize
+	      && optimize_function_for_speed_p (cfun));
+    }
+
+  virtual unsigned int execute (function *)
+    {
+      return remove_partial_avx_dependency ();
+    }
+}; // class pass_rpad
+
+} // anon namespace
+
+rtl_opt_pass *
+make_pass_remove_partial_avx_dependency (gcc::context *ctxt)
+{
+  return new pass_remove_partial_avx_dependency (ctxt);
+}
+
 /* Return true if a red-zone is in use.  We can't use red-zone when
    there are local indirect jumps, like "indirect_jump" or "tablejump",
    which jumps to another place in the function, since "call" in the
diff --git a/gcc/config/i386/i386.md b/gcc/config/i386/i386.md
index 744f155fca6..f589bbe6e68 100644
--- a/gcc/config/i386/i386.md
+++ b/gcc/config/i386/i386.md
@@ -778,6 +778,10 @@
 (define_attr "i387_cw" "trunc,floor,ceil,uninitialized,any"
   (const_string "any"))
 
+;; Define attribute to indicate insns with partial XMM register update.
+(define_attr "partial_xmm_update" "false,true"
+  (const_string "false"))
+
 ;; Define attribute to classify add/sub insns that consumes carry flag (CF)
 (define_attr "use_carry" "0,1" (const_string "0"))
 
@@ -4391,6 +4395,7 @@
     }
 }
   [(set_attr "type" "fmov,fmov,ssecvt,ssecvt")
+   (set_attr "partial_xmm_update" "false,false,false,true")
    (set_attr "prefix" "orig,orig,maybe_vex,maybe_vex")
    (set_attr "mode" "SF,XF,DF,DF")
    (set (attr "enabled")
@@ -4480,7 +4485,8 @@
   [(set (match_operand:DF 0 "sse_reg_operand")
         (float_extend:DF
           (match_operand:SF 1 "nonimmediate_operand")))]
-  "TARGET_SSE_PARTIAL_REG_DEPENDENCY && epilogue_completed
+  "!TARGET_AVX
+   && TARGET_SSE_PARTIAL_REG_DEPENDENCY && epilogue_completed
    && optimize_function_for_speed_p (cfun)
    && (!REG_P (operands[1])
        || (!TARGET_AVX && REGNO (operands[0]) != REGNO (operands[1])))
@@ -4557,6 +4563,7 @@
     }
 }
   [(set_attr "type" "fmov,fmov,ssecvt,ssecvt")
+   (set_attr "partial_xmm_update" "false,false,false,true")
    (set_attr "mode" "SF")
    (set (attr "enabled")
      (if_then_else
@@ -4640,7 +4647,8 @@
   [(set (match_operand:SF 0 "sse_reg_operand")
         (float_truncate:SF
 	  (match_operand:DF 1 "nonimmediate_operand")))]
-  "TARGET_SSE_PARTIAL_REG_DEPENDENCY && epilogue_completed
+  "!TARGET_AVX
+   && TARGET_SSE_PARTIAL_REG_DEPENDENCY && epilogue_completed
    && optimize_function_for_speed_p (cfun)
    && (!REG_P (operands[1])
        || (!TARGET_AVX && REGNO (operands[0]) != REGNO (operands[1])))
@@ -5016,6 +5024,7 @@
    %vcvtsi2<MODEF:ssemodesuffix><SWI48:rex64suffix>\t{%1, %d0|%d0, %1}
    %vcvtsi2<MODEF:ssemodesuffix><SWI48:rex64suffix>\t{%1, %d0|%d0, %1}"
   [(set_attr "type" "fmov,sseicvt,sseicvt")
+   (set_attr "partial_xmm_update" "false,true,true")
    (set_attr "prefix" "orig,maybe_vex,maybe_vex")
    (set_attr "mode" "<MODEF:MODE>")
    (set (attr "prefix_rex")
@@ -5144,7 +5153,8 @@
 (define_split
   [(set (match_operand:MODEF 0 "sse_reg_operand")
 	(float:MODEF (match_operand:SWI48 1 "nonimmediate_operand")))]
-  "TARGET_SSE_PARTIAL_REG_DEPENDENCY && epilogue_completed
+  "!TARGET_AVX
+   && TARGET_SSE_PARTIAL_REG_DEPENDENCY && epilogue_completed
    && optimize_function_for_speed_p (cfun)
    && (!EXT_REX_SSE_REG_P (operands[0])
        || TARGET_AVX512VL)"
diff --git a/gcc/testsuite/gcc.target/i386/pr87007-1.c b/gcc/testsuite/gcc.target/i386/pr87007-1.c
new file mode 100644
index 00000000000..93cf1dcdfa5
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/pr87007-1.c
@@ -0,0 +1,15 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -march=skylake" } */
+
+extern float f;
+extern double d;
+extern int i;
+
+void
+foo (void)
+{
+  d = f;
+  f = i;
+}
+
+/* { dg-final { scan-assembler-times "vxorps\[^\n\r\]*xmm\[0-9\]" 1 } } */
diff --git a/gcc/testsuite/gcc.target/i386/pr87007-2.c b/gcc/testsuite/gcc.target/i386/pr87007-2.c
new file mode 100644
index 00000000000..cca7ae7afbc
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/pr87007-2.c
@@ -0,0 +1,18 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -march=skylake" } */
+
+extern float f;
+extern double d;
+extern int i;
+
+void
+foo (int n, int k)
+{
+  for (int i = 0; i != n; i++)
+    if(i < k)
+      d = f;
+    else
+      f = i;
+}
+
+/* { dg-final { scan-assembler-times "vxorps\[^\n\r\]*xmm\[0-9\]" 1 } } */
-- 
2.20.1
