On Sun, Oct 6, 2019 at 4:32 PM Richard Sandiford <richard.sandif...@arm.com> wrote:
>
> Uros Bizjak <ubiz...@gmail.com> writes:
> >>>> This caused:
> >>>>
> >>>> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=91994
> >>
> >> Thanks for reducing & tracking down the underlying cause.
> >>
> >>> This change doesn't work with -mzeroupper. When -mzeroupper is used,
> >>> upper bits of vector registers are clobbered upon callee return if any
> >>> MM/ZMM registers are used in callee. Even if YMM7 isn't used, upper
> >>> bits of YMM7 can still be clobbered by vzeroupper when YMM1 is used.
> >>
> >> The problem here really is that the pattern is just:
> >>
> >> (define_insn "avx_vzeroupper"
> >>   [(unspec_volatile [(const_int 0)] UNSPECV_VZEROUPPER)]
> >>   "TARGET_AVX"
> >>   "vzeroupper"
> >>   ...)
> >>
> >> and so its effect on the registers isn't modelled at all in rtl.
> >> Maybe one option would be to add a parallel:
> >>
> >>   (set (reg:V2DI N) (reg:V2DI N))
> >>
> >> for each register. Or we could do something like I did for the SVE
> >> tlsdesc calls, although here that would mean using a call pattern for
> >> something that isn't really a call. Or we could reinstate clobber_high
> >> and use that, but that's very much third out of three.
> >>
> >> I don't think we should add target hooks to get around this, since that's
> >> IMO papering over the issue.
> >>
> >> I'll try the parallel set thing first.
> >
> > Please note that vzeroupper insertion pass runs after register
> > allocation, so in effect vzeroupper pattern is hidden to the register
> > allocator.
>
> Right, but even post-RA passes rely on the register usage being accurate.
> Same for collect_fn_hard_reg_usage, which is the issue here.
>
> The info collected by collect_fn_hard_reg_usage was always wrong for
> vzeroupper. What changed with my patch is that we now use that info
> for partly call-clobbered registers as well as "normally" clobbered
> registers. So this is another instance of a problem that was previously
> being masked by having ix86_hard_regno_call_part_clobbered enforce Win64
> rules for all ABIs.
>
> My first idea of adding:
>
>   (set (reg:V2DI N) (reg:V2DI N))
>
> for all clobbered registers didn't work well because it left previously-
> dead registers upwards exposed (obvious in hindsight). And the second
> idea of using a fake call would require too many "is this really a call?"
> hacks.
>
> So in the end I went for a subpass that chooses between:
>
>   (set (reg:V2DI N) (reg:V2DI N))
>
> and
>
>   (clobber (reg:V2DI N))
>
> depending on whether register N is live or not. This fixes the testcase
> and doesn't seem to regress code quality for the tests I've tried.
>
> Tested on x86_64-linux-gnu. OK to install?
>
> Richard
>
>
> 2019-10-06  Richard Sandiford  <richard.sandif...@arm.com>
>
> gcc/
>         PR target/91994
>         * config/i386/sse.md (avx_vzeroupper): Turn into a define_expand
>         and wrap the unspec_volatile in a parallel.
>         (*avx_vzeroupper): New define_insn. Use a match_parallel around
>         the unspec_volatile.
>         * config/i386/predicates.md (vzeroupper_pattern): Expect the
>         unspec_volatile to be wrapped in a parallel.
>         * config/i386/i386-features.c (ix86_add_reg_usage_to_vzeroupper)
>         (ix86_add_reg_usage_to_vzerouppers): New functions.
>         (rest_of_handle_insert_vzeroupper): Use them to add register
>         usage information to the vzeroupper instructions.
>
> gcc/testsuite/
>         PR target/91994
>         * gcc.target/i386/pr91994.c: New test.
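
For reference, this is roughly what an annotated vzeroupper looks like once the
new subpass has run. A sketch only (shown for -m32, where there are eight SSE
registers, with hard register numbers omitted and the clobbers of xmm2 through
xmm6 elided), assuming that only xmm0 is live across the instruction:

  (parallel [(unspec_volatile [(const_int 0)] UNSPECV_VZEROUPPER)
             (set (reg:V2DI xmm0) (reg:V2DI xmm0))
             (clobber (reg:V2DI xmm1))
             ...
             (clobber (reg:V2DI xmm7))])

So the low 128 bits of live registers are modelled as preserved and everything
else as clobbered, which is what collect_fn_hard_reg_usage and the other
post-RA consumers of register usage information need to see.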

LGTM.

Thanks,
Uros.

> Index: gcc/config/i386/sse.md
> ===================================================================
> --- gcc/config/i386/sse.md      2019-09-17 15:27:10.214075253 +0100
> +++ gcc/config/i386/sse.md      2019-10-06 15:19:10.062769500 +0100
> @@ -19622,9 +19622,16 @@ (define_insn "*avx_vzeroall"
>     (set_attr "mode" "OI")])
>
>  ;; Clear the upper 128bits of AVX registers, equivalent to a NOP
> -;; if the upper 128bits are unused.
> -(define_insn "avx_vzeroupper"
> -  [(unspec_volatile [(const_int 0)] UNSPECV_VZEROUPPER)]
> +;; if the upper 128bits are unused. Initially we expand the instructions
> +;; as though they had no effect on the SSE registers, but later add SETs and
> +;; CLOBBERs to the PARALLEL to model the real effect.
> +(define_expand "avx_vzeroupper"
> +  [(parallel [(unspec_volatile [(const_int 0)] UNSPECV_VZEROUPPER)])]
> +  "TARGET_AVX")
> +
> +(define_insn "*avx_vzeroupper"
> +  [(match_parallel 0 "vzeroupper_pattern"
> +     [(unspec_volatile [(const_int 0)] UNSPECV_VZEROUPPER)])]
>    "TARGET_AVX"
>    "vzeroupper"
>    [(set_attr "type" "sse")
> Index: gcc/config/i386/predicates.md
> ===================================================================
> --- gcc/config/i386/predicates.md       2019-09-10 19:56:45.337178032 +0100
> +++ gcc/config/i386/predicates.md       2019-10-06 15:19:10.054769556 +0100
> @@ -1441,8 +1441,9 @@ (define_predicate "vzeroall_pattern"
>
>  ;; return true if OP is a vzeroupper pattern.
>  (define_predicate "vzeroupper_pattern"
> -  (and (match_code "unspec_volatile")
> -       (match_test "XINT (op, 1) == UNSPECV_VZEROUPPER")))
> +  (and (match_code "parallel")
> +       (match_code "unspec_volatile" "a")
> +       (match_test "XINT (XVECEXP (op, 0, 0), 1) == UNSPECV_VZEROUPPER")))
>
>  ;; Return true if OP is an addsub vec_merge operation
>  (define_predicate "addsub_vm_operator"
> Index: gcc/config/i386/i386-features.c
> ===================================================================
> --- gcc/config/i386/i386-features.c     2019-09-21 13:56:08.895934718 +0100
> +++ gcc/config/i386/i386-features.c     2019-10-06 15:19:10.054769556 +0100
> @@ -1757,6 +1757,68 @@ convert_scalars_to_vector (bool timode_p
>    return 0;
>  }
>
> +/* Modify the vzeroupper pattern in INSN so that it describes the effect
> +   that the instruction has on the SSE registers. LIVE_REGS are the set
> +   of registers that are live across the instruction.
> +
> +   For a live register R we use:
> +
> +     (set (reg:V2DF R) (reg:V2DF R))
> +
> +   which preserves the low 128 bits but clobbers the upper bits.
> +   For a dead register we just use:
> +
> +     (clobber (reg:V2DF R))
> +
> +   which invalidates any previous contents of R and stops R from becoming
> +   live across the vzeroupper in future. */
> +
> +static void
> +ix86_add_reg_usage_to_vzeroupper (rtx_insn *insn, bitmap live_regs)
> +{
> +  rtx pattern = PATTERN (insn);
> +  unsigned int nregs = TARGET_64BIT ? 16 : 8;
> +  rtvec vec = rtvec_alloc (nregs + 1);
> +  RTVEC_ELT (vec, 0) = XVECEXP (pattern, 0, 0);
> +  for (unsigned int i = 0; i < nregs; ++i)
> +    {
> +      unsigned int regno = GET_SSE_REGNO (i);
> +      rtx reg = gen_rtx_REG (V2DImode, regno);
> +      if (bitmap_bit_p (live_regs, regno))
> +        RTVEC_ELT (vec, i + 1) = gen_rtx_SET (reg, reg);
> +      else
> +        RTVEC_ELT (vec, i + 1) = gen_rtx_CLOBBER (VOIDmode, reg);
> +    }
> +  XVEC (pattern, 0) = vec;
> +  df_insn_rescan (insn);
> +}
> +
> +/* Walk the vzeroupper instructions in the function and annotate them
> +   with the effect that they have on the SSE registers.  */
> +
> +static void
> +ix86_add_reg_usage_to_vzerouppers (void)
> +{
> +  basic_block bb;
> +  rtx_insn *insn;
> +  auto_bitmap live_regs;
> +
> +  df_analyze ();
> +  FOR_EACH_BB_FN (bb, cfun)
> +    {
> +      bitmap_copy (live_regs, df_get_live_out (bb));
> +      df_simulate_initialize_backwards (bb, live_regs);
> +      FOR_BB_INSNS_REVERSE (bb, insn)
> +        {
> +          if (!NONDEBUG_INSN_P (insn))
> +            continue;
> +          if (vzeroupper_pattern (PATTERN (insn), VOIDmode))
> +            ix86_add_reg_usage_to_vzeroupper (insn, live_regs);
> +          df_simulate_one_insn_backwards (bb, insn, live_regs);
> +        }
> +    }
> +}
> +
>  static unsigned int
>  rest_of_handle_insert_vzeroupper (void)
>  {
> @@ -1773,6 +1835,7 @@ rest_of_handle_insert_vzeroupper (void)
>
>    /* Call optimize_mode_switching. */
>    g->get_passes ()->execute_pass_mode_switching ();
> +  ix86_add_reg_usage_to_vzerouppers ();
>    return 0;
>  }
>
> Index: gcc/testsuite/gcc.target/i386/pr91994.c
> ===================================================================
> --- /dev/null   2019-09-17 11:41:18.176664108 +0100
> +++ gcc/testsuite/gcc.target/i386/pr91994.c     2019-10-06 15:19:10.062769500 +0100
> @@ -0,0 +1,35 @@
> +/* { dg-do run } */
> +/* { dg-require-effective-target avx } */
> +/* { dg-options "-O2 -mavx -mvzeroupper" } */
> +
> +#include "avx-check.h"
> +
> +#include <immintrin.h>
> +
> +__m256i x1, x2, x3;
> +
> +__attribute__ ((noinline))
> +static void
> +foo (void)
> +{
> +  x1 = x2;
> +}
> +
> +void
> +bar (void)
> +{
> +  __m256i x = x1;
> +  foo ();
> +  x3 = x;
> +}
> +
> +__attribute__ ((noinline))
> +void
> +avx_test (void)
> +{
> +  __m256i x = _mm256_set1_epi8 (3);
> +  x1 = x;
> +  bar ();
> +  if (__builtin_memcmp (&x3, &x, sizeof (x)))
> +    __builtin_abort ();
> +}
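
For completeness, a rough sketch of the code the bug produces for bar () in the
new test (AT&T syntax; the register choices and addressing are illustrative
only, the real allocation may differ):

  bar:
          vmovdqa x1(%rip), %ymm1   # x = x1
          call    foo               # foo copies x2 to x1 in another %ymm
                                    # register and ends with vzeroupper, which
                                    # zeroes the upper 128 bits of *every*
                                    # %ymm register
          vmovdqa %ymm1, x3(%rip)   # x3 = x: with the stale usage info, bar
                                    # assumes the upper half of %ymm1 survived
                                    # the call (epilogue omitted)

Note that bar cannot simply reload x from x1 after the call, because foo has
overwritten x1. Once the SETs and CLOBBERs are attached to the vzeroupper
pattern, bar has to keep the full 256-bit value of x alive across the call some
other way (e.g. by spilling it to the stack), and avx_test's memcmp passes.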