On Sun, Oct 6, 2019 at 4:32 PM Richard Sandiford <richard.sandif...@arm.com> wrote:
>
> Uros Bizjak <ubiz...@gmail.com> writes:
> >>>> This caused:
> >>>>
> >>>> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=91994
> >>
> >> Thanks for reducing & tracking down the underlying cause.
> >>
> >>> This change doesn't work with -mzeroupper. When -mzeroupper is used,
> >>> upper bits of vector registers are clobbered upon callee return if any
> >>> MM/ZMM registers are used in callee. Even if YMM7 isn't used, upper
> >>> bits of YMM7 can still be clobbered by vzeroupper when YMM1 is used.
> >>
> >> The problem here really is that the pattern is just:
> >>
> >> (define_insn "avx_vzeroupper"
> >>   [(unspec_volatile [(const_int 0)] UNSPECV_VZEROUPPER)]
> >>   "TARGET_AVX"
> >>   "vzeroupper"
> >>   ...)
> >>
> >> and so its effect on the registers isn't modelled at all in rtl.
> >> Maybe one option would be to add a parallel:
> >>
> >>   (set (reg:V2DI N) (reg:V2DI N))
> >>
> >> for each register. Or we could do something like I did for the SVE
> >> tlsdesc calls, although here that would mean using a call pattern for
> >> something that isn't really a call. Or we could reinstate clobber_high
> >> and use that, but that's very much third out of three.
> >>
> >> I don't think we should add target hooks to get around this, since that's
> >> IMO papering over the issue.
> >>
> >> I'll try the parallel set thing first.
> >
> > Please note that vzeroupper insertion pass runs after register
> > allocation, so in effect vzeroupper pattern is hidden to the register
> > allocator.
>
> Right, but even post-RA passes rely on the register usage being accurate.
> Same for collect_fn_hard_reg_usage, which is the issue here.
>
> The info collected by collect_fn_hard_reg_usage was always wrong for
> vzeroupper. What changed with my patch is that we now use that info
> for partly call-clobbered registers as well as "normally" clobbered
> registers. So this is another instance of a problem that was previously
> being masked by having ix86_hard_regno_call_part_clobbered enforce Win64
> rules for all ABIs.
>
> My first idea of adding:
>
>   (set (reg:V2DI N) (reg:V2DI N))
>
> for all clobbered registers didn't work well because it left previously-
> dead registers upwards exposed (obvious in hindsight). And the second
> idea of using a fake call would require too many "is this really a call?"
> hacks.
>
> So in the end I went for a subpass that chooses between:
>
>   (set (reg:V2DI N) (reg:V2DI N))
>
> and
>
>   (clobber (reg:V2DI N))
>
> depending on whether register N is live or not. This fixes the testcase
> and doesn't seem to regress code quality for the tests I've tried.
>
> Tested on x86_64-linux-gnu. OK to install?
>
> Richard
>
>
> 2019-10-06  Richard Sandiford  <richard.sandif...@arm.com>
>
> gcc/
>         PR target/91994
>         * config/i386/sse.md (avx_vzeroupper): Turn into a define_expand
>         and wrap the unspec_volatile in a parallel.
>         (*avx_vzeroupper): New define_insn. Use a match_parallel around
>         the unspec_volatile.
>         * config/i386/predicates.md (vzeroupper_pattern): Expect the
>         unspec_volatile to be wrapped in a parallel.
>         * config/i386/i386-features.c (ix86_add_reg_usage_to_vzeroupper)
>         (ix86_add_reg_usage_to_vzerouppers): New functions.
>         (rest_of_handle_insert_vzeroupper): Use them to add register
>         usage information to the vzeroupper instructions.
>
> gcc/testsuite/
>         PR target/91994
>         * gcc.target/i386/pr91994.c: New test.
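
For reference, this is roughly what an annotated vzeroupper looks like once the
new subpass has run. A sketch only (shown for -m32, where there are eight SSE
registers, with hard register numbers omitted and the clobbers of xmm2 through
xmm6 elided), assuming that only xmm0 is live across the instruction:

  (parallel [(unspec_volatile [(const_int 0)] UNSPECV_VZEROUPPER)
             (set (reg:V2DI xmm0) (reg:V2DI xmm0))
             (clobber (reg:V2DI xmm1))
             ...
             (clobber (reg:V2DI xmm7))])

So the low 128 bits of live registers are modelled as preserved and everything
else as clobbered, which is what collect_fn_hard_reg_usage and the other
post-RA consumers of register usage information need to see.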

LGTM.

Thanks,
Uros.

> Index: gcc/config/i386/sse.md
> ===================================================================
> --- gcc/config/i386/sse.md      2019-09-17 15:27:10.214075253 +0100
> +++ gcc/config/i386/sse.md      2019-10-06 15:19:10.062769500 +0100
> @@ -19622,9 +19622,16 @@ (define_insn "*avx_vzeroall"
>     (set_attr "mode" "OI")])
>
>  ;; Clear the upper 128bits of AVX registers, equivalent to a NOP
> -;; if the upper 128bits are unused.
> -(define_insn "avx_vzeroupper"
> -  [(unspec_volatile [(const_int 0)] UNSPECV_VZEROUPPER)]
> +;; if the upper 128bits are unused. Initially we expand the instructions
> +;; as though they had no effect on the SSE registers, but later add SETs and
> +;; CLOBBERs to the PARALLEL to model the real effect.
> +(define_expand "avx_vzeroupper"
> +  [(parallel [(unspec_volatile [(const_int 0)] UNSPECV_VZEROUPPER)])]
> +  "TARGET_AVX")
> +
> +(define_insn "*avx_vzeroupper"
> +  [(match_parallel 0 "vzeroupper_pattern"
> +     [(unspec_volatile [(const_int 0)] UNSPECV_VZEROUPPER)])]
>    "TARGET_AVX"
>    "vzeroupper"
>    [(set_attr "type" "sse")
> Index: gcc/config/i386/predicates.md
> ===================================================================
> --- gcc/config/i386/predicates.md       2019-09-10 19:56:45.337178032 +0100
> +++ gcc/config/i386/predicates.md       2019-10-06 15:19:10.054769556 +0100
> @@ -1441,8 +1441,9 @@ (define_predicate "vzeroall_pattern"
>
>  ;; return true if OP is a vzeroupper pattern.
>  (define_predicate "vzeroupper_pattern"
> -  (and (match_code "unspec_volatile")
> -       (match_test "XINT (op, 1) == UNSPECV_VZEROUPPER")))
> +  (and (match_code "parallel")
> +       (match_code "unspec_volatile" "a")
> +       (match_test "XINT (XVECEXP (op, 0, 0), 1) == UNSPECV_VZEROUPPER")))
>
>  ;; Return true if OP is an addsub vec_merge operation
>  (define_predicate "addsub_vm_operator"
> Index: gcc/config/i386/i386-features.c
> ===================================================================
> --- gcc/config/i386/i386-features.c     2019-09-21 13:56:08.895934718 +0100
> +++ gcc/config/i386/i386-features.c     2019-10-06 15:19:10.054769556 +0100
> @@ -1757,6 +1757,68 @@ convert_scalars_to_vector (bool timode_p
>    return 0;
>  }
>
> +/* Modify the vzeroupper pattern in INSN so that it describes the effect
> +   that the instruction has on the SSE registers. LIVE_REGS are the set
> +   of registers that are live across the instruction.
> +
> +   For a live register R we use:
> +
> +     (set (reg:V2DF R) (reg:V2DF R))
> +
> +   which preserves the low 128 bits but clobbers the upper bits.
> +   For a dead register we just use:
> +
> +     (clobber (reg:V2DF R))
> +
> +   which invalidates any previous contents of R and stops R from becoming
> +   live across the vzeroupper in future. */
> +
> +static void
> +ix86_add_reg_usage_to_vzeroupper (rtx_insn *insn, bitmap live_regs)
> +{
> +  rtx pattern = PATTERN (insn);
> +  unsigned int nregs = TARGET_64BIT ? 16 : 8;
> +  rtvec vec = rtvec_alloc (nregs + 1);
> +  RTVEC_ELT (vec, 0) = XVECEXP (pattern, 0, 0);
> +  for (unsigned int i = 0; i < nregs; ++i)
> +    {
> +      unsigned int regno = GET_SSE_REGNO (i);
> +      rtx reg = gen_rtx_REG (V2DImode, regno);
> +      if (bitmap_bit_p (live_regs, regno))
> +        RTVEC_ELT (vec, i + 1) = gen_rtx_SET (reg, reg);
> +      else
> +        RTVEC_ELT (vec, i + 1) = gen_rtx_CLOBBER (VOIDmode, reg);
> +    }
> +  XVEC (pattern, 0) = vec;
> +  df_insn_rescan (insn);
> +}
> +
> +/* Walk the vzeroupper instructions in the function and annotate them
> +   with the effect that they have on the SSE registers.  */
> +
> +static void
> +ix86_add_reg_usage_to_vzerouppers (void)
> +{
> +  basic_block bb;
> +  rtx_insn *insn;
> +  auto_bitmap live_regs;
> +
> +  df_analyze ();
> +  FOR_EACH_BB_FN (bb, cfun)
> +    {
> +      bitmap_copy (live_regs, df_get_live_out (bb));
> +      df_simulate_initialize_backwards (bb, live_regs);
> +      FOR_BB_INSNS_REVERSE (bb, insn)
> +        {
> +          if (!NONDEBUG_INSN_P (insn))
> +            continue;
> +          if (vzeroupper_pattern (PATTERN (insn), VOIDmode))
> +            ix86_add_reg_usage_to_vzeroupper (insn, live_regs);
> +          df_simulate_one_insn_backwards (bb, insn, live_regs);
> +        }
> +    }
> +}
> +
>  static unsigned int
>  rest_of_handle_insert_vzeroupper (void)
>  {
> @@ -1773,6 +1835,7 @@ rest_of_handle_insert_vzeroupper (void)
>
>    /* Call optimize_mode_switching. */
>    g->get_passes ()->execute_pass_mode_switching ();
> +  ix86_add_reg_usage_to_vzerouppers ();
>    return 0;
>  }
>
> Index: gcc/testsuite/gcc.target/i386/pr91994.c
> ===================================================================
> --- /dev/null   2019-09-17 11:41:18.176664108 +0100
> +++ gcc/testsuite/gcc.target/i386/pr91994.c     2019-10-06 15:19:10.062769500 +0100
> @@ -0,0 +1,35 @@
> +/* { dg-do run } */
> +/* { dg-require-effective-target avx } */
> +/* { dg-options "-O2 -mavx -mvzeroupper" } */
> +
> +#include "avx-check.h"
> +
> +#include <immintrin.h>
> +
> +__m256i x1, x2, x3;
> +
> +__attribute__ ((noinline))
> +static void
> +foo (void)
> +{
> +  x1 = x2;
> +}
> +
> +void
> +bar (void)
> +{
> +  __m256i x = x1;
> +  foo ();
> +  x3 = x;
> +}
> +
> +__attribute__ ((noinline))
> +void
> +avx_test (void)
> +{
> +  __m256i x = _mm256_set1_epi8 (3);
> +  x1 = x;
> +  bar ();
> +  if (__builtin_memcmp (&x3, &x, sizeof (x)))
> +    __builtin_abort ();
> +}
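
For completeness, a rough sketch of the code the bug produces for bar () in the
new test (AT&T syntax; the register choices and addressing are illustrative
only, the real allocation may differ):

  bar:
          vmovdqa x1(%rip), %ymm1   # x = x1
          call    foo               # foo copies x2 to x1 in another %ymm
                                    # register and ends with vzeroupper, which
                                    # zeroes the upper 128 bits of *every*
                                    # %ymm register
          vmovdqa %ymm1, x3(%rip)   # x3 = x: with the stale usage info, bar
                                    # assumes the upper half of %ymm1 survived
                                    # the call (epilogue omitted)

Note that bar cannot simply reload x from x1 after the call, because foo has
overwritten x1. Once the SETs and CLOBBERs are attached to the vzeroupper
pattern, bar has to keep the full 256-bit value of x alive across the call some
other way (e.g. by spilling it to the stack), and avx_test's memcmp passes.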