Uros Bizjak <ubiz...@gmail.com> writes:
>>>> This caused:
>>>>
>>>> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=91994
>>
>> Thanks for reducing & tracking down the underlying cause.
>>
>>> This change doesn't work with -mvzeroupper.  When -mvzeroupper is used,
>>> upper bits of vector registers are clobbered upon callee return if any
>>> MM/ZMM registers are used in callee.  Even if YMM7 isn't used, upper
>>> bits of YMM7 can still be clobbered by vzeroupper when YMM1 is used.
>>
>> The problem here really is that the pattern is just:
>>
>> (define_insn "avx_vzeroupper"
>>   [(unspec_volatile [(const_int 0)] UNSPECV_VZEROUPPER)]
>>   "TARGET_AVX"
>>   "vzeroupper"
>>   ...)
>>
>> and so its effect on the registers isn't modelled at all in rtl.
>> Maybe one option would be to add a parallel:
>>
>>   (set (reg:V2DI N) (reg:V2DI N))
>>
>> for each register.  Or we could do something like I did for the SVE
>> tlsdesc calls, although here that would mean using a call pattern for
>> something that isn't really a call.  Or we could reinstate clobber_high
>> and use that, but that's very much third out of three.
>>
>> I don't think we should add target hooks to get around this, since that's
>> IMO papering over the issue.
>>
>> I'll try the parallel set thing first.
>
> Please note that the vzeroupper insertion pass runs after register
> allocation, so in effect the vzeroupper pattern is hidden from the
> register allocator.
Right, but even post-RA passes rely on the register usage being accurate.
Same for collect_fn_hard_reg_usage, which is the issue here.

The info collected by collect_fn_hard_reg_usage was always wrong for
vzeroupper.  What changed with my patch is that we now use that info for
partly call-clobbered registers as well as "normally" clobbered registers.
So this is another instance of a problem that was previously being masked
by having ix86_hard_regno_call_part_clobbered enforce Win64 rules for
all ABIs.

My first idea of adding:

  (set (reg:V2DI N) (reg:V2DI N))

for all clobbered registers didn't work well because it left
previously-dead registers upwards exposed (obvious in hindsight).
And the second idea of using a fake call would require too many
"is this really a call?" hacks.  So in the end I went for a subpass
that chooses between:

  (set (reg:V2DI N) (reg:V2DI N))

and

  (clobber (reg:V2DI N))

depending on whether register N is live or not.

This fixes the testcase and doesn't seem to regress code quality for
the tests I've tried.

Tested on x86_64-linux-gnu.  OK to install?

Richard


2019-10-06  Richard Sandiford  <richard.sandif...@arm.com>

gcc/
	PR target/91994
	* config/i386/sse.md (avx_vzeroupper): Turn into a define_expand
	and wrap the unspec_volatile in a parallel.
	(*avx_vzeroupper): New define_insn.  Use a match_parallel
	around the unspec_volatile.
	* config/i386/predicates.md (vzeroupper_pattern): Expect the
	unspec_volatile to be wrapped in a parallel.
	* config/i386/i386-features.c (ix86_add_reg_usage_to_vzeroupper)
	(ix86_add_reg_usage_to_vzerouppers): New functions.
	(rest_of_handle_insert_vzeroupper): Use them to add register
	usage information to the vzeroupper instructions.

gcc/testsuite/
	PR target/91994
	* gcc.target/i386/pr91994.c: New test.

Index: gcc/config/i386/sse.md
===================================================================
--- gcc/config/i386/sse.md	2019-09-17 15:27:10.214075253 +0100
+++ gcc/config/i386/sse.md	2019-10-06 15:19:10.062769500 +0100
@@ -19622,9 +19622,16 @@ (define_insn "*avx_vzeroall"
    (set_attr "mode" "OI")])
 
 ;; Clear the upper 128bits of AVX registers, equivalent to a NOP
-;; if the upper 128bits are unused.
-(define_insn "avx_vzeroupper"
-  [(unspec_volatile [(const_int 0)] UNSPECV_VZEROUPPER)]
+;; if the upper 128bits are unused.  Initially we expand the instructions
+;; as though they had no effect on the SSE registers, but later add SETs and
+;; CLOBBERs to the PARALLEL to model the real effect.
+(define_expand "avx_vzeroupper"
+  [(parallel [(unspec_volatile [(const_int 0)] UNSPECV_VZEROUPPER)])]
+  "TARGET_AVX")
+
+(define_insn "*avx_vzeroupper"
+  [(match_parallel 0 "vzeroupper_pattern"
+     [(unspec_volatile [(const_int 0)] UNSPECV_VZEROUPPER)])]
   "TARGET_AVX"
   "vzeroupper"
   [(set_attr "type" "sse")
Index: gcc/config/i386/predicates.md
===================================================================
--- gcc/config/i386/predicates.md	2019-09-10 19:56:45.337178032 +0100
+++ gcc/config/i386/predicates.md	2019-10-06 15:19:10.054769556 +0100
@@ -1441,8 +1441,9 @@ (define_predicate "vzeroall_pattern"
 
 ;; return true if OP is a vzeroupper pattern.
(define_predicate "vzeroupper_pattern" - (and (match_code "unspec_volatile") - (match_test "XINT (op, 1) == UNSPECV_VZEROUPPER"))) + (and (match_code "parallel") + (match_code "unspec_volatile" "a") + (match_test "XINT (XVECEXP (op, 0, 0), 1) == UNSPECV_VZEROUPPER"))) ;; Return true if OP is an addsub vec_merge operation (define_predicate "addsub_vm_operator" Index: gcc/config/i386/i386-features.c =================================================================== --- gcc/config/i386/i386-features.c 2019-09-21 13:56:08.895934718 +0100 +++ gcc/config/i386/i386-features.c 2019-10-06 15:19:10.054769556 +0100 @@ -1757,6 +1757,68 @@ convert_scalars_to_vector (bool timode_p return 0; } +/* Modify the vzeroupper pattern in INSN so that it describes the effect + that the instruction has on the SSE registers. LIVE_REGS are the set + of registers that are live across the instruction. + + For a live register R we use: + + (set (reg:V2DF R) (reg:V2DF R)) + + which preserves the low 128 bits but clobbers the upper bits. + For a dead register we just use: + + (clobber (reg:V2DF R)) + + which invalidates any previous contents of R and stops R from becoming + live across the vzeroupper in future. */ + +static void +ix86_add_reg_usage_to_vzeroupper (rtx_insn *insn, bitmap live_regs) +{ + rtx pattern = PATTERN (insn); + unsigned int nregs = TARGET_64BIT ? 16 : 8; + rtvec vec = rtvec_alloc (nregs + 1); + RTVEC_ELT (vec, 0) = XVECEXP (pattern, 0, 0); + for (unsigned int i = 0; i < nregs; ++i) + { + unsigned int regno = GET_SSE_REGNO (i); + rtx reg = gen_rtx_REG (V2DImode, regno); + if (bitmap_bit_p (live_regs, regno)) + RTVEC_ELT (vec, i + 1) = gen_rtx_SET (reg, reg); + else + RTVEC_ELT (vec, i + 1) = gen_rtx_CLOBBER (VOIDmode, reg); + } + XVEC (pattern, 0) = vec; + df_insn_rescan (insn); +} + +/* Walk the vzeroupper instructions in the function and annotate them + with the effect that they have on the SSE registers. */ + +static void +ix86_add_reg_usage_to_vzerouppers (void) +{ + basic_block bb; + rtx_insn *insn; + auto_bitmap live_regs; + + df_analyze (); + FOR_EACH_BB_FN (bb, cfun) + { + bitmap_copy (live_regs, df_get_live_out (bb)); + df_simulate_initialize_backwards (bb, live_regs); + FOR_BB_INSNS_REVERSE (bb, insn) + { + if (!NONDEBUG_INSN_P (insn)) + continue; + if (vzeroupper_pattern (PATTERN (insn), VOIDmode)) + ix86_add_reg_usage_to_vzeroupper (insn, live_regs); + df_simulate_one_insn_backwards (bb, insn, live_regs); + } + } +} + static unsigned int rest_of_handle_insert_vzeroupper (void) { @@ -1773,6 +1835,7 @@ rest_of_handle_insert_vzeroupper (void) /* Call optimize_mode_switching. */ g->get_passes ()->execute_pass_mode_switching (); + ix86_add_reg_usage_to_vzerouppers (); return 0; } Index: gcc/testsuite/gcc.target/i386/pr91994.c =================================================================== --- /dev/null 2019-09-17 11:41:18.176664108 +0100 +++ gcc/testsuite/gcc.target/i386/pr91994.c 2019-10-06 15:19:10.062769500 +0100 @@ -0,0 +1,35 @@ +/* { dg-do run } */ +/* { dg-require-effective-target avx } */ +/* { dg-options "-O2 -mavx -mvzeroupper" } */ + +#include "avx-check.h" + +#include <immintrin.h> + +__m256i x1, x2, x3; + +__attribute__ ((noinline)) +static void +foo (void) +{ + x1 = x2; +} + +void +bar (void) +{ + __m256i x = x1; + foo (); + x3 = x; +} + +__attribute__ ((noinline)) +void +avx_test (void) +{ + __m256i x = _mm256_set1_epi8 (3); + x1 = x; + bar (); + if (__builtin_memcmp (&x3, &x, sizeof (x))) + __builtin_abort (); +}