Re: [PATCH V2] x86: Enable separate shrink wrapping

Uros Bizjak Wed, 11 Jun 2025 00:02:09 -0700

On Wed, Jun 11, 2025 at 5:33 AM Cui, Lili <lili....@intel.com> wrote:
>
> From: Lili Cui <lili....@intel.com>
>
> Hi Uros,
>
> Thank you very much for providing detailed BKM to reproduce Linux kernel boot 
> failure.  My patch and Matz's patch have this problem. We inserted a SUB 
> between TEST and JLE, and the SUB changes the value of EFlags. The branch JLE 
> here went wrong, and a null pointer appeared after passing through many 
> functions. I created a small case for it and added it into my patch. I use 
> lea instead of sub/add to fix this issue. Shrink-wrap also benefits from the 
> lea instruction.


Thanks for your thorough testing! I think that introducing a boot test
for the linux kernel (besides SPEC tests) in your testing is a great
addition, and I would recommend to perform it especially for an
invasive change like this. In addition, Intel has a continuous
integration testing for the linux kernel patches, perhaps something
similar can be used to boot and test linux with GCC changes.

> Real case:
> Avoid inserting sub between test-je-jle to change EFlags, lea should be used 
> here
>         xorl    %eax, %eax
>         testl   %edi, %edi
>         je      .L11
>         sub     $16, %rsp  ------> leaq    -16(%rsp), %rsp

FAOD, you mean to insert SUB between TESTL and JE? The above case is
not problematic, because flags are produced and consumed without being
clobbered.

>         movq    %r13, 8(%rsp)
>         movl    $1, %r13d
>         jle     .L4
>
> GCC change:
> Emit lea in ix86_expand_prologue/ ix86_expand_epilogue, peephole2 will decide 
> whether to replace lea with sub/add depending on the situation.
>
>
> I learned Matz's patch, the effect we achieved is basically the same, 
> although there are some differences in the implementation methods.
>
> I incorporated some of Matz's good ideas into my patch.
> 1. Matz added a lot of assertions to detect errors in time, and I also added 
> some in this patch.
> 2. Matz turned off shrink-wrap-separate judgments in some special scenarios, 
> and I also added these judgments.
>
> Still keep the difference:
> 1. The position of inserting SUB is different. Matz inserts SUB at the end of 
> prolog, I put SUB at the first line of prologue. I keep my way here.
> 2. Matz handles both SSE and general registers, while I only handle general 
> registers. Enabling SSE registers is easy, but the SSE part seems to be 
> Windows ABI related. I have no way to test it yet, so I won't add it for now.
> 3.  After reloading, some registers were eliminated by optimization are still 
> placed in regs_ever_live, resulting in redundant push/pop when generating 
> prologue later, my patch will eliminate these redundant registers and keep 
> the stack frame layout the same as before.
>
> Collected spec2017 performance on ZNVER5, EMR and ICELAKE. No performance 
> regression was observed.
> For O2 multi-copy :
> 511.povray_r improved by 2.8% on ZNVER5.
> 511.povray_r improved by 4% on EMR
> 511.povray_r improved by 3.3 % ~ 4.6% on ICELAKE.
>
> Bootstrapped & regtested on x86-64-pc-linux-gnu.
> Use this patch to build the latest Linux kernel and boot successfully.
>
> Thanks,
> Lili.
>
>
> This commit implements the target macros (TARGET_SHRINK_WRAP_*) that
> enable separate shrink wrapping for function prologues/epilogues in
> x86.
>
> When performing separate shrink wrapping, we choose to use mov instead
> of push/pop, because using push/pop is more complicated to handle rsp
> adjustment and may lose performance, so here we choose to use mov, which
> has a small impact on code size, but guarantees performance.
>
> Using mov means we need to use sub/add to maintain the stack frame. In
> some special cases, we need to use lea to prevent affecting EFlags.
>
> Avoid inserting sub between test-je-jle to change EFlags, lea should be
> used here.
>
>     foo:
>         xorl    %eax, %eax
>         testl   %edi, %edi
>         je      .L11
>         sub     $16, %rsp  ------> leaq    -16(%rsp), %rsp

Also here, is the placement of SUB correct?

>         movq    %r13, 8(%rsp)
>         movl    $1, %r13d
>         jle     .L4
>
> Tested against SPEC CPU 2017, this change always has a net-positive
> effect on the dynamic instruction count.  See the following table for
> the breakdown on how this reduces the number of dynamic instructions
> per workload on a like-for-like (with/without this commit):
>
> instruction count       base            with commit (commit-base)/commit
> 502.gcc_r               98666845943     96891561634     -1.80%
> 526.blender_r           6.21226E+11     6.12992E+11     -1.33%
> 520.omnetpp_r           1.1241E+11      1.11093E+11     -1.17%
> 500.perlbench_r         1271558717      1263268350      -0.65%
> 523.xalancbmk_r         2.20103E+11     2.18836E+11     -0.58%
> 531.deepsjeng_r         2.73591E+11     2.72114E+11     -0.54%
> 500.perlbench_r         64195557393     63881512409     -0.49%
> 541.leela_r             2.99097E+11     2.98245E+11     -0.29%
> 548.exchange2_r         1.27976E+11     1.27784E+11     -0.15%
> 527.cam4_r              88981458425     88887334679     -0.11%
> 554.roms_r              2.60072E+11     2.59809E+11     -0.10%
>
> Collected spec2017 performance on ZNVER5, EMR and ICELAKE. No performance 
> regression was observed.
>
> For O2 multi-copy :
> 511.povray_r improved by 2.8% on ZNVER5.
> 511.povray_r improved by 4% on EMR
> 511.povray_r improved by 3.3 % ~ 4.6% on ICELAKE.
>
> gcc/ChangeLog:
>
>         * config/i386/i386-protos.h (ix86_get_separate_components):
>         New function.
>         (ix86_components_for_bb): Likewise.
>         (ix86_disqualify_components): Likewise.
>         (ix86_emit_prologue_components): Likewise.
>         (ix86_emit_epilogue_components): Likewise.
>         (ix86_set_handled_components): Likewise.
>         * config/i386/i386.cc (save_regs_using_push_pop):
>         Encapsulate code.

Better say: "Split from ix86_compute_frame_layout."

>         (ix86_compute_frame_layout):
>         Handle save_regs_using_push_pop.

Use save_regs_using_push_pop.

>         (ix86_emit_save_regs_using_mov):
>         Skip registers that are wrapped separately.
>         (pro_epilogue_adjust_stack): Emit insn without clobber.
>         (ix86_expand_prologue): Likewise.

Looks like "Likewise" applies to ix86_emit_save_regs_using_mov. There
are several other changes in ix86_expand_prologue. Please document
them in the ChangeLog entry.

>         (ix86_emit_restore_regs_using_mov): Likewise.
>         (ix86_expand_epilogue): Likewise.

Also in the above tow functions.

>         (ix86_get_separate_components): New function.
>         (ix86_components_for_bb): Likewise.
>         (ix86_disqualify_components): Likewise.
>         (ix86_emit_prologue_components): Likewise.
>         (ix86_emit_epilogue_components): Likewise.
>         (ix86_set_handled_components): Likewise.
>         (TARGET_SHRINK_WRAP_GET_SEPARATE_COMPONENTS): Define.
>         (TARGET_SHRINK_WRAP_COMPONENTS_FOR_BB): Likewise.
>         (TARGET_SHRINK_WRAP_DISQUALIFY_COMPONENTS): Likewise.
>         (TARGET_SHRINK_WRAP_EMIT_PROLOGUE_COMPONENTS): Likewise.
>         (TARGET_SHRINK_WRAP_EMIT_EPILOGUE_COMPONENTS): Likewise.
>         (TARGET_SHRINK_WRAP_SET_HANDLED_COMPONENTS): Likewise.
>         * config/i386/i386.h (struct machine_function):Add
>         reg_is_wrapped_separately array for register wrapping
>         information.
>
> gcc/testsuite/ChangeLog:
>
>         * gcc.target/x86_64/abi/callabi/leaf-2.c: Adjust the test.
>         * gcc.target/i386/interrupt-16.c: Likewise.
>         * gfortran.dg/guality/arg1.f90: Likewise.
>         * g++.target/i386/shrink_wrap_separate.C: New test.
>         * gcc.target/i386/shrink_wrap_separate_check_lea.c: Likewise.
>
> Co-authored-by: Michael Matz <m...@suse.de>
> ---
>  gcc/config/i386/i386-protos.h                 |   7 +
>  gcc/config/i386/i386.cc                       | 321 +++++++++++++++---
>  gcc/config/i386/i386.h                        |   1 +
>  .../g++.target/i386/shrink_wrap_separate.C    |  24 ++
>  gcc/testsuite/gcc.target/i386/interrupt-16.c  |   4 +-
>  .../i386/shrink_wrap_separate_check_lea.c     |  29 ++
>  .../gcc.target/x86_64/abi/callabi/leaf-2.c    |   2 +-
>  gcc/testsuite/gfortran.dg/guality/arg1.f90    |   2 +-
>  8 files changed, 343 insertions(+), 47 deletions(-)
>  create mode 100644 gcc/testsuite/g++.target/i386/shrink_wrap_separate.C
>  create mode 100644 
> gcc/testsuite/gcc.target/i386/shrink_wrap_separate_check_lea.c
>
> diff --git a/gcc/config/i386/i386-protos.h b/gcc/config/i386/i386-protos.h
> index 10863ab9e9d..86194b3d319 100644
> --- a/gcc/config/i386/i386-protos.h
> +++ b/gcc/config/i386/i386-protos.h
> @@ -437,6 +437,13 @@ extern rtl_opt_pass *make_pass_align_tight_loops 
> (gcc::context *);
>  extern bool ix86_has_no_direct_extern_access;
>  extern bool ix86_rpad_gate ();
>
> +extern sbitmap ix86_get_separate_components (void);
> +extern sbitmap ix86_components_for_bb (basic_block);
> +extern void ix86_disqualify_components (sbitmap, edge, sbitmap, bool);
> +extern void ix86_emit_prologue_components (sbitmap);
> +extern void ix86_emit_epilogue_components (sbitmap);
> +extern void ix86_set_handled_components (sbitmap);
> +
>  /* In i386-expand.cc.  */
>  bool ix86_check_builtin_isa_match (unsigned int, HOST_WIDE_INT*,
>                                    HOST_WIDE_INT*);
> diff --git a/gcc/config/i386/i386.cc b/gcc/config/i386/i386.cc
> index d48654a729a..8f2193efe29 100644
> --- a/gcc/config/i386/i386.cc
> +++ b/gcc/config/i386/i386.cc
> @@ -6905,6 +6905,26 @@ ix86_pro_and_epilogue_can_use_push2pop2 (int nregs)
>          && (nregs + aligned) >= 3;
>  }
>
> +/* Check if push/pop should be used to save/restore registers.  */
> +static bool
> +save_regs_using_push_pop (HOST_WIDE_INT to_allocate)
> +{
> +  return ((!to_allocate && cfun->machine->frame.nregs <= 1)
> +         || (TARGET_64BIT && to_allocate >= HOST_WIDE_INT_C (0x80000000))
> +         /* If static stack checking is enabled and done with probes,
> +            the registers need to be saved before allocating the frame.  */
> +         || flag_stack_check == STATIC_BUILTIN_STACK_CHECK
> +         /* If stack clash probing needs a loop, then it needs a
> +            scratch register.  But the returned register is only guaranteed
> +            to be safe to use after register saves are complete.  So if
> +            stack clash protections are enabled and the allocated frame is
> +            larger than the probe interval, then use pushes to save
> +            callee saved registers.  */
> +         || (flag_stack_clash_protection
> +             && !ix86_target_stack_probe ()
> +             && to_allocate > get_probe_interval ()));
> +}
> +
>  /* Fill structure ix86_frame about frame of currently computed function.  */
>
>  static void
> @@ -7189,20 +7209,7 @@ ix86_compute_frame_layout (void)
>    /* Size prologue needs to allocate.  */
>    to_allocate = offset - frame->sse_reg_save_offset;
>
> -  if ((!to_allocate && frame->nregs <= 1)
> -      || (TARGET_64BIT && to_allocate >= HOST_WIDE_INT_C (0x80000000))
> -       /* If static stack checking is enabled and done with probes,
> -         the registers need to be saved before allocating the frame.  */
> -      || flag_stack_check == STATIC_BUILTIN_STACK_CHECK
> -      /* If stack clash probing needs a loop, then it needs a
> -        scratch register.  But the returned register is only guaranteed
> -        to be safe to use after register saves are complete.  So if
> -        stack clash protections are enabled and the allocated frame is
> -        larger than the probe interval, then use pushes to save
> -        callee saved registers.  */
> -      || (flag_stack_clash_protection
> -         && !ix86_target_stack_probe ()
> -         && to_allocate > get_probe_interval ()))
> +  if (save_regs_using_push_pop (to_allocate))
>      frame->save_regs_using_mov = false;
>
>    if (ix86_using_red_zone ()
> @@ -7660,7 +7667,9 @@ ix86_emit_save_regs_using_mov (HOST_WIDE_INT cfa_offset)
>    for (regno = 0; regno < FIRST_PSEUDO_REGISTER; regno++)
>      if (GENERAL_REGNO_P (regno) && ix86_save_reg (regno, true, true))
>        {
> -        ix86_emit_save_reg_using_mov (word_mode, regno, cfa_offset);
> +       /* Skip registers, already processed by shrink wrap separate.  */
> +       if (!cfun->machine->reg_is_wrapped_separately[regno])
> +         ix86_emit_save_reg_using_mov (word_mode, regno, cfa_offset);
>         cfa_offset -= UNITS_PER_WORD;
>        }
>  }
> @@ -7753,8 +7762,12 @@ pro_epilogue_adjust_stack (rtx dest, rtx src, rtx 
> offset,
>         add_frame_related_expr = true;
>      }
>
> +  if (crtl->shrink_wrapped_separate)
> +  insn = emit_insn (gen_rtx_SET (dest, gen_rtx_PLUS (Pmode, src, addend)));

Please use ix86_expand_binary_operator here, it will correctly
generate LEA during pro/epilogue generation. Also, please check
indenting.

> +  else
>    insn = emit_insn (gen_pro_epilogue_adjust_stack_add
>                     (Pmode, dest, src, addend));
> +
>    if (style >= 0)
>      ix86_add_queued_cfa_restore_notes (insn);
>
> @@ -9218,13 +9231,30 @@ ix86_expand_prologue (void)
>          the stack frame saving one cycle of the prologue.  However, avoid
>          doing this if we have to probe the stack; at least on x86_64 the
>          stack probe can turn into a call that clobbers a red zone location. 
> */
> -      else if (ix86_using_red_zone ()
> -              && (! TARGET_STACK_PROBE
> -                  || frame.stack_pointer_offset < CHECK_STACK_LIMIT))
> +      else if ((ix86_using_red_zone ()
> +               && (! TARGET_STACK_PROBE
> +                   || frame.stack_pointer_offset < CHECK_STACK_LIMIT))
> +              || crtl->shrink_wrapped_separate)
>         {
> +         HOST_WIDE_INT allocate_offset;
> +         if (crtl->shrink_wrapped_separate)
> +           {
> +             allocate_offset = m->fs.sp_offset - frame.stack_pointer_offset;
> +
> +             /* Adjust the total offset at the beginning of the function.  */
> +             pro_epilogue_adjust_stack (stack_pointer_rtx, stack_pointer_rtx,
> +                                        GEN_INT (allocate_offset), -1,
> +                                        m->fs.cfa_reg == stack_pointer_rtx);
> +             m->fs.sp_offset = cfun->machine->frame.stack_pointer_offset;
> +           }
> +
>           ix86_emit_save_regs_using_mov (frame.reg_save_offset);
> -         cfun->machine->red_zone_used = true;
>           int_registers_saved = true;
> +
> +         if (ix86_using_red_zone ()
> +             && (! TARGET_STACK_PROBE
> +                 || frame.stack_pointer_offset < CHECK_STACK_LIMIT))
> +           cfun->machine->red_zone_used = true;
>         }
>      }
>
> @@ -9344,6 +9374,7 @@ ix86_expand_prologue (void)
>        && flag_stack_clash_protection
>        && !ix86_target_stack_probe ())
>      {
> +      gcc_assert (!crtl->shrink_wrapped_separate);

Please put some vertical space after gcc_assert, so it will stand out more...

>        ix86_adjust_stack_and_probe (allocate, int_registers_saved, false);
>        allocate = 0;
>      }
> @@ -9354,6 +9385,8 @@ ix86_expand_prologue (void)
>      {
>        const HOST_WIDE_INT probe_interval = get_probe_interval ();
>
> +      gcc_assert (!crtl->shrink_wrapped_separate);
> +
>        if (STACK_CHECK_MOVING_SP)
>         {
>           if (crtl->is_leaf
> @@ -9410,12 +9443,14 @@ ix86_expand_prologue (void)
>    else if (!ix86_target_stack_probe ()
>            || frame.stack_pointer_offset < CHECK_STACK_LIMIT)
>      {
> +      gcc_assert (!crtl->shrink_wrapped_separate);

... also here and in other places.

>        pro_epilogue_adjust_stack (stack_pointer_rtx, stack_pointer_rtx,
>                                  GEN_INT (-allocate), -1,
>                                  m->fs.cfa_reg == stack_pointer_rtx);
>      }
>    else
>      {
> +      gcc_assert (!crtl->shrink_wrapped_separate);
>        rtx eax = gen_rtx_REG (Pmode, AX_REG);
>        rtx r10 = NULL;
>        const bool sp_is_cfa_reg = (m->fs.cfa_reg == stack_pointer_rtx);
> @@ -9801,30 +9836,35 @@ ix86_emit_restore_regs_using_mov (HOST_WIDE_INT 
> cfa_offset,
>    for (regno = 0; regno < FIRST_PSEUDO_REGISTER; regno++)
>      if (GENERAL_REGNO_P (regno) && ix86_save_reg (regno, maybe_eh_return, 
> true))
>        {
> -       rtx reg = gen_rtx_REG (word_mode, regno);
> -       rtx mem;
> -       rtx_insn *insn;
> -
> -       mem = choose_baseaddr (cfa_offset, NULL);
> -       mem = gen_frame_mem (word_mode, mem);
> -       insn = emit_move_insn (reg, mem);
>
> -        if (m->fs.cfa_reg == crtl->drap_reg && regno == REGNO 
> (crtl->drap_reg))
> +       /* Skip registers, already processed by shrink wrap separate.  */
> +       if (!cfun->machine->reg_is_wrapped_separately[regno])
>           {
> -           /* Previously we'd represented the CFA as an expression
> -              like *(%ebp - 8).  We've just popped that value from
> -              the stack, which means we need to reset the CFA to
> -              the drap register.  This will remain until we restore
> -              the stack pointer.  */
> -           add_reg_note (insn, REG_CFA_DEF_CFA, reg);
> -           RTX_FRAME_RELATED_P (insn) = 1;
> +           rtx reg = gen_rtx_REG (word_mode, regno);
> +           rtx mem;
> +           rtx_insn *insn;
>
> -           /* This means that the DRAP register is valid for addressing.  */
> -           m->fs.drap_valid = true;
> -         }
> -       else
> -         ix86_add_cfa_restore_note (NULL, reg, cfa_offset);
> +           mem = choose_baseaddr (cfa_offset, NULL);
> +           mem = gen_frame_mem (word_mode, mem);
> +           insn = emit_move_insn (reg, mem);
>
> +           if (m->fs.cfa_reg == crtl->drap_reg
> +               && regno == REGNO (crtl->drap_reg))
> +             {
> +               /* Previously we'd represented the CFA as an expression
> +                  like *(%ebp - 8).  We've just popped that value from
> +                  the stack, which means we need to reset the CFA to
> +                  the drap register.  This will remain until we restore
> +                  the stack pointer.  */
> +               add_reg_note (insn, REG_CFA_DEF_CFA, reg);
> +               RTX_FRAME_RELATED_P (insn) = 1;
> +
> +               /* DRAP register is valid for addressing.  */
> +               m->fs.drap_valid = true;
> +             }
> +           else
> +             ix86_add_cfa_restore_note (NULL, reg, cfa_offset);
> +         }
>         cfa_offset -= UNITS_PER_WORD;
>        }
>  }
> @@ -10103,10 +10143,11 @@ ix86_expand_epilogue (int style)
>       less work than reloading sp and popping the register.  */
>    else if (!sp_valid_at (frame.hfp_save_offset) && frame.nregs <= 1)
>      restore_regs_via_mov = true;
> -  else if (TARGET_EPILOGUE_USING_MOVE
> -          && cfun->machine->use_fast_prologue_epilogue
> -          && (frame.nregs > 1
> -              || m->fs.sp_offset != reg_save_offset))
> +  else if (crtl->shrink_wrapped_separate
> +          || (TARGET_EPILOGUE_USING_MOVE
> +              && cfun->machine->use_fast_prologue_epilogue
> +              && (frame.nregs > 1
> +                  || m->fs.sp_offset != reg_save_offset)))
>      restore_regs_via_mov = true;
>    else if (frame_pointer_needed
>            && !frame.nregs
> @@ -10120,6 +10161,9 @@ ix86_expand_epilogue (int style)
>    else
>      restore_regs_via_mov = false;
>
> +  if (crtl->shrink_wrapped_separate)
> +    gcc_assert (restore_regs_via_mov);
> +
>    if (restore_regs_via_mov || frame.nsseregs)
>      {
>        /* Ensure that the entire register save area is addressable via
> @@ -10172,6 +10216,7 @@ ix86_expand_epilogue (int style)
>        gcc_assert (m->fs.sp_offset == UNITS_PER_WORD);
>        gcc_assert (!crtl->drap_reg);
>        gcc_assert (!frame.nregs);
> +      gcc_assert (!crtl->shrink_wrapped_separate);
>      }
>    else if (restore_regs_via_mov)
>      {
> @@ -10186,6 +10231,8 @@ ix86_expand_epilogue (int style)
>           rtx sa = EH_RETURN_STACKADJ_RTX;
>           rtx_insn *insn;
>
> +         gcc_assert (!crtl->shrink_wrapped_separate);
> +
>           /* Stack realignment doesn't work with eh_return.  */
>           if (crtl->stack_realign_needed)
>             sorry ("Stack realignment not supported with "
> @@ -28066,6 +28113,194 @@ ix86_cannot_copy_insn_p (rtx_insn *insn)
>  #undef TARGET_DOCUMENTATION_NAME
>  #define TARGET_DOCUMENTATION_NAME "x86"
>
> +/* Implement TARGET_SHRINK_WRAP_GET_SEPARATE_COMPONENTS.  */
> +sbitmap
> +ix86_get_separate_components (void)
> +{
> +  HOST_WIDE_INT offset, to_allocate;
> +  sbitmap components = sbitmap_alloc (FIRST_PSEUDO_REGISTER);
> +  bitmap_clear (components);
> +  struct machine_function *m = cfun->machine;
> +
> +  offset = m->frame.stack_pointer_offset;
> +  to_allocate = offset - m->frame.sse_reg_save_offset;
> +
> +  /* Shrink wrap separate uses MOV, which means APX PPX cannot be used.
> +     Experiments show that APX PPX can speed up the prologue.  If the 
> function
> +     does not exit early during actual execution, then using APX PPX is 
> faster.
> +     If the function always exits early during actual execution, then shrink
> +     wrap separate reduces the number of MOV (PUSH/POP) instructions actually
> +     executed, thus speeding up execution.
> +     foo:
> +         movl    $1, %eax
> +         testq   %rdi, %rdi
> +         jne.L60
> +         ret   ---> early return.
> +     .L60:
> +         subq    $88, %rsp     ---> belong to prologue.
> +         xorl    %eax, %eax
> +         movq    %rbx, 40 (%rsp) ---> belong to prologue.
> +         movq    8 (%rdi), %rbx
> +         movq    %rbp, 48 (%rsp) ---> belong to prologue.
> +         movq    %rdi, %rbp
> +         testq   %rbx, %rbx
> +         jne.L61
> +         movq    40 (%rsp), %rbx
> +         movq    48 (%rsp), %rbp
> +         addq    $88, %rsp
> +         ret
> +     .L61:
> +         movq    %r12, 56 (%rsp) ---> belong to prologue.
> +         movq    %r13, 64 (%rsp) ---> belong to prologue.
> +         movq    %r14, 72 (%rsp) ---> belong to prologue.

Please remove leading spaces before tab in the above assembly. "git
diff" will mark this part in red.

> +     ... ...
> +
> +     It is a trade-off.  Disable shrink wrap separate when PPX is enabled.  
> */

No need to reiterate that this is a trade-off. You already explained
it above in great detail.

> +  if ((TARGET_APX_PPX && !crtl->calls_eh_return)
> +      || cfun->machine->func_type != TYPE_NORMAL
> +      || TARGET_SEH
> +      || crtl->stack_realign_needed
> +      || m->call_ms2sysv)
> +    return components;
> +
> +  /* Since shrink wrapping separate uses MOV instead of PUSH/POP.
> +     Disable shrink wrap separate when MOV is prohibited.  */
> +  if (save_regs_using_push_pop (to_allocate))
> +    return components;
> +
> +  for (unsigned int regno = 0; regno < FIRST_PSEUDO_REGISTER; regno++)
> +    if (GENERAL_REGNO_P (regno) && ix86_save_reg (regno, true, true))
> +      {
> +       /* Skip registers with large offsets, where a pseudo may be needed.  
> */
> +       if (IN_RANGE (offset, -0x8000, 0x7fff))
> +         bitmap_set_bit (components, regno);
> +       offset += UNITS_PER_WORD;
> +      }
> +
> +  /* Don't mess with the following registers.  */
> +  if (frame_pointer_needed)
> +    bitmap_clear_bit (components, HARD_FRAME_POINTER_REGNUM);
> +
> +  if (crtl->drap_reg)
> +    bitmap_clear_bit (components, REGNO (crtl->drap_reg));
> +
> +  if (pic_offset_table_rtx)
> +    bitmap_clear_bit (components, REAL_PIC_OFFSET_TABLE_REGNUM);
> +
> +  return components;
> +}
> +
> +/* Implement TARGET_SHRINK_WRAP_COMPONENTS_FOR_BB.  */
> +sbitmap
> +ix86_components_for_bb (basic_block bb)
> +{
> +  bitmap in = DF_LIVE_IN (bb);
> +  bitmap gen = &DF_LIVE_BB_INFO (bb)->gen;
> +  bitmap kill = &DF_LIVE_BB_INFO (bb)->kill;
> +
> +  sbitmap components = sbitmap_alloc (FIRST_PSEUDO_REGISTER);
> +  bitmap_clear (components);
> +
> +  function_abi_aggregator callee_abis;
> +  rtx_insn *insn;
> +  FOR_BB_INSNS (bb, insn)
> +    if (CALL_P (insn))
> +      callee_abis.note_callee_abi (insn_callee_abi (insn));
> +  HARD_REG_SET extra_caller_saves = callee_abis.caller_save_regs 
> (*crtl->abi);
> +
> +  /* GPRs are used in a bb if they are in the IN, GEN, or KILL sets.  */
> +  for (unsigned int regno = 0; regno < FIRST_PSEUDO_REGISTER; regno++)
> +    if (!fixed_regs[regno]
> +       && (TEST_HARD_REG_BIT (extra_caller_saves, regno)
> +           || bitmap_bit_p (in, regno)
> +           || bitmap_bit_p (gen, regno)
> +           || bitmap_bit_p (kill, regno)))
> +      bitmap_set_bit (components, regno);
> +
> +  return components;
> +}
> +
> +/* Implement TARGET_SHRINK_WRAP_DISQUALIFY_COMPONENTS.  */
> +void
> +ix86_disqualify_components (sbitmap, edge, sbitmap, bool)
> +{
> +  /* Nothing to do for x86.  */
> +}
> +
> +/* Implement TARGET_SHRINK_WRAP_EMIT_PROLOGUE_COMPONENTS.  */
> +void
> +ix86_emit_prologue_components (sbitmap components)
> +{
> +  HOST_WIDE_INT cfa_offset;
> +  struct machine_function *m = cfun->machine;
> +
> +  cfa_offset = m->frame.reg_save_offset + m->fs.sp_offset
> +              - m->frame.stack_pointer_offset;
> +  for (unsigned int regno = 0; regno < FIRST_PSEUDO_REGISTER; regno++)
> +    if (GENERAL_REGNO_P (regno) && ix86_save_reg (regno, true, true))
> +      {
> +       if (bitmap_bit_p (components, regno))
> +         ix86_emit_save_reg_using_mov (word_mode, regno, cfa_offset);
> +       cfa_offset -= UNITS_PER_WORD;
> +      }
> +}
> +
> +/* Implement TARGET_SHRINK_WRAP_EMIT_EPILOGUE_COMPONENTS.  */
> +void
> +ix86_emit_epilogue_components (sbitmap components)
> +{
> +  HOST_WIDE_INT cfa_offset;
> +  struct machine_function *m = cfun->machine;
> +  cfa_offset = m->frame.reg_save_offset + m->fs.sp_offset
> +              - m->frame.stack_pointer_offset;
> +
> +  for (unsigned int regno = 0; regno < FIRST_PSEUDO_REGISTER; regno++)
> +    if (GENERAL_REGNO_P (regno) && ix86_save_reg (regno, true, true))
> +      {
> +       if (bitmap_bit_p (components, regno))
> +         {
> +           rtx reg = gen_rtx_REG (word_mode, regno);
> +           rtx mem;
> +           rtx_insn *insn;
> +
> +           mem = choose_baseaddr (cfa_offset, NULL);
> +           mem = gen_frame_mem (word_mode, mem);
> +           insn = emit_move_insn (reg, mem);
> +
> +           RTX_FRAME_RELATED_P (insn) = 1;
> +           add_reg_note (insn, REG_CFA_RESTORE, reg);
> +         }
> +       cfa_offset -= UNITS_PER_WORD;
> +      }
> +}
> +

Please put a one line comment as is the case with the above functions.

> +void
> +ix86_set_handled_components (sbitmap components)
> +{
> +  for (unsigned int regno = 0; regno < FIRST_PSEUDO_REGISTER; regno++)
> +    if (bitmap_bit_p (components, regno))
> +      {
> +       cfun->machine->reg_is_wrapped_separately[regno] = true;
> +       cfun->machine->use_fast_prologue_epilogue = true;
> +       cfun->machine->frame.save_regs_using_mov = true;
> +      }
> +}
> +
> +#undef TARGET_SHRINK_WRAP_GET_SEPARATE_COMPONENTS
> +#define TARGET_SHRINK_WRAP_GET_SEPARATE_COMPONENTS 
> ix86_get_separate_components
> +#undef TARGET_SHRINK_WRAP_COMPONENTS_FOR_BB
> +#define TARGET_SHRINK_WRAP_COMPONENTS_FOR_BB ix86_components_for_bb
> +#undef TARGET_SHRINK_WRAP_DISQUALIFY_COMPONENTS
> +#define TARGET_SHRINK_WRAP_DISQUALIFY_COMPONENTS ix86_disqualify_components
> +#undef TARGET_SHRINK_WRAP_EMIT_PROLOGUE_COMPONENTS
> +#define TARGET_SHRINK_WRAP_EMIT_PROLOGUE_COMPONENTS \
> +  ix86_emit_prologue_components
> +#undef TARGET_SHRINK_WRAP_EMIT_EPILOGUE_COMPONENTS
> +#define TARGET_SHRINK_WRAP_EMIT_EPILOGUE_COMPONENTS \
> +  ix86_emit_epilogue_components
> +#undef TARGET_SHRINK_WRAP_SET_HANDLED_COMPONENTS
> +#define TARGET_SHRINK_WRAP_SET_HANDLED_COMPONENTS ix86_set_handled_components
> +
>  struct gcc_target targetm = TARGET_INITIALIZER;
>
>  #include "gt-i386.h"
> diff --git a/gcc/config/i386/i386.h b/gcc/config/i386/i386.h
> index d32d9ad997e..5d4fd333169 100644
> --- a/gcc/config/i386/i386.h
> +++ b/gcc/config/i386/i386.h
> @@ -2821,6 +2821,7 @@ struct GTY(()) machine_function {
>    /* Cached initial frame layout for the current function.  */
>    struct ix86_frame frame;

A short comment would be nice here...
> +  bool reg_is_wrapped_separately[FIRST_PSEUDO_REGISTER];

... and some vertical space here.

>    /* For -fsplit-stack support: A stack local which holds a pointer to
>       the stack arguments for a function with a variable number of
>       arguments.  This is set at the start of the function and is used
> diff --git a/gcc/testsuite/g++.target/i386/shrink_wrap_separate.C 
> b/gcc/testsuite/g++.target/i386/shrink_wrap_separate.C
> new file mode 100644
> index 00000000000..31bd984df02
> --- /dev/null
> +++ b/gcc/testsuite/g++.target/i386/shrink_wrap_separate.C
> @@ -0,0 +1,24 @@
> +/* { dg-do compile { target { ! ia32 } } } */

Does new functionality also handle 32-bit targets? If this is the
case, then please use two scan-rtl-dumps, one for ia32 and the other
for ! ia32. Don't disable the test just because of a different string
in the dump.

> +/* { dg-options "-O2 -fdump-rtl-pro_and_epilogue" } */
> +typedef struct a b;
> +typedef double c;
> +struct a {
> +  b *d;
> +  b *e;
> +};
> +struct f {
> +  c g;
> +};
> +inline bool h(c i, b *m) {
> +  b *j = m->e;
> +  for (; m->e; j = j->d)
> +    if (h(i, j))
> +      return 0;
> +  return m;
> +}
> +bool k() {
> +  f *l;
> +  b *n;
> + return h(l->g, n);
> +}
> +/* { dg-final { scan-rtl-dump "The components we wrap separately are \\\[sep 
> 40 41 42 43\\\]" "pro_and_epilogue" } } */
> diff --git a/gcc/testsuite/gcc.target/i386/interrupt-16.c 
> b/gcc/testsuite/gcc.target/i386/interrupt-16.c
> index cb45ba54e3d..ca4441b3aee 100644
> --- a/gcc/testsuite/gcc.target/i386/interrupt-16.c
> +++ b/gcc/testsuite/gcc.target/i386/interrupt-16.c
> @@ -18,5 +18,5 @@ foo (int i)
>  /* { dg-final { scan-assembler-not "(push|pop)(l|q)\[\\t \]*%(r|e)bp" } } */
>  /* { dg-final { scan-assembler-not "(push|pop)l\[\\t \]*%edi" { target ia32 
> } } } */
>  /* { dg-final { scan-assembler-not "(push|pop)q\[\\t \]*%r\[0-9\]+" { target 
> { ! ia32 } } } } */
> -/* { dg-final { scan-assembler-times "pushq\[\\t \]*%rdi" 1 { target { ! 
> ia32 } } } } */
> -/* { dg-final { scan-assembler-times "popq\[\\t \]*%rdi" 1 { target { ! ia32 
> } } } } */
> +/* { dg-final { scan-assembler-times "(pushq.*%rdi|subq.*\\\$8,.*%rsp)" 1 { 
> target { ! ia32 } } } } */
> +/* { dg-final { scan-assembler-times "(popq.*%rdi|addq.*\\\$8,.*%rsp)" 1 { 
> target { ! ia32 } } } } */
> diff --git a/gcc/testsuite/gcc.target/i386/shrink_wrap_separate_check_lea.c 
> b/gcc/testsuite/gcc.target/i386/shrink_wrap_separate_check_lea.c
> new file mode 100644
> index 00000000000..0f2449f68b6
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/i386/shrink_wrap_separate_check_lea.c
> @@ -0,0 +1,29 @@
> +/* { dg-do compile { target { ! ia32 } } } */
> +/* { dg-options "-O2 -fdump-rtl-pro_and_epilogue" } */
> +
> +/* Avoid inserting sub between test-je-jle to change EFlags, lea should be 
> used here
> +        xorl    %eax, %eax
> +        testl   %edi, %edi
> +        je      .L11
> +        sub     $16, %rsp  ------> leaq    -16(%rsp), %rsp
> +        movq    %r13, 8(%rsp)
> +        movl    $1, %r13d
> +        jle     .L4
> +*/
> +int foo (int num)
> +{
> +  if (!num)
> +    return 0;
> +
> +  register int r13 __asm ("r13") = 1;
> +
> +  for ( int i = 0; i < num; i++)
> +    {
> +      register int r12 __asm ("r12") = 1;
> +      asm volatile ("" : "+r" (r12), "+r" (r13));
> +    }
> +
> +  return 1;
> +}
> +/* { dg-final { scan-rtl-dump "The components we wrap separately are \\\[sep 
> 40\\\]" "pro_and_epilogue" } } */
> +/* { dg-final { scan-assembler "leaq.*(%rsp)" } } */
> diff --git a/gcc/testsuite/gcc.target/x86_64/abi/callabi/leaf-2.c 
> b/gcc/testsuite/gcc.target/x86_64/abi/callabi/leaf-2.c
> index 5f3d3e166af..46fc4648dbd 100644
> --- a/gcc/testsuite/gcc.target/x86_64/abi/callabi/leaf-2.c
> +++ b/gcc/testsuite/gcc.target/x86_64/abi/callabi/leaf-2.c
> @@ -1,5 +1,5 @@
>  /* { dg-do compile } */
> -/* { dg-options "-O2 -fno-tree-vectorize -mabi=sysv" } */
> +/* { dg-options "-O2 -fno-tree-vectorize -mabi=sysv 
> -fno-shrink-wrap-separate" } */
>
>  extern int glb1, gbl2, gbl3;
>
> diff --git a/gcc/testsuite/gfortran.dg/guality/arg1.f90 
> b/gcc/testsuite/gfortran.dg/guality/arg1.f90
> index 332a4ed1d87..775b7bb304f 100644
> --- a/gcc/testsuite/gfortran.dg/guality/arg1.f90
> +++ b/gcc/testsuite/gfortran.dg/guality/arg1.f90
> @@ -1,5 +1,5 @@
>  ! { dg-do run }
> -! { dg-options "-g" }
> +! { dg-options "-fno-shrink-wrap -g" }
>    integer :: a(10), b(12)
>    call sub (a, 10)
>    call sub (b, 12)
> --
> 2.34.1
>

Re: [PATCH V2] x86: Enable separate shrink wrapping

Reply via email to