Hi, I had to tweak the testcase a bit so that it does not compute a minimum: GCC optimizes that early into MIN_EXPR, throwing away any profile information. If we get serious here we can maintain it via a histogram, but I am not sure it is worth the effort, at least until the IL is sanitized and expansion is cleaned up with the tuples branch.
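For readers unfamiliar with the folding mentioned above, here is a minimal illustration (hypothetical code, not the actual testcase) of the kind of conditional that gets folded away:

```c
/* Illustrative only (not the real testcase): a conditional of this shape
   is folded into MIN_EXPR <a, b> early in GCC's middle end, so any branch
   probability attached to the comparison is discarded before RTL
   expansion ever sees a branch.  */
int min_of (int a, int b)
{
  return a < b ? a : b;
}
```

This is why the testcase had to avoid computing a minimum: once the branch is gone, there is no branch cost decision left to observe.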
I also had to fix a bug where branch prediction ignored __builtin_expect in any early-inlined function, and update your testcase to not use __builtin_expect in the predictable case. However, this is what I get on AthlonXP:

  no deps, predictable    -- C code took 13.71ns per iteration
  no deps, predictable    -- cmov code took 13.83ns per iteration
  no deps, predictable    -- jmp code took 13.94ns per iteration
  has deps, predictable   -- C code took 15.54ns per iteration
  has deps, predictable   -- cmov code took 22.21ns per iteration
  has deps, predictable   -- jmp code took 16.55ns per iteration
  no deps, unpredictable  -- C code took 13.99ns per iteration
  no deps, unpredictable  -- cmov code took 13.76ns per iteration
  no deps, unpredictable  -- jmp code took 26.12ns per iteration
  has deps, unpredictable -- C code took 120.37ns per iteration
  has deps, unpredictable -- cmov code took 120.76ns per iteration
  has deps, unpredictable -- jmp code took 165.82ns per iteration

The patch is quite SPEC neutral, saving 190Kb in FDO binaries. Still, I think it is worthwhile to have, especially because I believe all the target COST predicates should take a hotness argument, so that we get the same results for -Os as for -O2 with profile feedback specifying that nothing is executed, or when one marks all functions cold. At the moment, profile feedback with all functions not executed leads to code smaller than -O2, but closer to -O2 than to -Os, so there is quite some fruit here. With LTO, or for codebases with more __builtin_expect and cold hints such as the kernel or libstdc++, we can get a lot of these benefits without FDO too.

The patch was bootstrapped/regtested on i686-linux. I can approve the i386 and prediction changes; however, I will wait for approval of the BRANCH_COST target macro change. Note that a slightly irritating fact is that BRANCH_COST is queried during expansion, when we still don't have hotness information propagated (I have a separate patch for this which I will update later this week).
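The "has deps" vs "no deps" split in the numbers above refers to whether each iteration's result feeds the next one. A rough sketch of that benchmark structure (illustrative only; the names and loop bodies are assumptions, not the actual testcase):

```c
/* Sketch of the benchmark structure discussed above (assumed, not the
   real testcase).  The conditional below may be expanded as either a
   branch (jmp) or a conditional move (cmov); whether the data pattern
   is regular or random controls the "predictable"/"unpredictable" axis.  */
static int cond_step (int acc, int v)
{
  return (v & 1) ? acc + v : acc - v;
}

/* "has deps": each iteration consumes the previous result, so a cmov's
   latency sits on the critical path even when the branch would have
   predicted perfectly.  */
int run_with_deps (const int *data, int n)
{
  int acc = 0;
  for (int i = 0; i < n; i++)
    acc = cond_step (acc, data[i]);
  return acc;
}

/* "no deps": iterations are independent, so cmov latency can be hidden,
   while a mispredicted branch still costs a pipeline flush.  */
int run_no_deps (const int *data, int n)
{
  int sum = 0;
  for (int i = 0; i < n; i++)
    sum += cond_step (0, data[i]);
  return sum;
}
```

This is the shape that makes "has deps, predictable -- cmov" the losing case in the table: the dependent chain pays the cmov latency with nothing for the branch predictor to win back.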
The patch uses at least the overall hotness info about the function, which should still make a difference. We however also use the check from frontend folding, which probably ought to go away completely.

Honza

	* optabs.c (expand_abs_nojump): Update BRANCH_COST call.
	* fold-const.c (LOGICAL_OP_NON_SHORT_CIRCUIT, fold_truthop):
	Likewise.
	* dojump.c (do_jump): Likewise.
	* ifcvt.c (MAX_CONDITIONAL_EXECUTE): Likewise.
	(struct noce_if_info): Add branch_cost field.
	(noce_try_store_flag_constants, noce_try_addcc,
	noce_try_store_flag_mask, noce_try_cmove_arith,
	noce_find_if_block, find_if_case_1, find_if_case_2): Use computed
	branch cost.
	* expr.h (BRANCH_COST): Update default.
	* predict.c (predictable_edge_p): New function.
	* expmed.c (expand_smod_pow2, expand_sdiv_pow2, emit_store_flag):
	Update BRANCH_COST call.
	* basic-block.h (predictable_edge_p): Declare.
	* config/alpha/alpha.h (BRANCH_COST): Update.
	* config/frv/frv.h (BRANCH_COST): Update.
	* config/s390/s390.h (BRANCH_COST): Update.
	* config/spu/spu.h (BRANCH_COST): Update.
	* config/sparc/sparc.h (BRANCH_COST): Update.
	* config/m32r/m32r.h (BRANCH_COST): Update.
	* config/i386/i386.h (BRANCH_COST): Update.
	* config/i386/i386.c (ix86_expand_int_movcc): Update use of
	BRANCH_COST.
	* config/sh/sh.h (BRANCH_COST): Update.
	* config/pdp11/pdp11.h (BRANCH_COST): Update.
	* config/avr/avr.h (BRANCH_COST): Update.
	* config/crx/crx.h (BRANCH_COST): Update.
	* config/xtensa/xtensa.h (BRANCH_COST): Update.
	* config/stormy16/stormy16.h (BRANCH_COST): Update.
	* config/m68hc11/m68hc11.h (BRANCH_COST): Update.
	* config/iq2000/iq2000.h (BRANCH_COST): Update.
	* config/ia64/ia64.h (BRANCH_COST): Update.
	* config/rs6000/rs6000.h (BRANCH_COST): Update.
	* config/arc/arc.h (BRANCH_COST): Update.
	* config/score/score.h (BRANCH_COST): Update.
	* config/arm/arm.h (BRANCH_COST): Update.
	* config/pa/pa.h (BRANCH_COST): Update.
	* config/mips/mips.h (BRANCH_COST): Update.
	* config/vax/vax.h (BRANCH_COST): Update.
	* config/h8300/h8300.h (BRANCH_COST): Update.
	* params.def (PARAM_PREDICTABLE_BRANCH_OUTCOME): New.
	* doc/invoke.texi (predictable-branch-outcome): Document.
	* doc/tm.texi (BRANCH_COST): Update.

Index: optabs.c =================================================================== *** optabs.c (revision 132800) --- optabs.c (working copy) *************** expand_abs_nojump (enum machine_mode mod *** 3425,3431 **** value of X as (((signed) x >> (W-1)) ^ x) - ((signed) x >> (W-1)), where W is the width of MODE. */ ! if (GET_MODE_CLASS (mode) == MODE_INT && BRANCH_COST >= 2) { rtx extended = expand_shift (RSHIFT_EXPR, mode, op0, size_int (GET_MODE_BITSIZE (mode) - 1), --- 3425,3433 ---- value of X as (((signed) x >> (W-1)) ^ x) - ((signed) x >> (W-1)), where W is the width of MODE. */ ! if (GET_MODE_CLASS (mode) == MODE_INT ! && BRANCH_COST (cfun->function_frequency >= FUNCTION_FREQUENCY_NORMAL, ! false) >= 2) { rtx extended = expand_shift (RSHIFT_EXPR, mode, op0, size_int (GET_MODE_BITSIZE (mode) - 1), Index: fold-const.c =================================================================== *** fold-const.c (revision 132800) --- fold-const.c (working copy) *************** fold_cond_expr_with_comparison (tree typ *** 5317,5323 **** #ifndef LOGICAL_OP_NON_SHORT_CIRCUIT ! #define LOGICAL_OP_NON_SHORT_CIRCUIT (BRANCH_COST >= 2) #endif /* EXP is some logical combination of boolean tests. See if we can --- 5317,5325 ---- #ifndef LOGICAL_OP_NON_SHORT_CIRCUIT ! #define LOGICAL_OP_NON_SHORT_CIRCUIT \ ! (BRANCH_COST (!cfun || cfun->function_frequency >= FUNCTION_FREQUENCY_NORMAL, \ ! false) >= 2) #endif /* EXP is some logical combination of boolean tests. See if we can *************** fold_truthop (enum tree_code code, tree *** 5565,5571 **** that can be merged. Avoid doing this if the RHS is a floating-point comparison since those can trap. */ ! if (BRANCH_COST >= 2 && ! 
FLOAT_TYPE_P (TREE_TYPE (rl_arg)) && simple_operand_p (rl_arg) && simple_operand_p (rr_arg)) --- 5567,5574 ---- that can be merged. Avoid doing this if the RHS is a floating-point comparison since those can trap. */ ! if (BRANCH_COST (!cfun || cfun->function_frequency >= FUNCTION_FREQUENCY_NORMAL, ! false) >= 2 && ! FLOAT_TYPE_P (TREE_TYPE (rl_arg)) && simple_operand_p (rl_arg) && simple_operand_p (rr_arg)) Index: dojump.c =================================================================== *** dojump.c (revision 132800) --- dojump.c (working copy) *************** do_jump (tree exp, rtx if_false_label, r *** 515,521 **** /* High branch cost, expand as the bitwise AND of the conditions. Do the same if the RHS has side effects, because we're effectively turning a TRUTH_AND_EXPR into a TRUTH_ANDIF_EXPR. */ ! if (BRANCH_COST >= 4 || TREE_SIDE_EFFECTS (TREE_OPERAND (exp, 1))) goto normal; if (if_false_label == NULL_RTX) --- 515,523 ---- /* High branch cost, expand as the bitwise AND of the conditions. Do the same if the RHS has side effects, because we're effectively turning a TRUTH_AND_EXPR into a TRUTH_ANDIF_EXPR. */ ! if (BRANCH_COST (cfun->function_frequency > FUNCTION_FREQUENCY_NORMAL, ! false) >= 4 ! || TREE_SIDE_EFFECTS (TREE_OPERAND (exp, 1))) goto normal; if (if_false_label == NULL_RTX) *************** do_jump (tree exp, rtx if_false_label, r *** 535,541 **** /* High branch cost, expand as the bitwise OR of the conditions. Do the same if the RHS has side effects, because we're effectively turning a TRUTH_OR_EXPR into a TRUTH_ORIF_EXPR. */ ! if (BRANCH_COST >= 4 || TREE_SIDE_EFFECTS (TREE_OPERAND (exp, 1))) goto normal; if (if_true_label == NULL_RTX) --- 537,544 ---- /* High branch cost, expand as the bitwise OR of the conditions. Do the same if the RHS has side effects, because we're effectively turning a TRUTH_OR_EXPR into a TRUTH_ORIF_EXPR. */ ! if (BRANCH_COST (!optimize_size, false)>= 4 ! 
|| TREE_SIDE_EFFECTS (TREE_OPERAND (exp, 1))) goto normal; if (if_true_label == NULL_RTX) Index: ifcvt.c =================================================================== *** ifcvt.c (revision 132800) --- ifcvt.c (working copy) *************** *** 67,73 **** #endif #ifndef MAX_CONDITIONAL_EXECUTE ! #define MAX_CONDITIONAL_EXECUTE (BRANCH_COST + 1) #endif #define IFCVT_MULTIPLE_DUMPS 1 --- 67,75 ---- #endif #ifndef MAX_CONDITIONAL_EXECUTE ! #define MAX_CONDITIONAL_EXECUTE \ ! (BRANCH_COST (cfun->function_frequency >= FUNCTION_FREQUENCY_NORMAL, false) \ ! + 1) #endif #define IFCVT_MULTIPLE_DUMPS 1 *************** struct noce_if_info *** 626,631 **** --- 628,636 ---- from TEST_BB. For the noce transformations, we allow the symmetric form as well. */ bool then_else_reversed; + + /* Estimated cost of the particular branch instruction. */ + int branch_cost; }; static rtx noce_emit_store_flag (struct noce_if_info *, rtx, int, int); *************** noce_try_store_flag_constants (struct no *** 963,982 **** normalize = 0; else if (ifalse == 0 && exact_log2 (itrue) >= 0 && (STORE_FLAG_VALUE == 1 ! || BRANCH_COST >= 2)) normalize = 1; else if (itrue == 0 && exact_log2 (ifalse) >= 0 && can_reverse ! && (STORE_FLAG_VALUE == 1 || BRANCH_COST >= 2)) normalize = 1, reversep = 1; else if (itrue == -1 && (STORE_FLAG_VALUE == -1 ! || BRANCH_COST >= 2)) normalize = -1; else if (ifalse == -1 && can_reverse ! && (STORE_FLAG_VALUE == -1 || BRANCH_COST >= 2)) normalize = -1, reversep = 1; ! else if ((BRANCH_COST >= 2 && STORE_FLAG_VALUE == -1) ! || BRANCH_COST >= 3) normalize = -1; else return FALSE; --- 968,987 ---- normalize = 0; else if (ifalse == 0 && exact_log2 (itrue) >= 0 && (STORE_FLAG_VALUE == 1 ! || if_info->branch_cost >= 2)) normalize = 1; else if (itrue == 0 && exact_log2 (ifalse) >= 0 && can_reverse ! && (STORE_FLAG_VALUE == 1 || if_info->branch_cost >= 2)) normalize = 1, reversep = 1; else if (itrue == -1 && (STORE_FLAG_VALUE == -1 ! 
|| if_info->branch_cost >= 2)) normalize = -1; else if (ifalse == -1 && can_reverse ! && (STORE_FLAG_VALUE == -1 || if_info->branch_cost >= 2)) normalize = -1, reversep = 1; ! else if ((if_info->branch_cost >= 2 && STORE_FLAG_VALUE == -1) ! || if_info->branch_cost >= 3) normalize = -1; else return FALSE; *************** noce_try_addcc (struct noce_if_info *if_ *** 1107,1113 **** /* If that fails, construct conditional increment or decrement using setcc. */ ! if (BRANCH_COST >= 2 && (XEXP (if_info->a, 1) == const1_rtx || XEXP (if_info->a, 1) == constm1_rtx)) { --- 1112,1118 ---- /* If that fails, construct conditional increment or decrement using setcc. */ ! if (if_info->branch_cost >= 2 && (XEXP (if_info->a, 1) == const1_rtx || XEXP (if_info->a, 1) == constm1_rtx)) { *************** noce_try_store_flag_mask (struct noce_if *** 1158,1164 **** int reversep; reversep = 0; ! if ((BRANCH_COST >= 2 || STORE_FLAG_VALUE == -1) && ((if_info->a == const0_rtx && rtx_equal_p (if_info->b, if_info->x)) --- 1163,1169 ---- int reversep; reversep = 0; ! if ((if_info->branch_cost >= 2 || STORE_FLAG_VALUE == -1) && ((if_info->a == const0_rtx && rtx_equal_p (if_info->b, if_info->x)) *************** noce_try_cmove_arith (struct noce_if_inf *** 1317,1323 **** /* ??? FIXME: Magic number 5. */ if (cse_not_expected && MEM_P (a) && MEM_P (b) ! && BRANCH_COST >= 5) { a = XEXP (a, 0); b = XEXP (b, 0); --- 1322,1328 ---- /* ??? FIXME: Magic number 5. */ if (cse_not_expected && MEM_P (a) && MEM_P (b) ! && if_info->branch_cost >= 5) { a = XEXP (a, 0); b = XEXP (b, 0); *************** noce_try_cmove_arith (struct noce_if_inf *** 1347,1353 **** if (insn_a) { insn_cost = insn_rtx_cost (PATTERN (insn_a)); ! if (insn_cost == 0 || insn_cost > COSTS_N_INSNS (BRANCH_COST)) return FALSE; } else --- 1352,1358 ---- if (insn_a) { insn_cost = insn_rtx_cost (PATTERN (insn_a)); ! 
if (insn_cost == 0 || insn_cost > COSTS_N_INSNS (if_info->branch_cost)) return FALSE; } else *************** noce_try_cmove_arith (struct noce_if_inf *** 1356,1362 **** if (insn_b) { insn_cost += insn_rtx_cost (PATTERN (insn_b)); ! if (insn_cost == 0 || insn_cost > COSTS_N_INSNS (BRANCH_COST)) return FALSE; } --- 1361,1367 ---- if (insn_b) { insn_cost += insn_rtx_cost (PATTERN (insn_b)); ! if (insn_cost == 0 || insn_cost > COSTS_N_INSNS (if_info->branch_cost)) return FALSE; } *************** noce_find_if_block (basic_block test_bb, *** 2803,2808 **** --- 2808,2815 ---- if_info.cond_earliest = cond_earliest; if_info.jump = jump; if_info.then_else_reversed = then_else_reversed; + if_info.branch_cost = BRANCH_COST (maybe_hot_bb_p (test_bb), + predictable_edge_p (then_edge)); /* Do the real work. */ *************** find_if_case_1 (basic_block test_bb, edg *** 3569,3575 **** test_bb->index, then_bb->index); /* THEN is small. */ ! if (! cheap_bb_rtx_cost_p (then_bb, COSTS_N_INSNS (BRANCH_COST))) return FALSE; /* Registers set are dead, or are predicable. */ --- 3576,3584 ---- test_bb->index, then_bb->index); /* THEN is small. */ ! if (! cheap_bb_rtx_cost_p (then_bb, ! COSTS_N_INSNS (BRANCH_COST (maybe_hot_bb_p (then_edge->src), ! predictable_edge_p (then_edge))))) return FALSE; /* Registers set are dead, or are predicable. */ *************** find_if_case_2 (basic_block test_bb, edg *** 3683,3689 **** test_bb->index, else_bb->index); /* ELSE is small. */ ! if (! cheap_bb_rtx_cost_p (else_bb, COSTS_N_INSNS (BRANCH_COST))) return FALSE; /* Registers set are dead, or are predicable. */ --- 3692,3700 ---- test_bb->index, else_bb->index); /* ELSE is small. */ ! if (! cheap_bb_rtx_cost_p (else_bb, ! COSTS_N_INSNS (BRANCH_COST (maybe_hot_bb_p (else_edge->src), ! predictable_edge_p (else_edge))))) return FALSE; /* Registers set are dead, or are predicable. 
*/ Index: expr.h =================================================================== *** expr.h (revision 132800) --- expr.h (working copy) *************** along with GCC; see the file COPYING3. *** 36,42 **** /* The default branch cost is 1. */ #ifndef BRANCH_COST ! #define BRANCH_COST 1 #endif /* This is the 4th arg to `expand_expr'. --- 36,42 ---- /* The default branch cost is 1. */ #ifndef BRANCH_COST ! #define BRANCH_COST(hot_p, predictable_p) 1 #endif /* This is the 4th arg to `expand_expr'. Index: predict.c =================================================================== *** predict.c (revision 132800) --- predict.c (working copy) *************** gate_estimate_probability (void) *** 1915,1920 **** --- 1923,1943 ---- return flag_guess_branch_prob; } + /* Return true when edge E is likely to be well predictable by branch + predictor. */ + + bool + predictable_edge_p (edge e) + { + if (profile_status == PROFILE_ABSENT) + return false; + if ((e->probability + <= PARAM_VALUE (PARAM_PREDICTABLE_BRANCH_OUTCOME) * REG_BR_PROB_BASE / 100) + || (REG_BR_PROB_BASE - e->probability + <= PARAM_VALUE (PARAM_PREDICTABLE_BRANCH_OUTCOME) * REG_BR_PROB_BASE / 100)) + return true; + return false; + } + struct tree_opt_pass pass_profile = { "profile", /* name */ Index: expmed.c =================================================================== *** expmed.c (revision 132800) --- expmed.c (working copy) *************** expand_smod_pow2 (enum machine_mode mode *** 3560,3566 **** result = gen_reg_rtx (mode); /* Avoid conditional branches when they're expensive. */ ! if (BRANCH_COST >= 2 && !optimize_size) { rtx signmask = emit_store_flag (result, LT, op0, const0_rtx, --- 3560,3567 ---- result = gen_reg_rtx (mode); /* Avoid conditional branches when they're expensive. */ ! if (BRANCH_COST (cfun->function_frequency >= FUNCTION_FREQUENCY_NORMAL, ! 
false) >= 2 && !optimize_size) { rtx signmask = emit_store_flag (result, LT, op0, const0_rtx, *************** expand_sdiv_pow2 (enum machine_mode mode *** 3660,3666 **** logd = floor_log2 (d); shift = build_int_cst (NULL_TREE, logd); ! if (d == 2 && BRANCH_COST >= 1) { temp = gen_reg_rtx (mode); temp = emit_store_flag (temp, LT, op0, const0_rtx, mode, 0, 1); --- 3661,3669 ---- logd = floor_log2 (d); shift = build_int_cst (NULL_TREE, logd); ! if (d == 2 ! && BRANCH_COST (cfun->function_frequency >= FUNCTION_FREQUENCY_NORMAL, ! false) >= 1) { temp = gen_reg_rtx (mode); temp = emit_store_flag (temp, LT, op0, const0_rtx, mode, 0, 1); *************** expand_sdiv_pow2 (enum machine_mode mode *** 3670,3676 **** } #ifdef HAVE_conditional_move ! if (BRANCH_COST >= 2) { rtx temp2; --- 3673,3680 ---- } #ifdef HAVE_conditional_move ! if (BRANCH_COST (cfun->function_frequency >= FUNCTION_FREQUENCY_NORMAL, false) ! >= 2) { rtx temp2; *************** expand_sdiv_pow2 (enum machine_mode mode *** 3699,3705 **** } #endif ! if (BRANCH_COST >= 2) { int ushift = GET_MODE_BITSIZE (mode) - logd; --- 3703,3710 ---- } #endif ! if (BRANCH_COST (cfun->function_frequency >= FUNCTION_FREQUENCY_NORMAL, ! false) >= 2) { int ushift = GET_MODE_BITSIZE (mode) - logd; *************** emit_store_flag (rtx target, enum rtx_co *** 5413,5419 **** comparison with zero. Don't do any of these cases if branches are very cheap. */ ! if (BRANCH_COST > 0 && GET_MODE_CLASS (mode) == MODE_INT && (code == EQ || code == NE) && op1 != const0_rtx) { --- 5418,5425 ---- comparison with zero. Don't do any of these cases if branches are very cheap. */ ! if (BRANCH_COST (cfun->function_frequency >= FUNCTION_FREQUENCY_NORMAL, ! false) > 0 && GET_MODE_CLASS (mode) == MODE_INT && (code == EQ || code == NE) && op1 != const0_rtx) { *************** emit_store_flag (rtx target, enum rtx_co *** 5436,5445 **** do LE and GT if branches are expensive since they are expensive on 2-operand machines. */ ! 
if (BRANCH_COST == 0 || GET_MODE_CLASS (mode) != MODE_INT || op1 != const0_rtx || (code != EQ && code != NE ! && (BRANCH_COST <= 1 || (code != LE && code != GT)))) return 0; /* See what we need to return. We can only return a 1, -1, or the --- 5442,5454 ---- do LE and GT if branches are expensive since they are expensive on 2-operand machines. */ ! if (BRANCH_COST (cfun->function_frequency >= FUNCTION_FREQUENCY_NORMAL, ! false) == 0 || GET_MODE_CLASS (mode) != MODE_INT || op1 != const0_rtx || (code != EQ && code != NE ! && (BRANCH_COST (cfun->function_frequency ! >= FUNCTION_FREQUENCY_NORMAL, ! false) <= 1 || (code != LE && code != GT)))) return 0; /* See what we need to return. We can only return a 1, -1, or the *************** emit_store_flag (rtx target, enum rtx_co *** 5535,5541 **** that "or", which is an extra insn, so we only handle EQ if branches are expensive. */ ! if (tem == 0 && (code == NE || BRANCH_COST > 1)) { if (rtx_equal_p (subtarget, op0)) subtarget = 0; --- 5544,5554 ---- that "or", which is an extra insn, so we only handle EQ if branches are expensive. */ ! if (tem == 0 ! && (code == NE ! || BRANCH_COST (cfun->function_frequency ! >= FUNCTION_FREQUENCY_NORMAL, ! 
false) > 1)) { if (rtx_equal_p (subtarget, op0)) subtarget = 0; Index: basic-block.h =================================================================== *** basic-block.h (revision 132800) --- basic-block.h (working copy) *************** extern void guess_outgoing_edge_probabil *** 839,844 **** --- 839,845 ---- extern void remove_predictions_associated_with_edge (edge); extern bool edge_probability_reliable_p (const_edge); extern bool br_prob_note_reliable_p (const_rtx); + extern bool predictable_edge_p (edge); /* In cfg.c */ extern void dump_regset (regset, FILE *); Index: config/alpha/alpha.h =================================================================== *** config/alpha/alpha.h (revision 132800) --- config/alpha/alpha.h (working copy) *************** extern int alpha_memory_latency; *** 631,637 **** #define MEMORY_MOVE_COST(MODE,CLASS,IN) (2*alpha_memory_latency) /* Provide the cost of a branch. Exact meaning under development. */ ! #define BRANCH_COST 5 /* Stack layout; function entry, exit and calling. */ --- 631,637 ---- #define MEMORY_MOVE_COST(MODE,CLASS,IN) (2*alpha_memory_latency) /* Provide the cost of a branch. Exact meaning under development. */ ! #define BRANCH_COST(hot_p, predictable_p) 5 /* Stack layout; function entry, exit and calling. */ Index: config/frv/frv.h =================================================================== *** config/frv/frv.h (revision 132800) --- config/frv/frv.h (working copy) *************** do { \ *** 2193,2199 **** /* A C expression for the cost of a branch instruction. A value of 1 is the default; other values are interpreted relative to that. */ ! #define BRANCH_COST frv_branch_cost_int /* Define this macro as a C expression which is nonzero if accessing less than a word of memory (i.e. a `char' or a `short') is no faster than accessing a --- 2193,2199 ---- /* A C expression for the cost of a branch instruction. A value of 1 is the default; other values are interpreted relative to that. */ ! 
#define BRANCH_COST(hot_p, predictable_p) frv_branch_cost_int /* Define this macro as a C expression which is nonzero if accessing less than a word of memory (i.e. a `char' or a `short') is no faster than accessing a Index: config/s390/s390.h =================================================================== *** config/s390/s390.h (revision 132800) --- config/s390/s390.h (working copy) *************** extern struct rtx_def *s390_compare_op0, *** 780,786 **** /* A C expression for the cost of a branch instruction. A value of 1 is the default; other values are interpreted relative to that. */ ! #define BRANCH_COST 1 /* Nonzero if access to memory by bytes is slow and undesirable. */ #define SLOW_BYTE_ACCESS 1 --- 780,786 ---- /* A C expression for the cost of a branch instruction. A value of 1 is the default; other values are interpreted relative to that. */ ! #define BRANCH_COST(hot_p, predictable_p) 1 /* Nonzero if access to memory by bytes is slow and undesirable. */ #define SLOW_BYTE_ACCESS 1 Index: config/spu/spu.h =================================================================== *** config/spu/spu.h (revision 132800) --- config/spu/spu.h (working copy) *************** targetm.resolve_overloaded_builtin = spu *** 456,462 **** /* Costs */ ! #define BRANCH_COST spu_branch_cost #define SLOW_BYTE_ACCESS 0 --- 456,462 ---- /* Costs */ ! #define BRANCH_COST(hot_p, predictable_p) spu_branch_cost #define SLOW_BYTE_ACCESS 0 Index: config/sparc/sparc.h =================================================================== *** config/sparc/sparc.h (revision 132800) --- config/sparc/sparc.h (working copy) *************** do { *** 2180,2186 **** On Niagara-2, a not-taken branch costs 1 cycle whereas a taken branch costs 6 cycles. */ ! #define BRANCH_COST \ ((sparc_cpu == PROCESSOR_V9 \ || sparc_cpu == PROCESSOR_ULTRASPARC) \ ? 7 \ --- 2180,2186 ---- On Niagara-2, a not-taken branch costs 1 cycle whereas a taken branch costs 6 cycles. */ ! 
#define BRANCH_COST(hot_p, predictable_p) \ ((sparc_cpu == PROCESSOR_V9 \ || sparc_cpu == PROCESSOR_ULTRASPARC) \ ? 7 \ Index: config/m32r/m32r.h =================================================================== *** config/m32r/m32r.h (revision 132800) --- config/m32r/m32r.h (working copy) *************** L2: .word STATIC *** 1219,1225 **** /* A value of 2 here causes GCC to avoid using branches in comparisons like while (a < N && a). Branches aren't that expensive on the M32R so we define this as 1. Defining it as 2 had a heavy hit in fp-bit.c. */ ! #define BRANCH_COST ((TARGET_BRANCH_COST) ? 2 : 1) /* Nonzero if access to memory by bytes is slow and undesirable. For RISC chips, it means that access to memory by bytes is no --- 1219,1225 ---- /* A value of 2 here causes GCC to avoid using branches in comparisons like while (a < N && a). Branches aren't that expensive on the M32R so we define this as 1. Defining it as 2 had a heavy hit in fp-bit.c. */ ! #define BRANCH_COST(hot_p, predictable_p) ((TARGET_BRANCH_COST) ? 2 : 1) /* Nonzero if access to memory by bytes is slow and undesirable. For RISC chips, it means that access to memory by bytes is no Index: config/i386/i386.h =================================================================== *** config/i386/i386.h (revision 132800) --- config/i386/i386.h (working copy) *************** do { \ *** 2052,2058 **** /* A C expression for the cost of a branch instruction. A value of 1 is the default; other values are interpreted relative to that. */ ! #define BRANCH_COST ix86_branch_cost /* Define this macro as a C expression which is nonzero if accessing less than a word of memory (i.e. a `char' or a `short') is no --- 2052,2059 ---- /* A C expression for the cost of a branch instruction. A value of 1 is the default; other values are interpreted relative to that. */ ! #define BRANCH_COST(hot_p, predictable_p) \ ! (!(hot_p) ? 2 : (predictable_p) ? 
0 : ix86_branch_cost) /* Define this macro as a C expression which is nonzero if accessing less than a word of memory (i.e. a `char' or a `short') is no Index: config/i386/i386.c =================================================================== *** config/i386/i386.c (revision 132800) --- config/i386/i386.c (working copy) *************** ix86_expand_int_movcc (rtx operands[]) *** 12819,12825 **** */ if ((!TARGET_CMOVE || (mode == QImode && TARGET_PARTIAL_REG_STALL)) ! && BRANCH_COST >= 2) { if (cf == 0) { --- 12819,12826 ---- */ if ((!TARGET_CMOVE || (mode == QImode && TARGET_PARTIAL_REG_STALL)) ! && BRANCH_COST (cfun->function_frequency >= FUNCTION_FREQUENCY_NORMAL, ! false) >= 2) { if (cf == 0) { *************** ix86_expand_int_movcc (rtx operands[]) *** 12904,12910 **** optab op; rtx var, orig_out, out, tmp; ! if (BRANCH_COST <= 2) return 0; /* FAIL */ /* If one of the two operands is an interesting constant, load a --- 12905,12912 ---- optab op; rtx var, orig_out, out, tmp; ! if (BRANCH_COST (cfun->function_frequency >= FUNCTION_FREQUENCY_NORMAL, ! false) <= 2) return 0; /* FAIL */ /* If one of the two operands is an interesting constant, load a Index: config/sh/sh.h =================================================================== *** config/sh/sh.h (revision 132800) --- config/sh/sh.h (working copy) *************** struct sh_args { *** 2822,2828 **** The SH1 does not have delay slots, hence we get a pipeline stall at every branch. The SH4 is superscalar, so the single delay slot is not sufficient to keep both pipelines filled. */ ! #define BRANCH_COST (TARGET_SH5 ? 1 : ! TARGET_SH2 || TARGET_HARD_SH4 ? 2 : 1) /* Assembler output control. */ --- 2822,2829 ---- The SH1 does not have delay slots, hence we get a pipeline stall at every branch. The SH4 is superscalar, so the single delay slot is not sufficient to keep both pipelines filled. */ ! #define BRANCH_COST(hot_p, predictable_p) \ ! (TARGET_SH5 ? 1 : ! TARGET_SH2 || TARGET_HARD_SH4 ? 
2 : 1) /* Assembler output control. */ Index: config/pdp11/pdp11.h =================================================================== *** config/pdp11/pdp11.h (revision 132800) --- config/pdp11/pdp11.h (working copy) *************** JMP FUNCTION 0x0058 0x0000 <- FUNCTION *** 1059,1065 **** /* there is no point in avoiding branches on a pdp, since branches are really cheap - I just want to find out how much difference the BRANCH_COST macro makes in code */ ! #define BRANCH_COST (TARGET_BRANCH_CHEAP ? 0 : 1) #define COMPARE_FLAG_MODE HImode --- 1059,1065 ---- /* there is no point in avoiding branches on a pdp, since branches are really cheap - I just want to find out how much difference the BRANCH_COST macro makes in code */ ! #define BRANCH_COST(hot_p, predictable_p) (TARGET_BRANCH_CHEAP ? 0 : 1) #define COMPARE_FLAG_MODE HImode Index: config/avr/avr.h =================================================================== *** config/avr/avr.h (revision 132800) --- config/avr/avr.h (working copy) *************** do { \ *** 481,487 **** (MODE)==SImode ? 8 : \ (MODE)==SFmode ? 8 : 16) ! #define BRANCH_COST 0 #define SLOW_BYTE_ACCESS 0 --- 481,487 ---- (MODE)==SImode ? 8 : \ (MODE)==SFmode ? 8 : 16) ! #define BRANCH_COST(hot_p, predictable_p) 0 #define SLOW_BYTE_ACCESS 0 Index: config/crx/crx.h =================================================================== *** config/crx/crx.h (revision 132800) --- config/crx/crx.h (working copy) *************** struct cumulative_args *** 420,426 **** /* Moving to processor register flushes pipeline - thus asymmetric */ #define REGISTER_MOVE_COST(MODE, FROM, TO) ((TO != GENERAL_REGS) ? 8 : 2) /* Assume best case (branch predicted) */ ! #define BRANCH_COST 2 #define SLOW_BYTE_ACCESS 1 --- 420,426 ---- /* Moving to processor register flushes pipeline - thus asymmetric */ #define REGISTER_MOVE_COST(MODE, FROM, TO) ((TO != GENERAL_REGS) ? 8 : 2) /* Assume best case (branch predicted) */ ! 
! #define BRANCH_COST(hot_p, predictable_p) 2

  #define SLOW_BYTE_ACCESS 1
Index: config/xtensa/xtensa.h
===================================================================
*** config/xtensa/xtensa.h	(revision 132800)
--- config/xtensa/xtensa.h	(working copy)
*************** typedef struct xtensa_args
*** 898,904 ****

  #define MEMORY_MOVE_COST(MODE, CLASS, IN) 4

! #define BRANCH_COST 3

  /* How to refer to registers in assembler output.
     This sequence is indexed by compiler's hard-register-number (see above).  */
--- 898,904 ----

  #define MEMORY_MOVE_COST(MODE, CLASS, IN) 4

! #define BRANCH_COST(hot_p, predictable_p) 3

  /* How to refer to registers in assembler output.
     This sequence is indexed by compiler's hard-register-number (see above).  */
Index: config/stormy16/stormy16.h
===================================================================
*** config/stormy16/stormy16.h	(revision 132800)
--- config/stormy16/stormy16.h	(working copy)
*************** do { \
*** 582,588 ****

  #define MEMORY_MOVE_COST(M,C,I) (5 + memory_move_secondary_cost (M, C, I))

! #define BRANCH_COST 5

  #define SLOW_BYTE_ACCESS 0
--- 582,588 ----

  #define MEMORY_MOVE_COST(M,C,I) (5 + memory_move_secondary_cost (M, C, I))

! #define BRANCH_COST(hot_p, predictable_p) 5

  #define SLOW_BYTE_ACCESS 0
Index: config/m68hc11/m68hc11.h
===================================================================
*** config/m68hc11/m68hc11.h	(revision 132800)
--- config/m68hc11/m68hc11.h	(working copy)
*************** extern unsigned char m68hc11_reg_valid_f
*** 1266,1272 ****
     Pretend branches are cheap because GCC generates sub-optimal code
     for the default value.  */
! #define BRANCH_COST 0

  /* Nonzero if access to memory by bytes is slow and undesirable.  */
  #define SLOW_BYTE_ACCESS 0
--- 1266,1272 ----
     Pretend branches are cheap because GCC generates sub-optimal code
     for the default value.  */
! #define BRANCH_COST(hot_p, predictable_p) 0

  /* Nonzero if access to memory by bytes is slow and undesirable.  */
  #define SLOW_BYTE_ACCESS 0
Index: config/iq2000/iq2000.h
===================================================================
*** config/iq2000/iq2000.h	(revision 132800)
--- config/iq2000/iq2000.h	(working copy)
*************** typedef struct iq2000_args
*** 620,626 ****
  #define MEMORY_MOVE_COST(MODE,CLASS,TO_P)	\
    (TO_P ? 2 : 16)

! #define BRANCH_COST 2

  #define SLOW_BYTE_ACCESS 1
--- 620,626 ----
  #define MEMORY_MOVE_COST(MODE,CLASS,TO_P)	\
    (TO_P ? 2 : 16)

! #define BRANCH_COST(hot_p, predictable_p) 2

  #define SLOW_BYTE_ACCESS 1
Index: config/ia64/ia64.h
===================================================================
*** config/ia64/ia64.h	(revision 132800)
--- config/ia64/ia64.h	(working copy)
*************** do { \
*** 1371,1377 ****
     many additional insn groups we run into, vs how good the dynamic
     branch predictor is.  */

! #define BRANCH_COST 6

  /* Define this macro as a C expression which is nonzero if accessing less than
     a word of memory (i.e. a `char' or a `short') is no faster than accessing a
--- 1371,1377 ----
     many additional insn groups we run into, vs how good the dynamic
     branch predictor is.  */

! #define BRANCH_COST(hot_p, predictable_p) 6

  /* Define this macro as a C expression which is nonzero if accessing less than
     a word of memory (i.e. a `char' or a `short') is no faster than accessing a
Index: config/rs6000/rs6000.h
===================================================================
*** config/rs6000/rs6000.h	(revision 132800)
--- config/rs6000/rs6000.h	(working copy)
*************** extern enum rs6000_nop_insertion rs6000_
*** 950,956 ****
     Set this to 3 on the RS/6000 since that is roughly the average cost of an
     unscheduled conditional branch.  */

! #define BRANCH_COST 3

  /* Override BRANCH_COST heuristic which empirically produces worse
     performance for removing short circuiting from the logical ops.  */
--- 950,956 ----
     Set this to 3 on the RS/6000 since that is roughly the average cost of an
     unscheduled conditional branch.  */

! #define BRANCH_COST(hot_p, predictable_p) 3

  /* Override BRANCH_COST heuristic which empirically produces worse
     performance for removing short circuiting from the logical ops.  */
Index: config/arc/arc.h
===================================================================
*** config/arc/arc.h	(revision 132800)
--- config/arc/arc.h	(working copy)
*************** arc_select_cc_mode (OP, X, Y)
*** 824,830 ****
  /* The cost of a branch insn.  */
  /* ??? What's the right value here?  Branches are certainly more
     expensive than reg->reg moves.  */
! #define BRANCH_COST 2

  /* Nonzero if access to memory by bytes is slow and undesirable.
     For RISC chips, it means that access to memory by bytes is no
--- 824,830 ----
  /* The cost of a branch insn.  */
  /* ??? What's the right value here?  Branches are certainly more
     expensive than reg->reg moves.  */
! #define BRANCH_COST(hot_p, predictable_p) 2

  /* Nonzero if access to memory by bytes is slow and undesirable.
     For RISC chips, it means that access to memory by bytes is no
Index: config/score/score.h
===================================================================
*** config/score/score.h	(revision 132800)
--- config/score/score.h	(working copy)
*************** typedef struct score_args
*** 795,801 ****
    (4 + memory_move_secondary_cost ((MODE), (CLASS), (TO_P)))

  /* Try to generate sequences that don't involve branches.  */
! #define BRANCH_COST 2

  /* Nonzero if access to memory by bytes is slow and undesirable.  */
  #define SLOW_BYTE_ACCESS 1
--- 795,801 ----
    (4 + memory_move_secondary_cost ((MODE), (CLASS), (TO_P)))

  /* Try to generate sequences that don't involve branches.  */
! #define BRANCH_COST(hot_p, predictable_p) 2

  /* Nonzero if access to memory by bytes is slow and undesirable.  */
  #define SLOW_BYTE_ACCESS 1
Index: config/arm/arm.h
===================================================================
*** config/arm/arm.h	(revision 132800)
--- config/arm/arm.h	(working copy)
*************** do { \
*** 2271,2277 ****

  /* Try to generate sequences that don't involve branches, we can then use
     conditional instructions */
! #define BRANCH_COST \
    (TARGET_32BIT ? 4 : (optimize > 0 ? 2 : 0))

  /* Position Independent Code.  */
--- 2271,2277 ----

  /* Try to generate sequences that don't involve branches, we can then use
     conditional instructions */
! #define BRANCH_COST(hot_p, predictable_p) \
    (TARGET_32BIT ? 4 : (optimize > 0 ? 2 : 0))

  /* Position Independent Code.  */
Index: config/pa/pa.h
===================================================================
*** config/pa/pa.h	(revision 132800)
--- config/pa/pa.h	(working copy)
*************** do { \
*** 1569,1575 ****
     : 2)

  /* Adjust the cost of branches.  */
! #define BRANCH_COST (pa_cpu == PROCESSOR_8000 ? 2 : 1)

  /* Handling the special cases is going to get too complicated for a macro,
     just call `pa_adjust_insn_length' to do the real work.  */
--- 1569,1575 ----
     : 2)

  /* Adjust the cost of branches.  */
! #define BRANCH_COST(hot_p, predictable_p) (pa_cpu == PROCESSOR_8000 ? 2 : 1)

  /* Handling the special cases is going to get too complicated for a macro,
     just call `pa_adjust_insn_length' to do the real work.  */
Index: config/mips/mips.h
===================================================================
*** config/mips/mips.h	(revision 132800)
--- config/mips/mips.h	(working copy)
*************** typedef struct mips_args {
*** 2415,2421 ****
  /* A C expression for the cost of a branch instruction.  A value of
     1 is the default; other values are interpreted relative to that.  */

! #define BRANCH_COST mips_branch_cost
  #define LOGICAL_OP_NON_SHORT_CIRCUIT 0

  /* If defined, modifies the length assigned to instruction INSN as a
--- 2415,2421 ----
  /* A C expression for the cost of a branch instruction.  A value of
     1 is the default; other values are interpreted relative to that.  */

! #define BRANCH_COST(hot_p, predictable_p) mips_branch_cost
  #define LOGICAL_OP_NON_SHORT_CIRCUIT 0

  /* If defined, modifies the length assigned to instruction INSN as a
Index: config/vax/vax.h
===================================================================
*** config/vax/vax.h	(revision 132800)
--- config/vax/vax.h	(working copy)
*************** enum reg_class { NO_REGS, ALL_REGS, LIM_
*** 652,658 ****
     Branches are extremely cheap on the VAX while the shift insns often
     used to replace branches can be expensive.  */

! #define BRANCH_COST 0

  /* Tell final.c how to eliminate redundant test instructions.  */
--- 652,658 ----
     Branches are extremely cheap on the VAX while the shift insns often
     used to replace branches can be expensive.  */

! #define BRANCH_COST(hot_p, predictable_p) 0

  /* Tell final.c how to eliminate redundant test instructions.  */
Index: config/h8300/h8300.h
===================================================================
*** config/h8300/h8300.h	(revision 132800)
--- config/h8300/h8300.h	(working copy)
*************** struct cum_arg
*** 1004,1010 ****
  #define DELAY_SLOT_LENGTH(JUMP) \
    (NEXT_INSN (PREV_INSN (JUMP)) == JUMP ? 0 : 2)

! #define BRANCH_COST 0

  /* Tell final.c how to eliminate redundant test instructions.  */
--- 1004,1010 ----
  #define DELAY_SLOT_LENGTH(JUMP) \
    (NEXT_INSN (PREV_INSN (JUMP)) == JUMP ? 0 : 2)

! #define BRANCH_COST(hot_p, predictable_p) 0

  /* Tell final.c how to eliminate redundant test instructions.  */
Index: params.def
===================================================================
*** params.def	(revision 132800)
--- params.def	(working copy)
*************** DEFPARAM (PARAM_STRUCT_REORG_COLD_STRUCT
*** 93,98 ****
--- 93,105 ----
  	  "The threshold ratio between current and hottest structure counts",
  	  10, 0, 100)

+ /* When a branch is predicted to be taken with probability lower than this
+    threshold (in percent), then it is considered well predictable.  */
+ DEFPARAM (PARAM_PREDICTABLE_BRANCH_OUTCOME,
+ 	  "predictable-branch-outcome",
+ 	  "Maximal estimated outcome of branch considered predictable",
+ 	  2, 0, 50)
+
  /* The single function inlining limit. This is the maximum size
     of a function counted in internal gcc instructions (not in
     real machine instructions) that is eligible for inlining
Index: doc/tm.texi
===================================================================
*** doc/tm.texi	(revision 132800)
--- doc/tm.texi	(working copy)
*************** value to the result of that function.  T
*** 5828,5836 ****
  are the same as to this macro.
  @end defmac

! @defmac BRANCH_COST
! A C expression for the cost of a branch instruction.  A value of 1 is
! the default; other values are interpreted relative to that.
  @end defmac

  Here are additional macros which do not specify precise relative costs,
--- 5828,5841 ----
  are the same as to this macro.
  @end defmac

! @defmac BRANCH_COST (@var{hot_p}, @var{predictable_p})
! A C expression for the cost of a branch instruction.  A value of 1 is the
! default; other values are interpreted relative to that.  Parameter @var{hot_p}
! is true when the branch in question might be hot in the compiled program.
! When it is false, @code{BRANCH_COST} should return a value optimal for code
! size rather than performance.  @var{predictable_p} is true for
! well-predictable branches.  On many architectures the @code{BRANCH_COST}
! can be reduced then.
  @end defmac

  Here are additional macros which do not specify precise relative costs,
Index: doc/invoke.texi
===================================================================
*** doc/invoke.texi	(revision 132800)
--- doc/invoke.texi	(working copy)
*************** to the hottest structure frequency in th
*** 6807,6812 ****
--- 6807,6816 ----
  parameter, then structure reorganization is not applied to this structure.
  The default is 10.

+ @item predictable-branch-outcome
+ When a branch is predicted to be taken with probability lower than this threshold
+ (in percent), then it is considered well predictable.  The default is 2.
+
  @item max-crossjump-edges
  The maximum number of incoming edges to consider for crossjumping.
  The algorithm used by @option{-fcrossjumping} is @math{O(N^2)} in