https://gcc.gnu.org/bugzilla/show_bug.cgi?id=78200
Jakub Jelinek <jakub at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |jakub at gcc dot gnu.org

--- Comment #22 from Jakub Jelinek <jakub at gcc dot gnu.org> ---
If the problem is that (In reply to Richard Biener from comment #9)
> So RTL expansion ends up in
>
>       /* If jumps are cheap and the target does not support conditional
>          compare, turn some more codes into jumpy sequences.  */
>       else if (BRANCH_COST (optimize_insn_for_speed_p (), false) < 4
>                && targetm.gen_ccmp_first == NULL)
>         {
>           if ((code2 == BIT_AND_EXPR
>                && TYPE_PRECISION (TREE_TYPE (op0)) == 1
>                && TREE_CODE (gimple_assign_rhs2 (second)) != INTEGER_CST)
>               || code2 == TRUTH_AND_EXPR)
>             {
>               code = TRUTH_ANDIF_EXPR;
>               op0 = gimple_assign_rhs1 (second);
>               op1 = gimple_assign_rhs2 (second);
>
> where we could adjust operand order based on the immediately dominating
> condition.
>
> Unfortunately sth as simple as
>
>   /* We'll expand RTL for op0 first, see if we'd better
>      expand RTL for op1 first.  */
>   if (TREE_CODE (op1) == SSA_NAME
>       && single_pred_p (bb))
>     {
>       gimple *def1 = SSA_NAME_DEF_STMT (op1);
>       if (is_gimple_assign (def1)
>           && TREE_CODE_CLASS (gimple_assign_rhs_code (def1)) == tcc_comparison)
>         {
>           basic_block pred = single_pred (bb);
>           gimple *last = last_stmt (pred);
>           if (last
>               && gimple_code (last) == GIMPLE_COND
>               && gimple_assign_rhs1 (def1) == gimple_cond_lhs (last))
>             std::swap (op0, op1);
>         }
>     }
>
> doesn't work as the predecessor is no longer in GIMPLE (we dropped the seq
> for the GIMPLE stmts and GIMPLE_CONDs have no DEFs...).  Also I'm not sure
> the half-way RTL CFG will still point to the original block.
>
> Of course the above heuristic is really only applicable if there's not much
> code expanded between this jump and the one in the predecessor.
>
> OTOH if we have the BRANCH_COST check during RTL expansion (similar to
> what we have for LOGICAL_OP_NON_SHORT_CIRCUIT in fold-const.c) then maybe
> if-combining shouldn't combine the conditionals.  There's a slight
> disconnect here, the above is BRANCH_COST < 4 while the other is
> BRANCH_COST >= 2 ...
>
> The cfgexpand code could also be done as a pre-pass on the IL turning
> the straight-line code back to CFG (I guess that's a good idea anyway).

Well, we could have some early expansion mini-pass or very late pass that
would try to detect these cases, guarded with the
  if (BRANCH_COST (optimize_bb_for_speed_p (bb), false) < 4)
condition: in particular, a bb with a single pred ending in
  if (something op const)
followed by a boolean BIT_{AND,IOR}_EXPR, where the bb contains no
(non-virtual) PHIs and only TERed stmts before the BIT_{AND,IOR}_EXPR.  If
the second of the conditions in there is also "something op const", swap the
operands of the BIT_{AND,IOR}_EXPR.  That way, at actual expansion time, when
the predecessor has already been converted into BB_RTL form, we don't need to
give up.  I guess the important question is whether that actually helps this
benchmark.
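A very rough sketch of what such a helper might look like (hypothetical code,
not an actual patch or anything from the GCC tree; the helper name is made
up, the checks mirror the #c9 snippet quoted above, and the PHI/TER
restrictions mentioned above are omitted):

/* Hypothetical sketch only (helper name made up, PHI/TER checks omitted):
   if BB's single predecessor PRED is still in GIMPLE form and ends in a
   GIMPLE_COND "something op const", and BB's own GIMPLE_COND tests a
   boolean BIT_AND_EXPR/BIT_IOR_EXPR whose second operand is defined by a
   comparison of that same "something" against a constant, swap the two
   operands so the comparison whose result can reuse the predecessor's
   compare is expanded first.  */

static void
maybe_swap_cond_operands (basic_block bb)
{
  if (!single_pred_p (bb))
    return;
  basic_block pred = single_pred (bb);
  if ((bb->flags & BB_RTL)
      || (pred->flags & BB_RTL)
      || BRANCH_COST (optimize_bb_for_speed_p (bb), false) >= 4)
    return;

  gimple *pred_cond = last_stmt (pred);
  if (!pred_cond
      || gimple_code (pred_cond) != GIMPLE_COND
      || TREE_CODE (gimple_cond_rhs (pred_cond)) != INTEGER_CST)
    return;

  gimple *cond = last_stmt (bb);
  if (!cond
      || gimple_code (cond) != GIMPLE_COND
      || TREE_CODE (gimple_cond_lhs (cond)) != SSA_NAME)
    return;

  gimple *second = SSA_NAME_DEF_STMT (gimple_cond_lhs (cond));
  if (!is_gimple_assign (second)
      || (gimple_assign_rhs_code (second) != BIT_AND_EXPR
          && gimple_assign_rhs_code (second) != BIT_IOR_EXPR)
      || TYPE_PRECISION (TREE_TYPE (gimple_assign_rhs1 (second))) != 1)
    return;

  tree op2 = gimple_assign_rhs2 (second);
  if (TREE_CODE (op2) != SSA_NAME)
    return;
  gimple *def2 = SSA_NAME_DEF_STMT (op2);
  if (is_gimple_assign (def2)
      && TREE_CODE_CLASS (gimple_assign_rhs_code (def2)) == tcc_comparison
      && gimple_assign_rhs1 (def2) == gimple_cond_lhs (pred_cond)
      && TREE_CODE (gimple_assign_rhs2 (def2)) == INTEGER_CST)
    {
      /* BIT_AND_EXPR and BIT_IOR_EXPR are commutative, so the swap is
         always valid.  */
      tree op1 = gimple_assign_rhs1 (second);
      gimple_assign_set_rhs1 (second, op2);
      gimple_assign_set_rhs2 (second, op1);
      update_stmt (second);
    }
}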
On the #c4 testcase, if I compile with -Ofast -mavx2, I get:

        .p2align 4,,10
        .p2align 3
.L10:
        cmpl    $2, %esi
        jne     .L9
        testq   %r8, %r8
        jg      .L11
        .p2align 4,,10
        .p2align 3
.L9:
        addq    %r9, %rax
        cmpq    %rax, %rdx
        jbe     .L31
.L12:
        movl    24(%rax), %esi
        testl   %esi, %esi
        jle     .L9
        movq    (%rax), %r8
        testq   %r8, %r8
        jns     .L10
        cmpl    $1, %esi
        jne     .L9
.L11:
        addq    $1, %r10
        movq    st5(,%r10,8), %rsi
        movq    %rax, (%rsi)
        addq    %r9, %rax
        movq    %r8, 8(%rsi)
        movq    st5(,%rdi,8), %rsi
        movq    %r8, 16(%rsi)
        cmpq    %rax, %rdx
        ja      .L12

while if I swap op0 and op1 in the debugger at the #c9 spot above, I get:

        .p2align 4,,10
        .p2align 3
.L10:
        jle     .L9
        cmpl    $2, %esi
        je      .L11
        .p2align 4,,10
        .p2align 3
.L9:
        addq    %r9, %rax
        cmpq    %rax, %rdx
        jbe     .L31
.L12:
        movl    24(%rax), %esi
        testl   %esi, %esi
        jle     .L9
        movq    (%rax), %r8
        testq   %r8, %r8
        jns     .L10
        cmpl    $1, %esi
        jne     .L9
.L11:
        addq    $1, %r10
        movq    st5(,%r10,8), %rsi
        movq    %rax, (%rsi)
        addq    %r9, %rax
        movq    %r8, 8(%rsi)
        movq    st5(,%rdi,8), %rsi
        movq    %r8, 16(%rsi)
        cmpq    %rax, %rdx
        ja      .L12

instead, the difference being:

 .L10:
+       jle     .L9
        cmpl    $2, %esi
-       jne     .L9
-       testq   %r8, %r8
-       jg      .L11
+       je      .L11

It wouldn't even need to be a separate walk over all bbs, just a function
that, given a bb, would do this if both that bb and its single_pred still
don't have BB_RTL set; it could be invoked both from reorder_operands and
from expand_gimple_basic_block, with EDGE_TRUE/EDGE_FALSE successors for the
successors.  The only issue is that reorder_operands might actually undo
that if the operand we swap as second seems to be more expensive.  Another
possibility is to remember the old gimple seq of the basic blocks somewhere
(array indexed by bb number) and only throw it away after all bbs are
expanded.
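For that last alternative, a minimal sketch of what remembering the GIMPLE
seqs could look like (hypothetical names, nothing like this exists in
cfgexpand.c; it also glosses over keeping the saved statements alive for the
garbage collector until expansion finishes):

/* Hypothetical sketch only: stash each block's GIMPLE sequence, indexed by
   block number, before expansion starts, and free the array only after all
   bbs have been expanded.  Assumes the statements themselves stay alive
   until then.  */
static gimple_seq *saved_gimple_seqs;

static void
save_gimple_seqs (void)
{
  saved_gimple_seqs = XCNEWVEC (gimple_seq, last_basic_block_for_fn (cfun));
  basic_block bb;
  FOR_EACH_BB_FN (bb, cfun)
    saved_gimple_seqs[bb->index] = bb_seq (bb);
}

/* Last GIMPLE stmt of BB, usable even after BB has been turned into
   BB_RTL form, by falling back to the saved sequence.  */
static gimple *
saved_last_stmt (basic_block bb)
{
  if (!(bb->flags & BB_RTL))
    return last_stmt (bb);
  gimple_seq seq = saved_gimple_seqs[bb->index];
  return seq ? gimple_seq_last_stmt (seq) : NULL;
}

The array would then only be freed (XDELETEVEC) once the whole function has
been expanded, matching the "only throw it away after all bbs are expanded"
idea above.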