https://gcc.gnu.org/bugzilla/show_bug.cgi?id=78200

            Bug ID: 78200
           Summary: [7 regression]: 429.mcf of cpu2006 regresses in GCC
                    trunk for avx2 target.
           Product: gcc
           Version: tree-ssa
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: tree-optimization
          Assignee: unassigned at gcc dot gnu.org
          Reporter: venkataramanan.kumar at amd dot com
  Target Milestone: ---

Noticed a 5% regression in 429.mcf of cpu2006 on x86_64 AVX2 (bdver4) with GCC
trunk, gcc version 7.0.0 20161028 (experimental) (GCC).

Flags used: -O3 -mavx2 -mprefer-avx128

Not seen with GCC 6.1, or with GCC trunk using -O3 -mavx -mprefer-avx128.

An assembly difference is observed in the hot function primal_bea_mpp of pbeampp.c.

-O3 -mavx -mprefer-avx128               -O3 -mavx2 -mprefer-avx128

.L98:                                 |  .L98:
  ------------------------------------|          jle     .L97   <== order of comparison is different.
          cmpl    $2, %r9d            |          cmpl    $2, %r9d
          jne     .L97                |          jne     .L97
          testq   %rdi, %rdi          |  -----------------------------------
          jle     .L97                |  -----------------------------------
  .L99:                               |  .L99:
          addq    $1, %r13            |          addq    $1, %r13
          movq    %rdi, %r12          |          movq    %rdi, %r12
          movq    perm(,%r13,8), %r9  |          movq    perm(,%r13,8), %r9
          sarq    $63, %r12           |          sarq    $63, %r12
          movq    %rdi, 8(%r9)        |          movq    %rdi, 8(%r9)
+ +-- 12 lines: xorq %r12, %rdi-------|+ +-- 12 lines: xorq %r12, %rdi------
          jle     .L97                |          jle     .L97
          movq    8(%rax), %r14       |          movq    8(%rax), %r14
          movq    (%rax), %rdi        |          movq    (%rax), %rdi
          subq    (%r14), %rdi        |          subq    (%r14), %rdi
          movq    16(%rax), %r14      |          movq    16(%rax), %r14
          addq    (%r14), %rdi        |          addq    (%r14), %rdi
          jns     .L98                |          cmpq    $0, %rdi
  ------------------------------------|          jge     .L98


The GIMPLE optimized dump shows:

GCC trunk -O3 -mavx -mprefer-avx128 
;;   basic block 20, loop depth 2, count 0, freq 1067, maybe hot
;;   Invalid sum of incoming frequencies 1216, should be 1067
;;    prev block 19, next block 21, flags: (NEW, REACHABLE, VISITED)
;;    pred:       18 [64.0%]  (FALSE_VALUE,EXECUTABLE)
  # RANGE [0, 1]
  _496 = _512 == 2;
  # RANGE [0, 1]
  _495 = red_cost_503 > 0;
  # RANGE [0, 1]
  _494 = _495 & _496;
  if (_494 != 0)
    goto <bb 21>;
  else
    goto <bb 22>;


GCC trunk -O3 -mavx2 -mprefer-avx128 

;;   basic block 20, loop depth 2, count 0, freq 1067, maybe hot
;;   Invalid sum of incoming frequencies 1216, should be 1067
;;    prev block 19, next block 21, flags: (NEW, REACHABLE, VISITED)
;;    pred:       18 [64.0%]  (FALSE_VALUE,EXECUTABLE)
  # RANGE [0, 1]
  _496 = _512 == 2;
  # RANGE [0, 1]
  _495 = red_cost_503 > 0;  
  # RANGE [0, 1]
  _494 = _495 & _496; <== operation order is different on AVX2.
  if (_494 != 0)
    goto <bb 21>;
  else
    goto <bb 22>;
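
For readers without the SPEC sources at hand, the block above can be read as a
bitwise AND of two comparison results. The sketch below is only an illustration
of that shape: the function names, parameter names and test values are
hypothetical and are not taken from pbeampp.c. It shows that the two operand
orders of "&" compute the same value, so any score difference has to come from
the code generated for each order, not from the result.

```c
/* Hypothetical stand-ins for the values in the GIMPLE dump:
   ident plays the role of _512, red_cost the role of red_cost_503. */
static int cond_ident_first(long red_cost, int ident)
{
    int is_two = (ident == 2);      /* like _496 = _512 == 2 */
    int is_pos = (red_cost > 0);    /* like _495 = red_cost_503 > 0 */
    return is_two & is_pos;         /* like _494 = _496 & _495 */
}

static int cond_cost_first(long red_cost, int ident)
{
    int is_two = (ident == 2);
    int is_pos = (red_cost > 0);
    return is_pos & is_two;         /* like _494 = _495 & _496 */
}

/* Counts inputs on which the two operand orders disagree.
   The sample inputs are arbitrary and chosen only for illustration. */
static int count_mismatches(void)
{
    static const long costs[] = { -5, 0, 7 };
    static const int idents[] = { 0, 1, 2 };
    int mismatches = 0;

    for (int i = 0; i < 3; i++)
        for (int j = 0; j < 3; j++)
            if (cond_ident_first(costs[i], idents[j])
                != cond_cost_first(costs[i], idents[j]))
                mismatches++;
    return mismatches;
}
```

Calling count_mismatches() is expected to return 0, since "&" is commutative.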

The operation order is changed in pbeampp.c.171t.reassoc2:
;;   basic block 20, loop depth 2, count 0, freq 1067, maybe hot
;;   Invalid sum of incoming frequencies 1216, should be 1067
;;    prev block 19, next block 21, flags: (NEW, REACHABLE, VISITED)
;;    pred:       18 [64.0%]  (FALSE_VALUE,EXECUTABLE)
  _496 = _512 == 2;
  _495 = red_cost_503 > 0;
  _494 = _495 & _496;
  if (_494 != 0)
    goto <bb 21>;
  else
    goto <bb 22>;

Looking further back, I found that tree if-conversion generates non-canonical
GIMPLE:
pbeampp.c.155t.ifcvt

;;   basic block 27, loop depth 2, count 0, freq 1067, maybe hot
;;   Invalid sum of incoming frequencies 1216, should be 1067
;;    prev block 26, next block 28, flags: (NEW, REACHABLE, VISITED)
;;    pred:       25 [64.0%]  (FALSE_VALUE,EXECUTABLE)
  _496 = _512 == 2;
  _495 = red_cost_503 > 0;
  _494 = _496 & _495;    <== comparison order is the same, but the LHS of "&" has the greater SSA version number.
  if (_494 != 0)
    goto <bb 28>;
  else
    goto <bb 29>;
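
To make the "non-canonical" remark concrete, here is a toy model, assuming that
the canonical form of a commutative operation places the operand with the lower
SSA version number first. That ordering is my reading of the observation above.
The struct and function names are hypothetical and this is not GCC's
implementation.

```c
#include <stdbool.h>

/* Toy model of an SSA name: only the version number matters here. */
struct ssa_name { unsigned version; };

/* A commutative binary operation such as "&" with two SSA operands. */
struct commutative_op { struct ssa_name lhs, rhs; };

/* True when the operands are in the assumed canonical order
   (lower version number first). */
static bool is_canonical(const struct commutative_op *op)
{
    return op->lhs.version <= op->rhs.version;
}

/* Swap the operands if needed so that the lower version comes first. */
static void canonicalize(struct commutative_op *op)
{
    if (!is_canonical(op)) {
        struct ssa_name tmp = op->lhs;
        op->lhs = op->rhs;
        op->rhs = tmp;
    }
}
```

Under this assumption, an operation built as _496 & _495 is reported as
non-canonical and becomes _495 & _496 after canonicalize().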


pbeampp.c.154t.ch_vect
;;   basic block 23, loop depth 2, count 0, freq 1067, maybe hot
;;   Invalid sum of incoming frequencies 1216, should be 1067
;;    prev block 22, next block 24, flags: (NEW, REACHABLE, VISITED)
;;    pred:       21 [64.0%]  (FALSE_VALUE,EXECUTABLE)
  _340 = _23 == 2;
  _341 = red_cost_86 > 0;
  _338 = _340 & _341;  <==  comparison order is same here.
  if (_338 != 0)
    goto <bb 24>;
  else
    goto <bb 25>;



Compiling pbeampp.c with -O3 -mavx2 -mprefer-avx128 -fno-tree-loop-if-convert,
and the rest of the benchmark with -O3 -mavx2 -mprefer-avx128, brings the score
back to that of -O3 -mavx or GCC 6.1 -O3 -mavx2.
