[Bug target/119919] 7% exchange2 regression between g:6390fc86995fbd5239497cb9e1797a3af51d3936 and g:f72a2d221539cede358f2487b94bc370c6fc44b5

hubicka at gcc dot gnu.org via Gcc-bugs Thu, 24 Apr 2025 07:50:46 -0700

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=119919


Jan Hubicka <hubicka at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
     Ever confirmed|0                           |1
             Status|UNCONFIRMED                 |ASSIGNED
   Last reconfirmed|                            |2025-04-24
           Assignee|unassigned at gcc dot gnu.org|hubicka at gcc dot gnu.org

--- Comment #2 from Jan Hubicka <hubicka at gcc dot gnu.org> ---
This is with -O2 only. The difference is:
+++ bbb 2025-04-24 16:21:25.029155295 +0200
@@ -108,10 +108,7 @@
 exchange2.fppized.f90:1027:58: optimized: loop vectorized using 16 byte vectors
 exchange2.fppized.f90:1019:71: optimized: loop vectorized using 8 byte vectors
 exchange2.fppized.f90:1016:55: optimized: loop vectorized using 16 byte vectors
-exchange2.fppized.f90:1003:32: optimized: loop vectorized using 8 byte vectors
 exchange2.fppized.f90:1123:83: optimized: loop with 1 iterations completely unrolled (header execution count 119292720)
-exchange2.fppized.f90:1003:32: optimized: loop turned into non-loop; it never loops
-exchange2.fppized.f90:1003:32: optimized: loop turned into non-loop; it never loops
 exchange2.fppized.f90:1203:51: optimized: loop unrolled 1 times
 exchange2.fppized.f90:1194:54: optimized: loop unrolled 1 times
 exchange2.fppized.f90:1185:57: optimized: loop unrolled 1 times

Before the patch we get:

*_45 1 times scalar_load costs 12 in prologue
u[_47] 1 times scalar_load costs 12 in prologue
_46 ? _ifc__1856 : 9 1 times scalar_stmt costs 4 in prologue
_ifc__1854 1 times scalar_store costs 12 in prologue

sudoku1[_6] 1 times scalar_load costs 12 in body
sudoku1[_6] 1 times scalar_load costs 12 in body
sudoku1[_6] 1 times vec_construct costs 4 in body
sudoku1[_6] 1 times scalar_load costs 12 in body
sudoku1[_6] 1 times scalar_load costs 12 in body
sudoku1[_6] 1 times vec_construct costs 4 in body
sudoku1[_6] 1 times scalar_load costs 12 in body 
sudoku1[_6] 1 times scalar_load costs 12 in body
sudoku1[_6] 1 times vec_construct costs 4 in body
sudoku1[_6] 1 times scalar_load costs 12 in body
sudoku1[_6] 1 times scalar_load costs 12 in body
sudoku1[_6] 1 times vec_construct costs 4 in body
_7 != 0 4 times vector_stmt costs 16 in body
<unknown> 1 times vector_load costs 12 in prologue
_8 ? 1 : 0 4 times vector_stmt costs 16 in body
<unknown> 1 times vector_load costs 12 in prologue
<unknown> 1 times vector_load costs 12 in prologue
(unsigned char) patt_2784 1 times vec_promote_demote costs 4 in body
(unsigned char) patt_2784 2 times vec_promote_demote costs 8 in body
patt_2785 1 times vector_store costs 12 in body


exchange2.fppized.f90:1003:32: note:  Cost model analysis:
  Vector inside of loop cost: 168
  Vector prologue cost: 36
  Vector epilogue cost: 28
  Scalar iteration cost: 28
  Scalar outside cost: 0
  Vector outside cost: 64
  prologue iterations: 0
  epilogue iterations: 1
  Calculated minimum iters for profitability: 7
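
As a cross-check, the totals in this cost model analysis can be reproduced by summing the per-statement lines of the dump above. The sketch below is plain arithmetic on the numbers quoted in this comment; the tuple layout and the names are my own and not anything GCC emits. The first four lines of the dump are left out, since they do not enter these totals.

```python
# Per-statement costs copied from the "before patch" dump.
# Each entry is (kind, cost, where); this layout is illustrative only.
before = (
    [("scalar_load", 12, "body")] * 8
    + [("vec_construct", 4, "body")] * 4
    + [
        ("vector_stmt", 16, "body"),         # _7 != 0, 4 times
        ("vector_stmt", 16, "body"),         # _8 ? 1 : 0, 4 times
        ("vec_promote_demote", 4, "body"),
        ("vec_promote_demote", 8, "body"),
        ("vector_store", 12, "body"),
        ("vector_load", 12, "prologue"),
        ("vector_load", 12, "prologue"),
        ("vector_load", 12, "prologue"),
    ]
)

def total(entries, where):
    """Sum the costs of all entries recorded for the given location."""
    return sum(cost for _kind, cost, loc in entries if loc == where)

inside = total(before, "body")        # 168, "Vector inside of loop cost"
prologue = total(before, "prologue")  # 36, "Vector prologue cost"
outside = prologue + 28               # plus the reported epilogue cost: 64
print(inside, prologue, outside)
```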

  <bb 3> [local count: 6974165]:
  # _1815 = PHI <_13(294), 1(2)>
  # _905 = PHI <_12(294), 0(2)>
  # ivtmp_1876 = PHI <ivtmp_1875(294), 9(2)>
  # ivtmp_2808 = PHI <ivtmp_2809(294), _2807(2)>
  # vectp_temp.3211_2840 = PHI <vectp_temp.3211_2841(294), &temp.862(2)>
  # ivtmp_2843 = PHI <ivtmp_2844(294), 0(2)>
  _5 = _1815 * 9; 
  _6 = _3 + _5;
  _2810 = MEM[(int *)ivtmp_2808];
  ivtmp_2811 = ivtmp_2808 + 36;
  _2812 = MEM[(int *)ivtmp_2811];
  ivtmp_2813 = ivtmp_2811 + 36;
  vect_cst__2814 = {_2810, _2812};
  _2815 = MEM[(int *)ivtmp_2813];
  ivtmp_2816 = ivtmp_2813 + 36;
  _2817 = MEM[(int *)ivtmp_2816];
  ivtmp_2818 = ivtmp_2816 + 36;
  vect_cst__2819 = {_2815, _2817};
  _2820 = MEM[(int *)ivtmp_2818];
  ivtmp_2821 = ivtmp_2818 + 36;
  _2822 = MEM[(int *)ivtmp_2821];
  ivtmp_2823 = ivtmp_2821 + 36;
  vect_cst__2824 = {_2820, _2822};
  _2825 = MEM[(int *)ivtmp_2823];
  ivtmp_2826 = ivtmp_2823 + 36;
  _2827 = MEM[(int *)ivtmp_2826];
  vect_cst__2828 = {_2825, _2827};
  mask__8.3207_2829 = { 0, 0 } != vect_cst__2814;
  mask__8.3207_2830 = { 0, 0 } != vect_cst__2819;
  mask__8.3207_2831 = { 0, 0 } != vect_cst__2824;
  mask__8.3207_2832 = { 0, 0 } != vect_cst__2828;
  vect_patt_2784.3208_2833 = VEC_COND_EXPR <mask__8.3207_2829, { 1, 1 }, { 0, 0 }>;
  vect_patt_2784.3208_2834 = VEC_COND_EXPR <mask__8.3207_2830, { 1, 1 }, { 0, 0 }>;
  vect_patt_2784.3208_2835 = VEC_COND_EXPR <mask__8.3207_2831, { 1, 1 }, { 0, 0 }>;
  vect_patt_2784.3208_2836 = VEC_COND_EXPR <mask__8.3207_2832, { 1, 1 }, { 0, 0 }>;
  vect_patt_2785.3210_2837 = VEC_PACK_TRUNC_EXPR <vect_patt_2784.3208_2833, vect_patt_2784.3208_2834>;
  vect_patt_2785.3210_2838 = VEC_PACK_TRUNC_EXPR <vect_patt_2784.3208_2835, vect_patt_2784.3208_2836>;
  vect_patt_2785.3209_2839 = VEC_PACK_TRUNC_EXPR <vect_patt_2785.3210_2837, vect_patt_2785.3210_2838>;
  _7 = sudoku1[_6];
  _8 = _7 != 0; 
  _10 = (sizetype) _905;
  _11 = &temp.862 + _10;
  MEM <vector(8) unsigned char> [(logical(kind=1) *)vectp_temp.3211_2840] = vect_patt_2785.3209_2839;
  _12 = _905 + 1; 
  _13 = _1815 + 1;
  ivtmp_1875 = ivtmp_1876 - 1;
  ivtmp_2809 = ivtmp_2808 + 288;
  vectp_temp.3211_2841 = vectp_temp.3211_2840 + 8;
  ivtmp_2844 = ivtmp_2843 + 1;
  if (ivtmp_2844 >= 1)
    goto <bb 580>; [100.00%]
  else
    goto <bb 294>; [0.00%]


After the patch:

*_45 1 times scalar_load costs 12 in prologue
u[_47] 1 times scalar_load costs 12 in prologue
_46 ? _ifc__1856 : 9 1 times scalar_stmt costs 8 in prologue
_ifc__1854 1 times scalar_store costs 12 in prologue
sudoku1[_6] 1 times scalar_load costs 12 in body 
sudoku1[_6] 1 times scalar_load costs 12 in body 
sudoku1[_6] 1 times vec_construct costs 4 in body
sudoku1[_6] 1 times scalar_load costs 12 in body 
sudoku1[_6] 1 times scalar_load costs 12 in body
sudoku1[_6] 1 times vec_construct costs 4 in body
sudoku1[_6] 1 times scalar_load costs 12 in body
sudoku1[_6] 1 times scalar_load costs 12 in body
sudoku1[_6] 1 times vec_construct costs 4 in body
sudoku1[_6] 1 times scalar_load costs 12 in body
sudoku1[_6] 1 times scalar_load costs 12 in body
sudoku1[_6] 1 times vec_construct costs 4 in body
_7 != 0 4 times vector_stmt costs 16 in body
<unknown> 1 times vector_load costs 12 in prologue
_8 ? 1 : 0 4 times vector_stmt costs 64 in body
<unknown> 1 times vector_load costs 12 in prologue
<unknown> 1 times vector_load costs 12 in prologue
(unsigned char) patt_2784 1 times vec_promote_demote costs 4 in body
(unsigned char) patt_2784 2 times vec_promote_demote costs 8 in body

  Vector inside of loop cost: 216 
  Vector prologue cost: 36
  Vector epilogue cost: 28
  Scalar iteration cost: 28
  Scalar outside cost: 0
  Vector outside cost: 64
  prologue iterations: 0
  epilogue iterations: 1

  <bb 3> [local count: 62767486]:
  # _1815 = PHI <_13(294), 1(2)>
  # _905 = PHI <_12(294), 0(2)>
  # ivtmp_1876 = PHI <ivtmp_1875(294), 9(2)>
  _5 = _1815 * 9;
  _6 = _3 + _5;
  _7 = sudoku1[_6];
  _8 = _7 != 0;
  _10 = (sizetype) _905;
  _11 = &temp.862 + _10;
  *_11 = _8;
  _12 = _905 + 1;
  _13 = _1815 + 1;
  ivtmp_1875 = ivtmp_1876 - 1;
  if (ivtmp_1875 == 0)
    goto <bb 230>; [11.11%]
  else
    goto <bb 294>; [88.89%]
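
Read back into source-like form, the scalar and the vectorized GIMPLE compute the same nine flags: a strided walk over sudoku1 (step of 9 elements, matching the 36-byte pointer bumps in the vectorized version) compared against zero. The Python model below is only a sketch; `base` stands in for `_3`, the starting address of the vectorized walk is assumed to match the scalar index, and the array contents are invented.

```python
def scalar_loop(sudoku1, base):
    """Model of the scalar <bb 3>: nine iterations, one flag stored each."""
    temp = [False] * 9
    row = 1                          # _1815 starts at 1
    for i in range(9):               # ivtmp_1876 counts down from 9
        temp[i] = sudoku1[base + row * 9] != 0
        row += 1
    return temp

def vectorized_shape(sudoku1, base):
    """Model of the vectorized form: one step of 8 strided loads,
    followed by a single scalar epilogue iteration."""
    temp = [False] * 9
    lanes = [sudoku1[base + (k + 1) * 9] for k in range(8)]  # 8 scalar loads
    temp[0:8] = [v != 0 for v in lanes]                      # compare + store
    temp[8] = sudoku1[base + 9 * 9] != 0                     # epilogue
    return temp

sudoku1 = [(i * 7) % 3 for i in range(100)]   # made-up contents
same = scalar_loop(sudoku1, 0) == vectorized_shape(sudoku1, 0)
print(same)
```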

So the loop iterates 9 times, and I guess the main reason why vectorizing it is
profitable is that the loop is eliminated completely.
Since we now cost the four _8 ? 1 : 0 vector statements at 64 instead of 16, we
decide not to vectorize.
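
A back-of-the-envelope check of that decision, using only the totals quoted in the two dumps. It assumes a vectorization factor of 8 (taken from the vector(8) store) and a trip count of 9, and it compares whole-loop costs directly. This is a simplification for illustration, not GCC's actual profitability code.

```python
SCALAR_ITER_COST = 28   # "Scalar iteration cost"
VEC_OUTSIDE_COST = 64   # "Vector outside cost" (prologue 36 + epilogue 28)
NITERS = 9              # trip count of the loop
VF = 8                  # assumption: taken from the vector(8) store
EPILOGUE_ITERS = 1      # "epilogue iterations"

def compare(vec_inside_cost):
    """Return (scalar_total, vector_total, vector_is_cheaper)."""
    scalar_total = SCALAR_ITER_COST * NITERS
    vector_iters = (NITERS - EPILOGUE_ITERS) // VF
    vector_total = vec_inside_cost * vector_iters + VEC_OUTSIDE_COST
    return scalar_total, vector_total, vector_total < scalar_total

print(compare(168))  # before patch: (252, 232, True)
print(compare(216))  # after patch:  (252, 280, False)
```

The 48-unit increase of the inside cost (64 - 16 for the conditional) is enough to push the vector total past the scalar total in this simplified model.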
