https://gcc.gnu.org/bugzilla/show_bug.cgi?id=119919
Jan Hubicka <hubicka at gcc dot gnu.org> changed:

           What    |Removed                       |Added
----------------------------------------------------------------------------
     Ever confirmed|0                             |1
             Status|UNCONFIRMED                   |ASSIGNED
   Last reconfirmed|                              |2025-04-24
           Assignee|unassigned at gcc dot gnu.org |hubicka at gcc dot gnu.org

--- Comment #2 from Jan Hubicka <hubicka at gcc dot gnu.org> ---
This is with -O2 only. Difference is

+++ bbb 2025-04-24 16:21:25.029155295 +0200
@@ -108,10 +108,7 @@
 exchange2.fppized.f90:1027:58: optimized: loop vectorized using 16 byte vectors
 exchange2.fppized.f90:1019:71: optimized: loop vectorized using 8 byte vectors
 exchange2.fppized.f90:1016:55: optimized: loop vectorized using 16 byte vectors
-exchange2.fppized.f90:1003:32: optimized: loop vectorized using 8 byte vectors
 exchange2.fppized.f90:1123:83: optimized: loop with 1 iterations completely unrolled (header execution count 119292720)
-exchange2.fppized.f90:1003:32: optimized: loop turned into non-loop; it never loops
-exchange2.fppized.f90:1003:32: optimized: loop turned into non-loop; it never loops
 exchange2.fppized.f90:1203:51: optimized: loop unrolled 1 times
 exchange2.fppized.f90:1194:54: optimized: loop unrolled 1 times
 exchange2.fppized.f90:1185:57: optimized: loop unrolled 1 times

before patch we get

*_45 1 times scalar_load costs 12 in prologue
u[_47] 1 times scalar_load costs 12 in prologue
_46 ? _ifc__1856 : 9 1 times scalar_stmt costs 4 in prologue
_ifc__1854 1 times scalar_store costs 12 in prologue
sudoku1[_6] 1 times scalar_load costs 12 in body
sudoku1[_6] 1 times scalar_load costs 12 in body
sudoku1[_6] 1 times vec_construct costs 4 in body
sudoku1[_6] 1 times scalar_load costs 12 in body
sudoku1[_6] 1 times scalar_load costs 12 in body
sudoku1[_6] 1 times vec_construct costs 4 in body
sudoku1[_6] 1 times scalar_load costs 12 in body
sudoku1[_6] 1 times scalar_load costs 12 in body
sudoku1[_6] 1 times vec_construct costs 4 in body
sudoku1[_6] 1 times scalar_load costs 12 in body
sudoku1[_6] 1 times scalar_load costs 12 in body
sudoku1[_6] 1 times vec_construct costs 4 in body
_7 != 0 4 times vector_stmt costs 16 in body
<unknown> 1 times vector_load costs 12 in prologue
_8 ? 1 : 0 4 times vector_stmt costs 16 in body
<unknown> 1 times vector_load costs 12 in prologue
<unknown> 1 times vector_load costs 12 in prologue
(unsigned char) patt_2784 1 times vec_promote_demote costs 4 in body
(unsigned char) patt_2784 2 times vec_promote_demote costs 8 in body
patt_2785 1 times vector_store costs 12 in body
exchange2.fppized.f90:1003:32: note: Cost model analysis:
  Vector inside of loop cost: 168
  Vector prologue cost: 36
  Vector epilogue cost: 28
  Scalar iteration cost: 28
  Scalar outside cost: 0
  Vector outside cost: 64
  prologue iterations: 0
  epilogue iterations: 1
  Calculated minimum iters for profitability: 7

  <bb 3> [local count: 6974165]:
  # _1815 = PHI <_13(294), 1(2)>
  # _905 = PHI <_12(294), 0(2)>
  # ivtmp_1876 = PHI <ivtmp_1875(294), 9(2)>
  # ivtmp_2808 = PHI <ivtmp_2809(294), _2807(2)>
  # vectp_temp.3211_2840 = PHI <vectp_temp.3211_2841(294), &temp.862(2)>
  # ivtmp_2843 = PHI <ivtmp_2844(294), 0(2)>
  _5 = _1815 * 9;
  _6 = _3 + _5;
  _2810 = MEM[(int *)ivtmp_2808];
  ivtmp_2811 = ivtmp_2808 + 36;
  _2812 = MEM[(int *)ivtmp_2811];
  ivtmp_2813 = ivtmp_2811 + 36;
  vect_cst__2814 = {_2810, _2812};
  _2815 = MEM[(int *)ivtmp_2813];
  ivtmp_2816 = ivtmp_2813 + 36;
  _2817 = MEM[(int *)ivtmp_2816];
  ivtmp_2818 = ivtmp_2816 + 36;
  vect_cst__2819 = {_2815, _2817};
  _2820 = MEM[(int *)ivtmp_2818];
  ivtmp_2821 = ivtmp_2818 + 36;
  _2822 = MEM[(int *)ivtmp_2821];
  ivtmp_2823 = ivtmp_2821 + 36;
  vect_cst__2824 = {_2820, _2822};
  _2825 = MEM[(int *)ivtmp_2823];
  ivtmp_2826 = ivtmp_2823 + 36;
  _2827 = MEM[(int *)ivtmp_2826];
  vect_cst__2828 = {_2825, _2827};
  mask__8.3207_2829 = { 0, 0 } != vect_cst__2814;
  mask__8.3207_2830 = { 0, 0 } != vect_cst__2819;
  mask__8.3207_2831 = { 0, 0 } != vect_cst__2824;
  mask__8.3207_2832 = { 0, 0 } != vect_cst__2828;
  vect_patt_2784.3208_2833 = VEC_COND_EXPR <mask__8.3207_2829, { 1, 1 }, { 0, 0 }>;
  vect_patt_2784.3208_2834 = VEC_COND_EXPR <mask__8.3207_2830, { 1, 1 }, { 0, 0 }>;
  vect_patt_2784.3208_2835 = VEC_COND_EXPR <mask__8.3207_2831, { 1, 1 }, { 0, 0 }>;
  vect_patt_2784.3208_2836 = VEC_COND_EXPR <mask__8.3207_2832, { 1, 1 }, { 0, 0 }>;
  vect_patt_2785.3210_2837 = VEC_PACK_TRUNC_EXPR <vect_patt_2784.3208_2833, vect_patt_2784.3208_2834>;
  vect_patt_2785.3210_2838 = VEC_PACK_TRUNC_EXPR <vect_patt_2784.3208_2835, vect_patt_2784.3208_2836>;
  vect_patt_2785.3209_2839 = VEC_PACK_TRUNC_EXPR <vect_patt_2785.3210_2837, vect_patt_2785.3210_2838>;
  _7 = sudoku1[_6];
  _8 = _7 != 0;
  _10 = (sizetype) _905;
  _11 = &temp.862 + _10;
  MEM <vector(8) unsigned char> [(logical(kind=1) *)vectp_temp.3211_2840] = vect_patt_2785.3209_2839;
  _12 = _905 + 1;
  _13 = _1815 + 1;
  ivtmp_1875 = ivtmp_1876 - 1;
  ivtmp_2809 = ivtmp_2808 + 288;
  vectp_temp.3211_2841 = vectp_temp.3211_2840 + 8;
  ivtmp_2844 = ivtmp_2843 + 1;
  if (ivtmp_2844 >= 1)
    goto <bb 580>; [100.00%]
  else
    goto <bb 294>; [0.00%]

after patch

*_45 1 times scalar_load costs 12 in prologue
u[_47] 1 times scalar_load costs 12 in prologue
_46 ? _ifc__1856 : 9 1 times scalar_stmt costs 8 in prologue
_ifc__1854 1 times scalar_store costs 12 in prologue
sudoku1[_6] 1 times scalar_load costs 12 in body
sudoku1[_6] 1 times scalar_load costs 12 in body
sudoku1[_6] 1 times vec_construct costs 4 in body
sudoku1[_6] 1 times scalar_load costs 12 in body
sudoku1[_6] 1 times scalar_load costs 12 in body
sudoku1[_6] 1 times vec_construct costs 4 in body
sudoku1[_6] 1 times scalar_load costs 12 in body
sudoku1[_6] 1 times scalar_load costs 12 in body
sudoku1[_6] 1 times vec_construct costs 4 in body
sudoku1[_6] 1 times scalar_load costs 12 in body
sudoku1[_6] 1 times scalar_load costs 12 in body
sudoku1[_6] 1 times vec_construct costs 4 in body
_7 != 0 4 times vector_stmt costs 16 in body
<unknown> 1 times vector_load costs 12 in prologue
_8 ? 1 : 0 4 times vector_stmt costs 64 in body
<unknown> 1 times vector_load costs 12 in prologue
<unknown> 1 times vector_load costs 12 in prologue
(unsigned char) patt_2784 1 times vec_promote_demote costs 4 in body
(unsigned char) patt_2784 2 times vec_promote_demote costs 8 in body
  Vector inside of loop cost: 216
  Vector prologue cost: 36
  Vector epilogue cost: 28
  Scalar iteration cost: 28
  Scalar outside cost: 0
  Vector outside cost: 64
  prologue iterations: 0
  epilogue iterations: 1

  <bb 3> [local count: 62767486]:
  # _1815 = PHI <_13(294), 1(2)>
  # _905 = PHI <_12(294), 0(2)>
  # ivtmp_1876 = PHI <ivtmp_1875(294), 9(2)>
  _5 = _1815 * 9;
  _6 = _3 + _5;
  _7 = sudoku1[_6];
  _8 = _7 != 0;
  _10 = (sizetype) _905;
  _11 = &temp.862 + _10;
  *_11 = _8;
  _12 = _905 + 1;
  _13 = _1815 + 1;
  ivtmp_1875 = ivtmp_1876 - 1;
  if (ivtmp_1875 == 0)
    goto <bb 230>; [11.11%]
  else
    goto <bb 294>; [88.89%]

So the loop iterates 9 times, and I guess the main reason vectorizing it is
profitable is that the loop is then eliminated entirely. Since we now cost
_8 ? 1 : 0 4 times as 64 instead of 16, we decide not to vectorize.
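
The arithmetic behind that decision can be checked with a rough sketch. This is not GCC's actual profitability formula, only a simplified comparison under stated assumptions: the loop runs 9 times, one vector iteration handles 8 elements (the store is vector(8) unsigned char), and the remaining iteration is covered by the "Vector outside cost" of 64. The names COMMON_BODY, vector_total and scalar_total are made up for illustration. The after-patch listing does not show the vector_store line; the sketch assumes its cost of 12 is unchanged, which agrees with the reported total of 216.

```python
# Simplified cost comparison using the numbers from the dumps above.
# Assumption: total vector cost = inside-of-loop cost (one vector
# iteration) + "Vector outside cost"; GCC's real check is more involved.

# Body costs that are identical before and after the patch.
COMMON_BODY = (
    [("scalar_load", 12)] * 8          # eight strided loads of sudoku1[_6]
    + [("vec_construct", 4)] * 4       # four 2-element vector builds
    + [
        ("_7 != 0, 4 vector_stmt", 16),
        ("vec_promote_demote x1", 4),
        ("vec_promote_demote x2", 8),
        ("vector_store", 12),          # assumed present after the patch too
    ]
)

SCALAR_ITER_COST = 28   # "Scalar iteration cost" from the dump
NITERS = 9              # the loop iterates 9 times
VECTOR_OUTSIDE = 64     # "Vector outside cost" (prologue 36 + epilogue 28)


def vector_total(cond_cost):
    """Return (inside cost, inside + outside) for a given `_8 ? 1 : 0` cost."""
    inside = sum(cost for _, cost in COMMON_BODY) + cond_cost
    return inside, inside + VECTOR_OUTSIDE


scalar_total = SCALAR_ITER_COST * NITERS

for label, cond_cost in (("before patch", 16), ("after patch", 64)):
    inside, total = vector_total(cond_cost)
    verdict = "vectorize" if total < scalar_total else "keep scalar"
    print(f"{label}: inside={inside} total={total} "
          f"scalar={scalar_total} -> {verdict}")
```

Under these assumptions the vector version costs 232 against 252 for nine scalar iterations before the patch, and 280 against 252 after it, which is consistent with the loop no longer being vectorized.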