https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104010
Richard Earnshaw <rearnsha at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|WAITING                     |NEW

--- Comment #6 from Richard Earnshaw <rearnsha at gcc dot gnu.org> ---
The reason this wasn't reproducible is that there is a typo in the testcase: the loop iteration count should be 2, not 4. The clues are in the function name and in the generated assembly code, both of which show 2 iterations of the loop.

Changing the test to:

void test_vcmpeq_s32x2 (int32_t * __restrict__ dest, int32_t *a, int32_t *b)
{
  int i;
  for (i=0; i<2; i++) {
    dest[i] = a[i] == b[i];
  }
}

does indeed show a regression between gcc-11 and trunk. With gcc-11 the costing shows:

vect.c:5:13: note: Cost model analysis:
0x2f0a780 _28 1 times scalar_store costs 1 in body
0x2f0a780 _41 1 times scalar_store costs 1 in body
0x2f0a780 (int) _26 1 times scalar_stmt costs 1 in body
0x2f0a780 (int) _39 1 times scalar_stmt costs 1 in body
0x2f0a780 _23 == _25 1 times scalar_stmt costs 1 in body
0x2f0a780 _36 == _38 1 times scalar_stmt costs 1 in body
0x2f0a780 *a_13(D) 1 times scalar_load costs 1 in body
0x2f0a780 MEM[(int *)a_13(D) + 4B] 1 times scalar_load costs 1 in body
0x2f0a780 *b_14(D) 1 times scalar_load costs 1 in body
0x2f0a780 MEM[(int *)b_14(D) + 4B] 1 times scalar_load costs 1 in body
0x2f0a780 *a_13(D) 1 times unaligned_load (misalign -1) costs 1 in body
0x2f0a780 *b_14(D) 1 times unaligned_load (misalign -1) costs 1 in body
0x2f0a780 _23 == _25 1 times vector_stmt costs 1 in body
0x2f0a780 _26 ? 1 : 0 1 times vector_stmt costs 1 in body
0x2f0a780 <unknown> 1 times vector_load costs 1 in prologue
0x2f0a780 <unknown> 1 times vector_load costs 1 in prologue
0x2f0a780 _28 1 times unaligned_store (misalign -1) costs 1 in body
vect.c:5:13: note: Cost model analysis for part in loop 0:
  Vector cost: 7
  Scalar cost: 10

While trunk shows:

vect.c:5:13: note: Cost model analysis:
_28 1 times scalar_store costs 1 in body
_41 1 times scalar_store costs 1 in body
(int) _26 1 times scalar_stmt costs 1 in body
(int) _39 1 times scalar_stmt costs 1 in body
*a_13(D) 1 times unaligned_load (misalign -1) costs 1 in body
*b_14(D) 1 times unaligned_load (misalign -1) costs 1 in body
_23 == _25 1 times vector_stmt costs 1 in body
_26 ? 1 : 0 1 times vector_stmt costs 1 in body
node 0x3bc5078 1 times vector_load costs 1 in prologue
node 0x3bc5100 1 times vector_load costs 1 in prologue
_28 1 times unaligned_store (misalign -1) costs 1 in body
vect.c:5:13: note: Cost model analysis for part in loop 0:
  Vector cost: 7
  Scalar cost: 4
vect.c:5:13: missed: not vectorized: vectorization is not profitable.

Now the question is: why has the scalar cost been so dramatically reduced?