https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111064
--- Comment #4 from Hongtao.liu <crazylht at gmail dot com> ---
The loop is like

double
foo (double* a, unsigned* b, double* c, int n)
{
  double sum = 0;
  for (int i = 0; i != n; i++)
    {
      sum += a[i] * c[b[i]];
    }
  return sum;
}

After disabling gather, it uses scalar gather emulation, and the cost model finds that profitable only for xmm, not ymm, which causes the regression. When manually adding -fno-vect-cost-model, the regression is almost gone.

microbenchmark data

[liuhongt@intel gather_emulation]$ ./gather.out ;./nogather_xmm.out;./nogather_ymm.out
elapsed time: 1.75997 seconds for gather with 30000000 iterations
elapsed time: 2.42473 seconds for no_gather_xmm with 30000000 iterations
elapsed time: 1.86436 seconds for no_gather_ymm with 30000000 iterations

And I looked at the cost model

_13 + sum_24 1 times scalar_to_vec costs 4 in prologue
_13 + sum_24 1 times vector_stmt costs 16 in epilogue
_13 + sum_24 1 times vec_to_scalar costs 4 in epilogue
_13 + sum_24 2 times vector_stmt costs 32 in body
*_3 1 times unaligned_load (misalign -1) costs 16 in body
*_3 1 times unaligned_load (misalign -1) costs 16 in body
*_7 1 times unaligned_load (misalign -1) costs 16 in body
(long unsigned int) _8 2 times vec_promote_demote costs 8 in body
*_11 4 times vec_to_scalar costs 80 in body
*_11 4 times scalar_load costs 64 in body
*_11 1 times vec_construct costs 120 in body
*_11 4 times vec_to_scalar costs 80 in body
*_11 4 times scalar_load costs 64 in body
*_11 1 times vec_construct costs 120 in body
_4 * _12 2 times vector_stmt costs 32 in body
test.c:6:21: note: operating on full vectors.
test.c:6:21: note: cost model: epilogue peel iters set to vf/2 because loop iterations are unknown .
*_3 4 times scalar_load costs 64 in epilogue
*_7 4 times scalar_load costs 48 in epilogue
(long unsigned int) _8 4 times scalar_stmt costs 16 in epilogue
*_11 4 times scalar_load costs 64 in epilogue
_4 * _12 4 times scalar_stmt costs 64 in epilogue
_13 + sum_24 4 times scalar_stmt costs 64 in epilogue
<unknown> 1 times cond_branch_taken costs 12 in epilogue
test.c:6:21: note: Cost model analysis:
  Vector inside of loop cost: 648
  Vector prologue cost: 4
  Vector epilogue cost: 352
  Scalar iteration cost: 80
  Scalar outside cost: 24
  Vector outside cost: 356
  prologue iterations: 0
  epilogue iterations: 4
test.c:6:21: missed: cost model: the vector iteration cost = 648 divided by the scalar iteration cost = 80 is greater or equal to the vectorization factor = 8.

For the gather emulation part, it tries to generate the following:

<bb 18> [local count: 83964060]:
  bnd.23_154 = niters.22_130 >> 2;
  _165 = (sizetype) _65;
  _166 = _165 * 8;
  vectp_a.28_164 = a_18(D) + _166;
  _174 = _165 * 4;
  vectp_b.32_172 = b_19(D) + _174;
  _180 = (sizetype) c_20(D);
  vect__33.29_169 = MEM <vector(2) double> [(double *)vectp_a.28_164];
  vectp_a.27_170 = vectp_a.28_164 + 16;
  vect__33.30_171 = MEM <vector(2) double> [(double *)vectp_a.27_170];
  vect__30.33_177 = MEM <vector(4) unsigned int> [(unsigned int *)vectp_b.32_172];
  vect__29.34_178 = [vec_unpack_lo_expr] vect__30.33_177;
  vect__29.34_179 = [vec_unpack_hi_expr] vect__30.33_177;
  _181 = BIT_FIELD_REF <vect__29.34_178, 64, 0>;
  _182 = _181 * 8;
  _183 = _180 + _182;
  _184 = (void *) _183;
  _185 = MEM[(double *)_184];
  _186 = BIT_FIELD_REF <vect__29.34_178, 64, 64>;
  _187 = _186 * 8;
  _188 = _180 + _187;
  _189 = (void *) _188;
  _190 = MEM[(double *)_189];
  vect__23.35_191 = {_185, _190};
  _192 = BIT_FIELD_REF <vect__29.34_179, 64, 0>;
  _193 = _192 * 8;
  _194 = _180 + _193;
  _195 = (void *) _194;
  _196 = MEM[(double *)_195];
  _197 = BIT_FIELD_REF <vect__29.34_179, 64, 64>;
  _198 = _197 * 8;
  _199 = _180 + _198;
  _200 = (void *) _199;
  _201 = MEM[(double *)_200];
  vect__23.36_202 = {_196, _201};
  vect__15.37_203 = vect__33.29_169 * vect__23.35_191;
  vect__15.37_204 = vect__33.30_171 * vect__23.36_202;
  vect_sum_14.38_205 = _162 + vect__15.37_203;
  vect_sum_14.38_206 = vect__15.37_204 + vect_sum_14.38_205;
  _208 = .REDUC_PLUS (vect_sum_14.38_206);
  niters_vector_mult_vf.24_155 = bnd.23_154 << 2;
  _157 = (int) niters_vector_mult_vf.24_155;
  tmp.25_156 = i_60 + _157;
  if (niters.22_130 == niters_vector_mult_vf.24_155)

So there's 1 unaligned_load for the index vector (cost 16), 2 times vec_promote_demote (cost 8), and 8 times vec_to_scalar (cost 160) to extract the index for each element. But why do we need that? It's just 8 times scalar_load (cost 128) for the indices; there is no need to load them as a vector and then do vec_promote_demote + vec_to_scalar. If we calculate the cost model correctly, the total cost is 595 < 640 (scalar iteration cost 80 * VF 8), so ymm gather emulation is still profitable.
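
As a rough check of the arithmetic above, the sketch below sums the "in body" costs from the ymm dump and then replaces the index handling (vector load of the indices, vec_promote_demote, vec_to_scalar) with a flat cost of 128 for eight scalar loads. The struct, the array and the function names are hypothetical and exist only for this illustration; this is not how the vectorizer computes costs internally. Taking the quoted per-entry figures literally gives an adjusted total of 592 rather than the 595 quoted above; either value is below the 640 threshold.

```c
#include <assert.h>
#include <stddef.h>

/* One "in body" line of the cost dump.  index_handling marks entries that
   only exist to get the gather indices out of a vector (an assumption made
   for this sketch, following the description in the comment above).  */
struct cost_entry {
  const char *kind;
  int cost;
  int index_handling;
};

/* Body costs copied from the ymm (VF = 8) dump quoted above.  */
static const struct cost_entry body[] = {
  { "vector_stmt (_13 + sum_24)",    32, 0 },
  { "unaligned_load (*_3)",          16, 0 },
  { "unaligned_load (*_3)",          16, 0 },
  { "unaligned_load (*_7, indices)", 16, 1 },
  { "vec_promote_demote (_8)",        8, 1 },
  { "vec_to_scalar (*_11)",          80, 1 },
  { "scalar_load (*_11)",            64, 0 },
  { "vec_construct (*_11)",         120, 0 },
  { "vec_to_scalar (*_11)",          80, 1 },
  { "scalar_load (*_11)",            64, 0 },
  { "vec_construct (*_11)",         120, 0 },
  { "vector_stmt (_4 * _12)",        32, 0 },
};

enum {
  SCALAR_ITER_COST = 80,    /* "Scalar iteration cost" from the dump */
  VF = 8,                   /* vectorization factor from the dump */
  INDEX_SCALAR_LOADS = 128  /* 8 x scalar_load, figure quoted in the comment */
};

/* Sum of all body entries; should match "Vector inside of loop cost".  */
int current_body_cost (void)
{
  int total = 0;
  for (size_t i = 0; i < sizeof body / sizeof body[0]; i++)
    total += body[i].cost;
  assert (total == 648);
  return total;
}

/* Body cost if the indices were fetched with 8 scalar loads instead.  */
int adjusted_body_cost (void)
{
  int total = INDEX_SCALAR_LOADS;
  for (size_t i = 0; i < sizeof body / sizeof body[0]; i++)
    if (!body[i].index_handling)
      total += body[i].cost;
  return total;
}

/* Vectorization is rejected when body cost >= scalar cost * VF.  */
int scalar_threshold (void)
{
  return SCALAR_ITER_COST * VF;
}
```

Comparing current_body_cost () and adjusted_body_cost () against scalar_threshold () reproduces the decision described above: 648 is not below 640, while the adjusted figure is.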