https://gcc.gnu.org/bugzilla/show_bug.cgi?id=83008
--- Comment #13 from rguenther at suse dot de <rguenther at suse dot de> ---
On Fri, 8 Dec 2017, sergey.shalnov at intel dot com wrote:

> And it uses xmm+ vpbroadcastd to spill tmp[] to stack
> ...
>  1e7:  62 d2 7d 08 7c c9        vpbroadcastd %r9d,%xmm1
>  1ed:  c4 c1 79 7e c9           vmovd  %xmm1,%r9d

^^^ this is an odd instruction ...

>  1f2:  62 f1 fd 08 7f 8c 24     vmovdqa64 %xmm1,-0x38(%rsp)
>  1f9:  c8 ff ff ff
>  1fd:  62 f2 7d 08 7c d7        vpbroadcastd %edi,%xmm2
>  203:  c5 f9 7e d7              vmovd  %xmm2,%edi
>  207:  62 f1 fd 08 7f 94 24     vmovdqa64 %xmm2,-0x28(%rsp)
>  20e:  d8 ff ff ff
>  212:  62 f2 7d 08 7c db        vpbroadcastd %ebx,%xmm3
>  218:  c5 f9 7e de              vmovd  %xmm3,%esi
>  21c:  62 f1 fd 08 7f 9c 24     vmovdqa64 %xmm3,-0x18(%rsp)
>  223:  e8 ff ff ff
>  227:  01 fe                    add    %edi,%esi
>  229:  45 01 c8                 add    %r9d,%r8d
>  22c:  41 01 f0                 add    %esi,%r8d
>  22f:  8b 5c 24 dc              mov    -0x24(%rsp),%ebx
>  233:  03 5c 24 ec              add    -0x14(%rsp),%ebx
>  237:  8b 6c 24 bc              mov    -0x44(%rsp),%ebp
>  23b:  03 6c 24 cc              add    -0x34(%rsp),%ebp
> ...
>
> I think this is better in case of performance perspective but, as I said
> before, not using vector registers here is the best option if no loops
> vectorized.

As I said, we have a basic-block vectorizer.  Do you propose to remove it?
What's the rationale for "not using vector registers ... if no loops [are]
vectorized"?  With AVX256/512, an additional cost of using vector registers
is the vzeroupper required at function boundaries.  Is this (one of) the
reasons?

If it is the case that the instruction sequence

  vpbroadcastd %ebx,%xmm3
  vmovdqa64    %xmm3,-0x18(%rsp)

is slower than doing

  movl %ebx,-0x18(%rsp)
  movl %ebx,-0x22(%rsp)
  movl %ebx,-0x26(%rsp)
  movl %ebx,-0x30(%rsp)

then the costing in the backend needs to reflect that.

I see that the vectorization prevents eliding 'tmp' on GIMPLE, but that's a
completely separate issue (the value-numbering we run after vectorization
is poor):

  vect_cst__108 = {_48, _48, _48, _48};
  vect_cst__106 = {_349, _349, _349, _349};
  vect_cst__251 = {_97, _97, _97, _97};
  MEM[(unsigned int *)&tmp + 16B] = vect_cst__251;
  MEM[(unsigned int *)&tmp + 32B] = vect_cst__106;
  MEM[(unsigned int *)&tmp + 48B] = vect_cst__108;
  _292 = tmp[0][0];
  _291 = tmp[1][0];
  ...

Those loads could have been elided, but DOM isn't powerful enough to see
that (and never will be).
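
To relate the GIMPLE above back to source, here is a minimal C sketch
(hypothetical -- not the reproducer attached to this PR; the function name,
signature and array shape are made up) of the problematic shape: groups of
contiguous stores of a single scalar are what the basic-block (SLP)
vectorizer turns into a broadcast plus one vector store, and the
element-wise reloads afterwards are what the post-vectorization
value-numbering/DOM fails to forward.

  /* Hypothetical sketch, not the testcase from this PR.  Each row of
     'tmp' is filled with one scalar, which BB SLP can turn into a
     broadcast ({x, x, x, x}) plus a single 16-byte vector store, as
     in the disassembly quoted above.  */
  unsigned int
  foo (unsigned int a, unsigned int b, unsigned int c, unsigned int d)
  {
    unsigned int tmp[4][4];

    tmp[0][0] = a; tmp[0][1] = a; tmp[0][2] = a; tmp[0][3] = a;
    tmp[1][0] = b; tmp[1][1] = b; tmp[1][2] = b; tmp[1][3] = b;
    tmp[2][0] = c; tmp[2][1] = c; tmp[2][2] = c; tmp[2][3] = c;
    tmp[3][0] = d; tmp[3][1] = d; tmp[3][2] = d; tmp[3][3] = d;

    /* Scalar reloads of values that are known without going through
       memory.  With scalar stores they get forwarded and 'tmp' is
       elided; with the vector stores above they survive.  */
    return tmp[0][0] + tmp[1][0] + tmp[2][0] + tmp[3][0];
  }

Presumably a sketch like this, compiled at -O3 with an AVX-512-capable
-march, shows the vpbroadcastd/vmovdqa64 stores, while -fno-tree-slp-vectorize
keeps the stores scalar so the loads are forwarded and 'tmp' is elided; that
contrast is what the costing question above is about.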