https://gcc.gnu.org/bugzilla/show_bug.cgi?id=83008

--- Comment #13 from rguenther at suse dot de <rguenther at suse dot de> ---
On Fri, 8 Dec 2017, sergey.shalnov at intel dot com wrote:

> And it uses xmm + vpbroadcastd to spill tmp[] to the stack
> ...
> 1e7:   62 d2 7d 08 7c c9       vpbroadcastd %r9d,%xmm1
>  1ed:   c4 c1 79 7e c9          vmovd  %xmm1,%r9d

^^^
this is an odd instruction ... it moves the low lane of %xmm1 back
into %r9d, which already holds exactly that value.

>  1f2:   62 f1 fd 08 7f 8c 24    vmovdqa64 %xmm1,-0x38(%rsp)
>  1f9:   c8 ff ff ff 
>  1fd:   62 f2 7d 08 7c d7       vpbroadcastd %edi,%xmm2
>  203:   c5 f9 7e d7             vmovd  %xmm2,%edi
>  207:   62 f1 fd 08 7f 94 24    vmovdqa64 %xmm2,-0x28(%rsp)
>  20e:   d8 ff ff ff 
>  212:   62 f2 7d 08 7c db       vpbroadcastd %ebx,%xmm3
>  218:   c5 f9 7e de             vmovd  %xmm3,%esi
>  21c:   62 f1 fd 08 7f 9c 24    vmovdqa64 %xmm3,-0x18(%rsp)
>  223:   e8 ff ff ff 
>  227:   01 fe                   add    %edi,%esi
>  229:   45 01 c8                add    %r9d,%r8d
>  22c:   41 01 f0                add    %esi,%r8d
>  22f:   8b 5c 24 dc             mov    -0x24(%rsp),%ebx
>  233:   03 5c 24 ec             add    -0x14(%rsp),%ebx
>  237:   8b 6c 24 bc             mov    -0x44(%rsp),%ebp
>  23b:   03 6c 24 cc             add    -0x34(%rsp),%ebp
> ...
> 
> I think this is better from a performance perspective but, as I said
> before, not using vector registers here is the best option if no loops
> are vectorized.

As I said, we have a basic-block vectorizer.  Do you propose to remove it?
What's the rationale for "not using vector registers ... if no loops [are]
vectorized"?

With AVX256/512 an additional cost of using vector registers is
the vzeroupper required at function boundaries.  Is this (one of)
the reasons?
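
A minimal sketch of where that cost shows up (a hypothetical
example of mine, not this PR's testcase): with -mavx2 the eight
stores below may be SLP-vectorized into a single 256-bit
broadcast plus store, at which point the function also has to
emit a vzeroupper before returning, which the scalar version
avoids.

  void
  fill8 (unsigned int *dst, unsigned int v)
  {
    /* Candidate for the basic-block (SLP) vectorizer: one ymm
       broadcast + store instead of eight scalar stores.  */
    dst[0] = v; dst[1] = v; dst[2] = v; dst[3] = v;
    dst[4] = v; dst[5] = v; dst[6] = v; dst[7] = v;
  }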

If it is the case that the instruction sequence

vpbroadcastd %ebx,%xmm3
vmovdqa64 %xmm3,-0x18(%rsp)

is slower than doing

  movl %ebx,-0x18(%rsp)
  movl %ebx,-0x14(%rsp)
  movl %ebx,-0x10(%rsp)
  movl %ebx,-0xc(%rsp)

then the costing in the backend needs to reflect that.
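
If someone wants to measure that, here is a rough, self-contained
microbenchmark sketch (mine, not attached to this PR; the inline
asm mirrors the two sequences above, the iteration count is
arbitrary, and the GPR-source vpbroadcastd needs AVX-512VL
hardware):

  #include <stdint.h>
  #include <stdio.h>
  #include <time.h>

  static uint32_t buf[4] __attribute__ ((aligned (16)));

  static double
  elapsed_ns (struct timespec a, struct timespec b)
  {
    return (b.tv_sec - a.tv_sec) * 1e9 + (b.tv_nsec - a.tv_nsec);
  }

  int
  main (void)
  {
    const long iters = 100000000;
    struct timespec t0, t1;
    uint32_t v = 42;

    /* Broadcast + one 16-byte store, as currently emitted.  */
    clock_gettime (CLOCK_MONOTONIC, &t0);
    for (long i = 0; i < iters; i++)
      __asm__ volatile ("vpbroadcastd %1, %%xmm0\n\t"
                        "vmovdqa %%xmm0, %0"
                        : "=m" (buf) : "r" (v) : "xmm0");
    clock_gettime (CLOCK_MONOTONIC, &t1);
    printf ("broadcast + vector store: %.0f ns\n", elapsed_ns (t0, t1));

    /* Four plain 32-bit stores, the alternative.  */
    clock_gettime (CLOCK_MONOTONIC, &t0);
    for (long i = 0; i < iters; i++)
      __asm__ volatile ("movl %2, (%1)\n\t"
                        "movl %2, 4(%1)\n\t"
                        "movl %2, 8(%1)\n\t"
                        "movl %2, 12(%1)"
                        : "=m" (buf) : "r" (buf), "r" (v));
    clock_gettime (CLOCK_MONOTONIC, &t1);
    printf ("four scalar stores:       %.0f ns\n", elapsed_ns (t0, t1));
    return 0;
  }

If the scalar variant consistently wins, the natural place to
encode that is presumably ix86_builtin_vectorization_cost in the
i386 backend.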

I see that the vectorization prevents eliding 'tmp' on GIMPLE,
but that's a completely separate issue (the value-numbering we
run after vectorization is poor):

  vect_cst__108 = {_48, _48, _48, _48};
  vect_cst__106 = {_349, _349, _349, _349};
  vect_cst__251 = {_97, _97, _97, _97};
  MEM[(unsigned int *)&tmp + 16B] = vect_cst__251;
  MEM[(unsigned int *)&tmp + 32B] = vect_cst__106;
  MEM[(unsigned int *)&tmp + 48B] = vect_cst__108;
  _292 = tmp[0][0];
  _291 = tmp[1][0];
...

those loads could have been elided, but DOM isn't powerful enough to
see that (and never will be).
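
A source-level sketch of what the value numbering would have to
see (a simplified, hypothetical example, not this PR's testcase):

  /* After SLP, the four stores become one vector store,
     MEM[(unsigned int *)&tmp + 16B] = {a, a, a, a};
     the load of tmp[4] then reads lane 0 of that store and
     could be folded to 'a', eliding the load and, with no
     other uses, 'tmp' itself.  */
  static unsigned int tmp[8];

  unsigned int
  f (unsigned int a)
  {
    tmp[4] = a;
    tmp[5] = a;
    tmp[6] = a;
    tmp[7] = a;
    return tmp[4];
  }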
