http://gcc.gnu.org/bugzilla/show_bug.cgi?id=50713
Richard Guenther <rguenth at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|UNCONFIRMED                 |NEW
   Last reconfirmed|                            |2011-10-13
                 CC|                            |irar at gcc dot gnu.org,
                   |                            |rth at gcc dot gnu.org
     Ever Confirmed|0                           |1

--- Comment #1 from Richard Guenther <rguenth at gcc dot gnu.org> 2011-10-13 09:55:59 UTC ---
(In reply to comment #0)
> in the following code
> for "float"
> the code generated by "dosum" and "dosuml" differs (dosum better)
> for "complex"
> "sum" does not vectorize! (a problem in itself)

sum vectorizes for me, quite optimally on the tree level:

<bb 2>:
  vect_var_.30_28 = MEM[(struct A *)&a];
  vect_var_.35_30 = MEM[(struct A *)&b];
  vect_var_.36_31 = vect_var_.35_30 + vect_var_.30_28;
  MEM[(struct A *)&D.2273] = vect_var_.36_31;
  return D.2273;

The asm isn't well optimized because of the way A is passed by value
by the ABI:

        movq    %xmm2, -56(%rsp)
        movq    %xmm3, -48(%rsp)
        movq    %xmm0, -40(%rsp)
        movq    %xmm1, -32(%rsp)

It seems the array elements are passed in the lower halves of separate
xmm registers ... likewise for the return value.  Passing A by reference
would probably fix this.  Like

A sum(const A &a, const A &b)
{
  A res;
  res.a[0] = a.a[0] + b.a[0];
  res.a[1] = a.a[1] + b.a[1];
  res.a[2] = a.a[2] + b.a[2];
  res.a[3] = a.a[3] + b.a[3];
  return res;
}

But NRV doesn't seem to kick in here and we still get the mangling for
the return value:

_Z3sumRK1AS1_:
.LFB0:
        .cfi_startproc
        movaps  (%rdi), %xmm0
        addps   (%rsi), %xmm0
        movaps  %xmm0, -56(%rsp)
        movq    -56(%rsp), %rax
        movaps  %xmm0, -24(%rsp)
        movq    %rax, -64(%rsp)
        movq    -16(%rsp), %xmm1
        movq    -64(%rsp), %xmm0
        ret

The case with double components has a saner ABI (passed in memory).
Not sure if we can improve the argument extraction / return generation
for the vectorized float case.  Richard?

> "dosum": excellent vectorization; "dosuml": same issue as with floats

Same ABI issue.

> if you have time please have a look at what happens with aligned(32)
> sse vs avx, float vs double… (not an urgent use case at the moment)

We currently cannot vectorize complex ops in scalar code in the form
they currently appear, because we fail to handle the complex type
appearing in

  a$a$0_46 = MEM[(struct A *)&a];
  D.2320_18 = REALPART_EXPR <a$a$0_46>;
  D.2321_19 = IMAGPART_EXPR <a$a$0_46>;

The loop case has

  D.2355_1 = REALPART_EXPR <a.a[i_14]>;
  D.2356_12 = IMAGPART_EXPR <a.a[i_14]>;

from the start, which we handle via interleaved loads.  The former
could be transformed to the latter during pattern recognition.  SRA
munges the non-loop case but leaves the loop case alone, thus you get
vectorization with -fno-tree-sra for some cases.
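
For reference, a minimal sketch of the by-value float variant under
discussion.  The struct layout is an assumption inferred from the
element accesses above; the actual testcase is in comment #0:

// assumed layout: a 16-byte aggregate of four floats
struct A { float a[4]; };

// by value: the x86-64 psABI splits each 16-byte struct into two
// eightbytes, passed in the low 64 bits of two xmm registers
// (a in %xmm0/%xmm1, b in %xmm2/%xmm3), hence the movq spills in
// the asm dump above
A sum(A a, A b)
{
  A res;
  res.a[0] = a.a[0] + b.a[0];
  res.a[1] = a.a[1] + b.a[1];
  res.a[2] = a.a[2] + b.a[2];
  res.a[3] = a.a[3] + b.a[3];
  return res;
}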
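
And a sketch of the two complex forms contrasted above; the type B and
the function names are hypothetical stand-ins for whatever comment #0
actually uses:

#include <complex>

struct B { std::complex<float> a[2]; };

// non-loop form: SRA scalarizes a.a[0]/a.a[1] into SSA names (the
// a$a$0_46 above), and the vectorizer fails on REALPART_EXPR /
// IMAGPART_EXPR applied to those scalars
B sum(B a, B b)
{
  B res;
  res.a[0] = a.a[0] + b.a[0];
  res.a[1] = a.a[1] + b.a[1];
  return res;
}

// loop form: REALPART_EXPR <a.a[i]> / IMAGPART_EXPR <a.a[i]> from the
// start, which the vectorizer handles via interleaved loads
B suml(const B &a, const B &b)
{
  B res;
  for (int i = 0; i < 2; ++i)
    res.a[i] = a.a[i] + b.a[i];
  return res;
}

Per the observation above, disabling SRA keeps the non-loop form in its
memory-access shape, so something like "g++ -O3 -fno-tree-sra" recovers
vectorization for some of these cases.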