http://gcc.gnu.org/bugzilla/show_bug.cgi?id=50713
Richard Guenther <rguenth at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|UNCONFIRMED                 |NEW
   Last reconfirmed|                            |2011-10-13
                 CC|                            |irar at gcc dot gnu.org,
                   |                            |rth at gcc dot gnu.org
     Ever Confirmed|0                           |1

--- Comment #1 from Richard Guenther <rguenth at gcc dot gnu.org> 2011-10-13 09:55:59 UTC ---
(In reply to comment #0)
> in the following code
> for "float"
> the code generated by "dosum" and "dosuml" differs (dosum better)
> for "complex"
> "sum" does not vectorize! (a problem in itself)

sum vectorizes for me, quite optimally on the tree level:

<bb 2>:
  vect_var_.30_28 = MEM[(struct A *)&a];
  vect_var_.35_30 = MEM[(struct A *)&b];
  vect_var_.36_31 = vect_var_.35_30 + vect_var_.30_28;
  MEM[(struct A *)&D.2273] = vect_var_.36_31;
  return D.2273;

The asm isn't well optimized because of the way A is passed by value
by the ABI:

        movq    %xmm2, -56(%rsp)
        movq    %xmm3, -48(%rsp)
        movq    %xmm0, -40(%rsp)
        movq    %xmm1, -32(%rsp)

It seems the array elements are passed in the lower halves of separate
xmm registers ... likewise for the return value.  Passing A by reference
would probably fix this.  Like

A sum(const A &a, const A &b)
{
  A res;
  res.a[0] = a.a[0] + b.a[0];
  res.a[1] = a.a[1] + b.a[1];
  res.a[2] = a.a[2] + b.a[2];
  res.a[3] = a.a[3] + b.a[3];
  return res;
}

But NRV doesn't seem to kick in here and we still get the mangling for
the return value:

_Z3sumRK1AS1_:
.LFB0:
        .cfi_startproc
        movaps  (%rdi), %xmm0
        addps   (%rsi), %xmm0
        movaps  %xmm0, -56(%rsp)
        movq    -56(%rsp), %rax
        movaps  %xmm0, -24(%rsp)
        movq    %rax, -64(%rsp)
        movq    -16(%rsp), %xmm1
        movq    -64(%rsp), %xmm0
        ret

The case with double components has a saner ABI (passed in memory).
Not sure if we can improve the argument extraction / return generation
for the vectorized float case.  Richard?

> "dosum": excellent vectorization; "dosuml": same issue as with floats

Same ABI issue.

> if you have time please have a look at what happens with aligned(32)
> sse vs avx, float vs double… (not an urgent use case at the moment)

We currently cannot vectorize complex ops in scalar code in the form
they currently appear, because we fail to handle the complex type
appearing in

  a$a$0_46 = MEM[(struct A *)&a];
  D.2320_18 = REALPART_EXPR <a$a$0_46>;
  D.2321_19 = IMAGPART_EXPR <a$a$0_46>;

The loop case has

  D.2355_1 = REALPART_EXPR <a.a[i_14]>;
  D.2356_12 = IMAGPART_EXPR <a.a[i_14]>;

from the start, which we handle via interleaved loads.  The former
could be transformed to the latter during pattern recognition.  SRA
munges the non-loop case but leaves the loop case alone, thus you get
vectorization with -fno-tree-sra for some cases.
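
For reference, a minimal sketch of the by-value float variant under
discussion.  The struct layout is an assumption inferred from the
element accesses above; the actual testcase is in comment #0:

// assumed layout: a 16-byte aggregate of four floats
struct A { float a[4]; };

// by value: the x86-64 psABI splits each 16-byte struct into two
// eightbytes, passed in the low 64 bits of two xmm registers
// (a in %xmm0/%xmm1, b in %xmm2/%xmm3), hence the movq spills in
// the asm dump above
A sum(A a, A b)
{
  A res;
  res.a[0] = a.a[0] + b.a[0];
  res.a[1] = a.a[1] + b.a[1];
  res.a[2] = a.a[2] + b.a[2];
  res.a[3] = a.a[3] + b.a[3];
  return res;
}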
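
And a sketch of the two complex forms contrasted above; the type B and
the function names are hypothetical stand-ins for whatever comment #0
actually uses:

#include <complex>

struct B { std::complex<float> a[2]; };

// non-loop form: SRA scalarizes a.a[0]/a.a[1] into SSA names (the
// a$a$0_46 above), and the vectorizer fails on REALPART_EXPR /
// IMAGPART_EXPR applied to those scalars
B sum(B a, B b)
{
  B res;
  res.a[0] = a.a[0] + b.a[0];
  res.a[1] = a.a[1] + b.a[1];
  return res;
}

// loop form: REALPART_EXPR <a.a[i]> / IMAGPART_EXPR <a.a[i]> from the
// start, which the vectorizer handles via interleaved loads
B suml(const B &a, const B &b)
{
  B res;
  for (int i = 0; i < 2; ++i)
    res.a[i] = a.a[i] + b.a[i];
  return res;
}

Per the observation above, disabling SRA keeps the non-loop form in its
memory-access shape, so something like "g++ -O3 -fno-tree-sra" recovers
vectorization for some of these cases.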