https://gcc.gnu.org/bugzilla/show_bug.cgi?id=74585
Richard Biener <rguenth at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
           Keywords|                            |missed-optimization
             Status|UNCONFIRMED                 |NEW
   Last reconfirmed|                            |2016-08-12
   Target Milestone|7.0                         |---
            Summary|[5/6/7] Tree-sra forces     |SRA forces parameters to
                   |parameters to memory        |memory causing awful code
                   |causing awful code          |generation
                   |generation                  |
     Ever confirmed|0                           |1

--- Comment #1 from Richard Biener <rguenth at gcc dot gnu.org> ---
The issue is not SRA pushing things to memory - it doesn't.  The issue is
that in the GIMPLE IL the parameter appears as "memory" as it is an
aggregate type.  The issue is really RTL expansion of the argument (not
sure where that's done) which doesn't take into account that we could
happily expand

  a$vx0_27 = MEM[(struct *)&a];
  a$vx1_28 = MEM[(struct *)&a + 16B];
  a$vx2_29 = MEM[(struct *)&a + 32B];
  a$vx3_30 = MEM[(struct *)&a + 48B];

if a is expanded to registers.

Note comparing assembler with/without -fno-tree-sra doesn't really show me
the obvious badness:

 test_vecd8_rotate_left:
-       addi 6,1,-288
-       li 9,160
+       addi 6,1,-224
+       li 7,96
        xxpermdi 34,34,34,2
        xxpermdi 35,35,35,2
-       li 10,176
+       li 8,112
+       li 10,128
        xxpermdi 36,36,36,2
        xxpermdi 37,37,37,2
+       stxvd2x 34,6,7
+       li 9,144
+       vspltisw 0,0
+       stxvd2x 35,6,8
+       stxvd2x 36,6,10
+       stxvd2x 37,6,9
+       lxvd2x 0,6,7
        li 7,32
-       stxvd2x 34,6,9
+       lxvd2x 7,6,8
        li 8,48
-       stxvd2x 35,6,10
-       li 10,192
-       stxvd2x 36,6,10
-       li 10,208
-       stxvd2x 37,6,10
-       lfd 12,-128(1)
-       lxvd2x 0,6,9
-       li 9,96
+       lxvd2x 9,6,10
        li 10,64
-       stfd 12,-184(1)
-       lfd 12,-104(1)
-       xxpermdi 0,0,0,3
-       stfd 0,-144(1)
-       stfd 12,-192(1)
-       lfd 12,-112(1)
-       stfd 12,-168(1)
-       lfd 12,-88(1)
-       lxvd2x 10,6,9
-       li 9,112
-       stxvd2x 10,6,7
-       stfd 12,-176(1)
-       lfd 12,-96(1)
-       stfd 12,-152(1)
-       lfd 12,-72(1)
        lxvd2x 11,6,9
-       li 9,128
-       lxvd2x 34,6,7
-       stxvd2x 11,6,8
-       stfd 12,-160(1)
-       lfd 12,-80(1)
-       xxpermdi 34,34,34,2
-       stfd 12,-136(1)
-       lxvd2x 12,6,9
-       li 9,144
-       lxvd2x 35,6,8
-       stxvd2x 12,6,10
-       lxvd2x 0,6,9
        li 9,80
-       xxpermdi 35,35,35,2
+       fmr 6,0
+       xxpermdi 0,0,0,3
+       fmr 8,7
+       xxpermdi 7,7,7,3
+       fmr 10,9
+       xxpermdi 9,9,9,3
+       fmr 12,11
+       xxpermdi 11,11,11,3
+       xxpermdi 6,6,32,1
+       xxpermdi 8,8,32,1
+       xxpermdi 10,10,32,1
+       xxpermdi 7,6,7,0
+       xxpermdi 12,12,32,1
+       xxpermdi 9,8,9,0
+       xxpermdi 11,10,11,0
+       xxpermdi 8,7,7,2
+       xxpermdi 0,12,0,0
+       xxpermdi 10,9,9,2
+       stxvd2x 8,6,7
+       xxpermdi 12,11,11,2
+       stxvd2x 10,6,8
+       xxpermdi 0,0,0,2
+       stxvd2x 12,6,10
        stxvd2x 0,6,9
+       lxvd2x 34,6,7
+       lxvd2x 35,6,8
        lxvd2x 36,6,10
        lxvd2x 37,6,9
+       xxpermdi 35,35,35,2
+       xxpermdi 34,34,34,2
        xxpermdi 36,36,36,2
        xxpermdi 37,37,37,2
        blr

 t.s | 80 ++++++++++++++++++++++++++++++++++---------------------------------
 1 file changed, 40 insertions(+), 40 deletions(-)

Generally the GIMPLE phase not knowing about calling conventions has issues
but I don't see an easy way out here other than lowering things much much
earlier ... (with its own downside, esp. considering the "cross-target"
games we play for offloading via LTO).

That said, somebody needs to sit down and see where the copy to memory is
generated and why.
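
For readers without the original attachment, the shape of code being discussed is probably something like the sketch below. The reporter's test case is not quoted in this comment, so this is a guess and not the actual source. The struct tag `vecd8`, the `v2df` typedef and the function body are assumptions. Only the field names `vx0`..`vx3`, the 16-byte spacing and the label `test_vecd8_rotate_left` come from the GIMPLE and assembler above. It relies on the GCC vector extension (`vector_size`).

```c
/* Hypothetical stand-in for the reporter's type: four 16-byte vectors of
   two doubles each, giving the 0/16/32/48-byte offsets seen in the GIMPLE.  */
typedef double v2df __attribute__ ((vector_size (16)));

struct vecd8
{
  v2df vx0, vx1, vx2, vx3;
};

/* The aggregate is passed by value.  In GIMPLE the parameter "a" is
   treated as memory, so SRA scalarizes it into loads from &a, &a + 16B,
   and so on, even when the ABI passes the four vectors in registers.
   The body (rotate the eight doubles left by one element) is an assumed
   example, not the code from the bug report.  */
struct vecd8
test_vecd8_rotate_left (struct vecd8 a)
{
  struct vecd8 r;
  r.vx0 = (v2df) { a.vx0[1], a.vx1[0] };
  r.vx1 = (v2df) { a.vx1[1], a.vx2[0] };
  r.vx2 = (v2df) { a.vx2[1], a.vx3[0] };
  r.vx3 = (v2df) { a.vx3[1], a.vx0[0] };
  return r;
}
```

Compiling such a file with `-O2 -S`, once with and once without `-fno-tree-sra`, and diffing the two outputs is how a comparison like the one above would be produced. The exact instructions depend on the target and compiler version.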