https://gcc.gnu.org/bugzilla/show_bug.cgi?id=69274
Richard Biener <rguenth at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
           Keywords|                            |ra
                 CC|                            |vmakarov at gcc dot gnu.org

--- Comment #4 from Richard Biener <rguenth at gcc dot gnu.org> ---
Ok, I can confirm the bisection to r231814. Differences are RA / scheduling
differences like (just the first one), +++ is good, --- is bad:

--- 3dview.s    2016-02-04 15:53:21.906672969 +0100
+++ ../build_peak_amd64-m64-gcc42-nn.0000/3dview.s      2016-02-04 15:53:40.515756755 +0100
@@ -29,14 +29,14 @@
        setnb   %al
        orb     %al, %cl
        je      .L2
-       vbroadcastss    8(%rsi), %xmm1
-       vmulps  32(%rdi), %xmm1, %xmm1
-       vbroadcastss    4(%rsi), %xmm0
+       vbroadcastss    8(%rsi), %xmm0
+       vmulps  32(%rdi), %xmm0, %xmm1
        vmovups 48(%rdi), %xmm2
-       vfmadd231ps     16(%rdi), %xmm0, %xmm1
-       vbroadcastss    (%rsi), %xmm0
-       vfmadd132ps     (%rdi), %xmm2, %xmm0
-       vaddps  %xmm0, %xmm1, %xmm0
+       vbroadcastss    4(%rsi), %xmm0
+       vfmadd132ps     16(%rdi), %xmm1, %xmm0
+       vbroadcastss    (%rsi), %xmm1
+       vfmadd132ps     (%rdi), %xmm2, %xmm1
+       vaddps  %xmm1, %xmm0, %xmm0
        vmovups %xmm0, (%rdx)
        ret

It's differences all over the place, so profiling is needed here. Will try
to get some data on that.

IRA dump differences are

@@ -578,26 +578,26 @@
   cp0:a21(r195)<->a22(r196)@5:shuffle
   cp1:a20(r197)<->a21(r195)@5:shuffle
-  cp2:a18(r199)<->a19(r198)@5:shuffle
-  cp3:a18(r199)<->a20(r197)@5:shuffle
+  cp2:a18(r199)<->a20(r197)@5:shuffle
+  cp3:a18(r199)<->a19(r198)@5:shuffle
   cp4:a16(r200)<->a17(r201)@5:shuffle
   cp5:a15(r202)<->a16(r200)@5:shuffle
-  cp6:a13(r204)<->a14(r203)@5:shuffle
-  cp7:a13(r204)<->a15(r202)@5:shuffle
+  cp6:a13(r204)<->a15(r202)@5:shuffle
+  cp7:a13(r204)<->a14(r203)@5:shuffle

that doesn't look like useful information to me.
Maybe

-      Forming thread by copy 14:a1r214-a2r213 (freq=5):
-        Result (freq=160): a1r214(80) a2r213(80)
-      Pushing a18(r199,l0)(cost 0)
+      Forming thread by copy 14:a1r214-a3r212 (freq=5):
+        Result (freq=320): a1r214(80) a3r212(80) a6r210(80) a7r211(80)
       Pushing a19(r198,l0)(cost 0)
-      Pushing a13(r204,l0)(cost 0)
       Pushing a14(r203,l0)(cost 0)
-      Pushing a8(r209,l0)(cost 0)
       Pushing a9(r208,l0)(cost 0)
-      Pushing a1(r214,l0)(cost 0)
       Pushing a2(r213,l0)(cost 0)
       Pushing a21(r195,l0)(cost 0)
+      Pushing a18(r199,l0)(cost 0)
       Pushing a22(r196,l0)(cost 0)
       Pushing a20(r197,l0)(cost 0)
       Pushing a16(r200,l0)(cost 0)
+      Pushing a13(r204,l0)(cost 0)

which looks like spurious ordering differences of same-cost stuff?

Completely mysterious why the patch causes so many differences in RA. But
the resulting scheduling differences can explain the result (I really
suspect just one "unlucky" loop here, will try to track that down now).
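
To illustrate how "ordering differences of same-cost stuff" can arise, here is a toy Python sketch. It is not IRA's actual algorithm: `push_order` and both input lists are hypothetical, and only the allocno names and the cost of 0 are borrowed from the dump above. The point is that when every candidate has the same cost, a stable sort by cost decides nothing, so the push order simply mirrors whatever order the candidates arrived in.

```python
# Toy model, NOT IRA's real implementation: it only shows that equal-cost
# items keep their input order under a stable sort, so any upstream
# reordering shows up unchanged in the output.

def push_order(allocnos):
    """Return allocno names sorted by cost; ties keep their input order."""
    return [name for name, cost in sorted(allocnos, key=lambda a: a[1])]

# Two hypothetical arrival orders for the same set of cost-0 allocnos.
before = [("a18", 0), ("a19", 0), ("a13", 0), ("a14", 0)]
after = [("a19", 0), ("a14", 0), ("a18", 0), ("a13", 0)]

order_before = push_order(before)
order_after = push_order(after)

# Same allocnos are pushed in both runs, but not in the same sequence.
same_set = sorted(order_before) == sorted(order_after)
same_order = order_before == order_after
print(same_set, same_order)
```

In this model the cost function cannot explain the difference between the two runs; only the input order can. That would be consistent with the suspicion above that the patch perturbs something upstream of the push order (for instance the thread formation), though the sketch does not establish that.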