https://gcc.gnu.org/bugzilla/show_bug.cgi?id=69873
Richard Biener <rguenth at gcc dot gnu.org> changed:

           What    |Removed                       |Added
----------------------------------------------------------------------------
             Status|UNCONFIRMED                   |ASSIGNED
   Last reconfirmed|                              |2016-02-19
           Assignee|unassigned at gcc dot gnu.org |rguenth at gcc dot gnu.org
     Ever confirmed|0                             |1

--- Comment #1 from Richard Biener <rguenth at gcc dot gnu.org> ---
Created attachment 37740
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=37740&action=edit
patch

Patch fixing this.  Also fixing some inconsistencies in cost estimation which
unfortunately "improves" the threshold to > 6 iterations.

Issues with the cost model include the odd separation between taken vs.
not-taken branch cost (as opposed to fallthru, not fallthru or well predicted
vs. not predicted).  This artificially raises the cost of the vectorized path
(on x86_64 generic model, vectorized path has a "taken" branch of cost 3
while unvectorized has a "not taken" branch of cost 1).

Another issue is the prologue cost for emitting the vector double constant
{ 1., 1. }.  The scalar version also has a constant pool entry but we assume
none exist for the scalar path - reasonable only if insns with immediate
forms are available which IMHO is not reasonable to assume for FP modes.
This skews the cost by one.

If you disable cunroll then generated code is

        cmpl    $6, %edi
        jbe     .L3
        subl    $2, %edi
        movapd  .LC0(%rip), %xmm0
        shrl    %edi
        xorl    %eax, %eax
        leal    1(%rdi), %ecx
.L4:
        movq    %rax, %rdx
        addq    $1, %rax
        salq    $4, %rdx
        cmpl    %eax, %ecx
        movaps  %xmm0, a(%rdx)
        ja      .L4
        rep ret
.L3:
        movsd   .LC1(%rip), %xmm0
        xorl    %eax, %eax
.L6:
        movsd   %xmm0, a(,%rax,8)
        addq    $1, %rax
        cmpl    %eax, %edi
        ja      .L6
.L1:
        rep ret

so the stupid IV choice makes the runtime check reasonable.
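
For readers who prefer C to assembly, the control flow of the listing above can be read roughly as follows. This is only a sketch derived from the quoted instructions, not the testcase from the bug report: the function names `fill`, `fill_scalar` and `fill_pairs` and the array size `N` are made up for illustration. The quoted listing shows no scalar epilogue after the vector loop, so the sketch matches the scalar path only when n is even.

```c
#define N 1024              /* hypothetical size; not given in the report */
double a[N];

/* Scalar path (.L3/.L6): one 8-byte movsd store per iteration. */
static void fill_scalar(unsigned n)
{
    for (unsigned i = 0; i < n; ++i)
        a[i] = 1.0;
}

/* Vector path (.L4): one 16-byte movaps store, i.e. two doubles, per
   iteration.  The trip count ((n - 2) >> 1) + 1 mirrors the
   subl / shrl / leal sequence ahead of the loop. */
static void fill_pairs(unsigned n)
{
    unsigned iters = ((n - 2) >> 1) + 1;
    for (unsigned i = 0; i < iters; ++i) {
        a[2 * i]     = 1.0;
        a[2 * i + 1] = 1.0;
    }
}

/* Runtime profitability check: cmpl $6, %edi ; jbe .L3 */
void fill(unsigned n)
{
    if (n <= 6)
        fill_scalar(n);     /* at or below the threshold: stay scalar */
    else
        fill_pairs(n);      /* above the threshold: vectorized loop */
}
```

With this reading, `fill(4)` takes the scalar loop and `fill(8)` takes the paired loop with four iterations, which is the "> 6 iterations" threshold mentioned in the comment.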