https://gcc.gnu.org/bugzilla/show_bug.cgi?id=69873

Richard Biener <rguenth at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|UNCONFIRMED                 |ASSIGNED
   Last reconfirmed|                            |2016-02-19
           Assignee|unassigned at gcc dot gnu.org      |rguenth at gcc dot gnu.org
     Ever confirmed|0                           |1

--- Comment #1 from Richard Biener <rguenth at gcc dot gnu.org> ---
Created attachment 37740
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=37740&action=edit
patch

Patch fixing this.  Also fixing some inconsistencies in cost estimation which
unfortunately "improves" the threshold to > 6 iterations.
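
For reference, the loop in question has roughly the following shape - this is a
reconstruction from the assembly further down (a { 1., 1. } splat stored into a
global array of doubles), so the array size and names are guesses rather than
the actual testcase:

double a[1024];  /* size is a guess, not taken from the report */

void
foo (unsigned int n)
{
  unsigned int i;
  for (i = 0; i < n; i++)
    a[i] = 1.;
}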

Issues with the cost model include the odd separation between taken vs. not-taken
branch cost (as opposed to fallthru vs. not-fallthru, or well-predicted vs.
not-predicted).  This artificially raises the cost of the vectorized path (in the
x86_64 generic model the vectorized path has a "taken" branch of cost 3 while the
unvectorized path has a "not taken" branch of cost 1).
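
As a rough illustration of how that asymmetry feeds into the profitability
threshold: the sketch below is a simplification, not the actual vectorizer cost
computation, and the body/prologue costs are placeholders - only the branch
costs 3 and 1 are the generic-model numbers quoted above.

/* Illustrative only: smallest n for which n scalar iterations cost more
   than the corresponding n/2 vector (VF == 2) iterations plus the one-time
   prologue cost.  */
static int
break_even_iters (int scalar_body, int vector_body, int prologue)
{
  int scalar_iter = scalar_body + 1;   /* body plus "not taken" branch (1) */
  int vector_iter = vector_body + 3;   /* body plus "taken" branch (3)     */
  int n;
  for (n = 1; n < 1000; n++)
    if (n * scalar_iter > (n / 2) * vector_iter + prologue)
      return n;
  return -1;  /* never profitable under this toy model */
}

Charging 3 instead of 1 for the vector loop's backedge adds 2 per vector
iteration, which is what pushes the break-even point up.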

Another issue is the prologue cost for emitting the vector double constant
{ 1., 1. }.  The scalar version also has a constant pool entry, but we assume
none exists for the scalar path - reasonable only if insns with immediate forms
are available, which IMHO is not reasonable to assume for FP modes.  This skews
the cost by one.

If you disable cunroll then the generated code is:

        cmpl    $6, %edi
        jbe     .L3
        subl    $2, %edi
        movapd  .LC0(%rip), %xmm0
        shrl    %edi
        xorl    %eax, %eax
        leal    1(%rdi), %ecx
.L4:
        movq    %rax, %rdx
        addq    $1, %rax
        salq    $4, %rdx
        cmpl    %eax, %ecx
        movaps  %xmm0, a(%rdx)
        ja      .L4

        rep ret
.L3:
        movsd   .LC1(%rip), %xmm0
        xorl    %eax, %eax
.L6:
        movsd   %xmm0, a(,%rax,8)
        addq    $1, %rax
        cmpl    %eax, %edi
        ja      .L6
.L1:
        rep ret

so the stupid IV choice (shifting a separate counter into a byte offset every
iteration instead of just incrementing an index or pointer) makes the runtime
check reasonable.
