> Hi!
>
> On mainline we now use loop versioning and peeling for alignment
> for the following loop (-march=pentium4):
>

We don't yet use loop versioning in the vectorizer on mainline (we do in
autovect); we do apply peeling.
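
For reference, the two transformations look roughly like the hand-written
C sketch below; the alignment tests and the peel computation are
illustrative only, not the code the vectorizer actually generates:

/* peeling for alignment (sketch): peel scalar iterations until 'a'
   reaches a 16-byte boundary, then run the vector loop on the aligned
   remainder.  */
void foo3_peeled (float * __restrict__ a, float * __restrict__ b,
                  float * __restrict__ c)
{
  int i = 0;
  int peel = (int) ((4 - (((unsigned long) a >> 2) & 3)) & 3);
  for (; i < peel && i < 4; ++i)   /* scalar prologue */
    a[i] = b[i] + c[i];
  for (; i < 4; ++i)               /* vectorizable: 'a' now aligned */
    a[i] = b[i] + c[i];
}

/* loop versioning (sketch): test alignment at run time and branch to
   either a vectorizable copy or the original scalar loop.  */
void foo3_versioned (float * __restrict__ a, float * __restrict__ b,
                     float * __restrict__ c)
{
  int i;
  if ((((unsigned long) a | (unsigned long) b
        | (unsigned long) c) & 15) == 0)
    for (i = 0; i < 4; ++i)        /* all aligned: vectorizable copy */
      a[i] = b[i] + c[i];
  else
    for (i = 0; i < 4; ++i)        /* unaligned: scalar fallback */
      a[i] = b[i] + c[i];
}

Note that peeling can align only one of the three accesses - here the
store to 'a', hence the aligned movaps store in the asm you quote below,
while the b and c loads stay potentially unaligned, hence the
movlps/movhps pairs.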

> void foo3(float * __restrict__ a, float * __restrict__ b,
>      float * __restrict__ c)
> {
>         int i;
>         for (i=0; i<4; ++i)
>                 a[i] = b[i] + c[i];
> }
>
> which results only in slower and larger code.  I also cannot
> see why we zero the xmm registers before loading and why we
> load them in separate high/low halves:
>
> .L13:
>         xorps   %xmm1, %xmm1
>         movlps  (%edx,%esi), %xmm1
>         movhps  8(%edx,%esi), %xmm1
>         xorps   %xmm0, %xmm0
>         movlps  (%edx,%ebx), %xmm0
>         movhps  8(%edx,%ebx), %xmm0
>         addps   %xmm0, %xmm1
>         movaps  %xmm1, (%edx,%eax)
>         addl    $1, %ecx
>         addl    $16, %edx
>         cmpl    %ecx, -16(%ebp)
>         ja      .L13
>
>
> but the point is, there is nothing to gain by vectorizing the loop
> in the first place if we do not know the alignment beforehand.
>

The vectorizer is currently greedy - it vectorizes whatever it can, with
no cost considerations applied yet. Since it is not on by default at any
optimization level, and is relatively new and requires as much testing as
possible, this seemed like a reasonable approach.
Indeed, as we handle more and more cases (unknown loop bound,
misalignment) and introduce more and more overheads, it is becoming
imperative to consider cost and size tradeoffs. (It's also on the
vectorizer wish-list -
http://gcc.gnu.org/projects/tree-ssa/vectorization.html#vec_todo).
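
As a concrete illustration of the tradeoff, here is a minimal sketch of
the kind of profitability test a cost model might apply - the function
name, parameters and cost units below are hypothetical, not existing
vectorizer code:

int
profitable_to_vectorize_p (int niters, int vf, int scalar_iter_cost,
                           int vector_iter_cost, int outside_cost)
{
  /* total cost of the original scalar loop.  */
  int scalar_total = niters * scalar_iter_cost;
  /* vector loop, plus scalar epilogue for the remainder iterations,
     plus peeling/versioning/setup overhead outside the loop.  */
  int vector_total = (niters / vf) * vector_iter_cost
                     + (niters % vf) * scalar_iter_cost
                     + outside_cost;
  return vector_total < scalar_total;
}

For Richard's foo3 above, niters = 4 and vf = 4, so a single vector
iteration has to pay back the entire peeling/setup overhead; with any
realistic outside_cost the scalar loop wins, which matches his
observation.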

dorit

> Richard.
>
