the following two functions are equivalent: each adds a scalar to a vector of
floats, with the loop manually unrolled by 8 elements (2 sse vectors per
iteration).

the first function performs the two load/add/store operations one after the
other, while the second function interleaves the instructions of the two
operations:

#include <xmmintrin.h>  /* sse intrinsics: __m128, _mm_load_ps, ... */

void bench_3(float * out, float * in, float f, unsigned int n)
{
    n /= 8;
    __m128 scalar = _mm_set_ps1(f);
    do
    {
        __m128 arg = _mm_load_ps(in);
        __m128 result = _mm_add_ps(arg, scalar);
        _mm_store_ps(out, result);

        arg = _mm_load_ps(in+4);
        result = _mm_add_ps(arg, scalar);
        _mm_store_ps(out+4, result);
        in += 8;
        out += 8;
    }
    while (--n);
}

with the generated code:
.L13:
        movaps  (%rsi,%rax), %xmm0
        addps   %xmm1, %xmm0
        movaps  %xmm0, (%rdi,%rax)
        movaps  16(%rsi,%rax), %xmm0
        addps   %xmm1, %xmm0
        movaps  %xmm0, 16(%rdi,%rax)
        addq    $32, %rax
        cmpq    %rdx, %rax
        jne     .L13


void bench_4(float * out, float * in, float f, unsigned int n)
{
    n /= 8;
    __m128 scalar = _mm_set_ps1(f);
    do
    {
        __m128 arg  = _mm_load_ps(in);
        __m128 arg2 = _mm_load_ps(in+4);
        __m128 result  = _mm_add_ps(arg, scalar);
        __m128 result2 = _mm_add_ps(arg2, scalar);
        _mm_store_ps(out, result);
        _mm_store_ps(out+4, result2);
        in += 8;
        out += 8;
    }
    while (--n);
}

generated code:
.L9:
        movaps  (%rsi,%rax), %xmm0
        movaps  16(%rsi,%rax), %xmm1
        addps   %xmm2, %xmm0
        addps   %xmm2, %xmm1
        movaps  %xmm0, (%rdi,%rax)
        movaps  %xmm1, 16(%rdi,%rax)
        addq    $32, %rax
        cmpq    %rdx, %rax
        jne     .L9

the interleaved code outperforms the sequential code by about 12% on
x86_64/core2, possibly because the two load/add/store groups don't have any
data dependencies on each other.
it would be nice if gcc could rename registers and reorder instructions in the
first function, so that it generates the same instructions as for the second
function.
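
for anyone who wants to confirm that the two variants really are equivalent
before comparing timings, here is a minimal sketch of a check. bench_3 and
bench_4 are copied from above; outputs_match and MAX_N are hypothetical names
introduced only for this example, and the check assumes an x86 target with sse
and a c11 compiler (for _Alignas):

```c
#include <assert.h>
#include <xmmintrin.h>  /* sse intrinsics: __m128, _mm_load_ps, ... */

enum { MAX_N = 64 };    /* hypothetical buffer size for this check */

/* serialized variant, copied from bench_3 above */
static void bench_3(float * out, float * in, float f, unsigned int n)
{
    n /= 8;
    __m128 scalar = _mm_set_ps1(f);
    do
    {
        __m128 arg = _mm_load_ps(in);
        __m128 result = _mm_add_ps(arg, scalar);
        _mm_store_ps(out, result);

        arg = _mm_load_ps(in+4);
        result = _mm_add_ps(arg, scalar);
        _mm_store_ps(out+4, result);
        in += 8;
        out += 8;
    }
    while (--n);
}

/* interleaved variant, copied from bench_4 above */
static void bench_4(float * out, float * in, float f, unsigned int n)
{
    n /= 8;
    __m128 scalar = _mm_set_ps1(f);
    do
    {
        __m128 arg  = _mm_load_ps(in);
        __m128 arg2 = _mm_load_ps(in+4);
        __m128 result  = _mm_add_ps(arg, scalar);
        __m128 result2 = _mm_add_ps(arg2, scalar);
        _mm_store_ps(out, result);
        _mm_store_ps(out+4, result2);
        in += 8;
        out += 8;
    }
    while (--n);
}

/* hypothetical helper: returns 1 if both variants write identical results
   for the first n floats, 0 otherwise */
static int outputs_match(float f, unsigned int n)
{
    /* _mm_load_ps/_mm_store_ps need 16-byte aligned addresses */
    static _Alignas(16) float in[MAX_N];
    static _Alignas(16) float out3[MAX_N];
    static _Alignas(16) float out4[MAX_N];
    unsigned int i;

    /* both loops are do/while and consume 8 floats per iteration,
       so n has to be a nonzero multiple of 8 that fits the buffers */
    assert(n >= 8 && n % 8 == 0 && n <= MAX_N);

    for (i = 0; i != n; ++i)
        in[i] = 0.5f * (float)i;

    bench_3(out3, in, f, n);
    bench_4(out4, in, f, n);

    for (i = 0; i != n; ++i)
        if (out3[i] != out4[i])
            return 0;
    return 1;
}
```

calling outputs_match(1.5f, 64) from a small test program is one way to use
it. this only checks that the results agree; it does not measure the
performance difference reported above.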


-- 
           Summary: missed optimization: register renaming in unrolled loop
           Product: gcc
           Version: 4.4.0
            Status: UNCONFIRMED
          Severity: enhancement
          Priority: P3
         Component: target
        AssignedTo: unassigned at gcc dot gnu dot org
        ReportedBy: tim at klingt dot org


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=38825
