Hello!

Consider this simple testcase:

#define N 16

short ia[N];
short ic[N] = {0,3,6,9,12,15,18,21,24,27,30,33,36,39,42,45};
short ib[N] = {0,3,6,9,12,15,18,21,24,27,30,33,36,39,42,45};


int main ()
{
  int i;

  for (i = 0; i < N; i++)
    ia[i] = ib[i] + ic[i];

  return 0;
}

The loop in this testcase is compiled with 'gcc -O2 -ftree-vectorize -msse2' 
into:

.L2:
        movdqa  ib(%eax), %xmm0
        paddw   ic(%eax), %xmm0
        incl    %edx
        movdqa  %xmm0, ia(%eax)
        addl    $16, %eax
        cmpl    $2, %edx
        jne     .L2

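There is no (,%reg,16) SIB addressing mode available on i386 (the scale factor 
is limited to 1, 2, 4 or 8), and it looks to me as if the loop optimizer falls 
back to the simplest addressing mode in this case. Unfortunately, the %edx 
register is wasted in the code above: it only counts iterations, although %eax 
could be used for the exit test as well.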

Better code would be:

.L2:
        movdqa  ib(,%eax,8), %xmm0
        paddw   ic(,%eax,8), %xmm0
        movdqa  %xmm0, ia(,%eax,8)
        addl    $2, %eax
        cmpl    $4, %eax
        jne     .L2

or with the simplest addressing scheme:

.L2:
        movdqa  ib(%eax), %xmm0
        paddw   ic(%eax), %xmm0
        movdqa  %xmm0, ia(%eax)
        addl    $16, %eax
        cmpl    $32, %eax
        jne     .L2
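
For illustration only, the single-counter form can also be written at the C 
level. This is a minimal sketch, assuming GCC's vector_size extension; the 
type name v8hi and the function name add_single_counter are made up for this 
example and are not taken from the compiler's output:

```c
#include <string.h>

#define N 16

/* Eight 16-bit lanes = 16 bytes, the width of one SSE2 register.
   vector_size is a GCC extension (assumed available here).  */
typedef short v8hi __attribute__ ((vector_size (16)));

short ia[N];
short ic[N] = {0,3,6,9,12,15,18,21,24,27,30,33,36,39,42,45};
short ib[N] = {0,3,6,9,12,15,18,21,24,27,30,33,36,39,42,45};

/* Hypothetical helper: a single byte offset is used both to address the
   arrays and to decide when the loop ends.  */
void
add_single_counter (void)
{
  size_t off;

  for (off = 0; off < sizeof ia; off += sizeof (v8hi))
    {
      v8hi a, b, c;

      /* memcpy keeps the accesses free of alignment and aliasing
         assumptions; it stands in for the vector loads and the store.  */
      memcpy (&b, (const char *) ib + off, sizeof b);
      memcpy (&c, (const char *) ic + off, sizeof c);
      a = b + c;                /* element-wise 16-bit add */
      memcpy ((char *) ia + off, &a, sizeof a);
    }
}
```

Here off plays the role of %eax in the last listing: it advances by 16 bytes 
per iteration and is compared against the 32-byte size of the array, so no 
separate iteration counter is needed.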

Uros.

-- 
           Summary: A register is wasted in simple vectorised loops
           Product: gcc
           Version: 4.1.0
            Status: UNCONFIRMED
          Keywords: missed-optimization
          Severity: normal
          Priority: P2
         Component: target
        AssignedTo: unassigned at gcc dot gnu dot org
        ReportedBy: uros at kss-loka dot si
                CC: gcc-bugs at gcc dot gnu dot org
 GCC build triplet: i686-pc-linux-gnu
  GCC host triplet: i686-pc-linux-gnu
GCC target triplet: i686-pc-linux-gnu


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=22497
