Hi Folks,

GCC 4.5.1 20100924 "-Os -minline-all-stringops"  on Core i7

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int
main( int argc, char *argv[] )
{
  int i, a[256], b[256];

  for( i = 0; i < 256; ++i )  // discourage optimization
        a[i] = rand();

  memcpy( b, a, argc * sizeof(int) );

  printf( "%d\n", b[rand() % 256] );  // discourage optimization; index kept in bounds

  return 0;
}

I wonder if it's possible to improve the code generation for inline
stringops when the length is known to be a multiple of 4 bytes?

That is, instead of:

        movsx   rcx, ebp    # argc
        sal rcx, 2
        rep movsb

it would be nice to see:

        movsx   rcx, ebp    # argc
        rep movsd

Note that memcpy( b, a, 1024 ) generates:

        mov ecx, 256
        rep movsd
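
To make the equivalence concrete, here is a rough C model of the two
sequences. It is only a sketch: copy_bytes, copy_dwords and same_result are
hypothetical names invented for illustration, the loops merely stand in for
"rep movsb" and "rep movsd", and it assumes 4-byte elements and that n * 4
does not wrap.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Model of "rep movsb": copy `bytes` bytes, one byte per iteration. */
static void copy_bytes(unsigned char *dst, const unsigned char *src, size_t bytes)
{
    while (bytes--)
        *dst++ = *src++;
}

/* Model of "rep movsd": copy `dwords` 4-byte units, one unit per iteration. */
static void copy_dwords(unsigned char *dst, const unsigned char *src, size_t dwords)
{
    while (dwords--) {
        uint32_t unit;
        memcpy(&unit, src, sizeof unit);
        memcpy(dst, &unit, sizeof unit);
        src += sizeof unit;
        dst += sizeof unit;
    }
}

/* Hypothetical check: returns 1 if both models leave identical buffers
   when asked to copy n 4-byte elements (n <= 256). */
static int same_result(size_t n)
{
    uint32_t src[256], via_bytes[256] = {0}, via_dwords[256] = {0};
    size_t i;

    assert(n <= 256);
    for (i = 0; i < 256; ++i)
        src[i] = (uint32_t)(i * 2654435761u);   /* arbitrary test pattern */

    /* Byte count n * 4 with the byte model, element count n with the dword model. */
    copy_bytes((unsigned char *)via_bytes, (const unsigned char *)src,
               n * sizeof(uint32_t));
    copy_dwords((unsigned char *)via_dwords, (const unsigned char *)src, n);

    return memcmp(via_bytes, via_dwords, sizeof via_bytes) == 0;
}
```

Calling same_result( n ) for any n up to 256 should return 1, which is the
sense in which the shift by 2 followed by a byte copy looks redundant.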

The reason I think this might be possible is this:

Use -mstringop-strategy=rep_4byte to force the use of movsd.

For memcpy( b, a, argc * sizeof(int) ) we get:

        movsx   rcx, ebp    # argc
        sal rcx, 2
        cmp rcx, 4
        jb  .L5 #,
        shr rcx, 2
        rep movsd
.L5:


For memcpy( b, a, argc ) we get:

        movsx   rax, ebp    # argc, argc
        mov rdi, rsp    # tmp76,
        lea rsi, [rsp+1024] # tmp77,
        cmp rax, 4  # argc,
        jb  .L3 #,
        mov rcx, rax    # tmp78, argc
        shr rcx, 2  # tmp78,
        rep movsd
.L3:
        xor edx, edx    # tmp80
        test    al, 2   # argc,
        je  .L4 #,
        mov dx, WORD PTR [rsi]  # tmp82,
        mov WORD PTR [rdi], dx  #, tmp82
        mov edx, 2  # tmp80,
.L4:
        test    al, 1   # argc,
        je  .L5 #,
        mov al, BYTE PTR [rsi+rdx]  # tmp85,
        mov BYTE PTR [rdi+rdx], al  #, tmp85
.L5:

In the former case (* sizeof(int)) gcc has omitted all the code to deal with
the 1, 2, and 3 trailing bytes, so the stringop code generation has apparently
spotted that the length is a multiple of 4 bytes.
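
As a rough illustration of the difference between the two listings, the
following C sketch models the general expansion (dword copy plus 2-byte and
1-byte fix-ups) next to the reduced one. The function names are hypothetical,
and memcpy merely stands in for "rep movsd"; this is a model of the listings
above, not of GCC's internals.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Model of the general rep_4byte expansion: dwords first, then a 2-byte
   and a 1-byte fix-up selected by the low bits of n. */
static void copy_general(unsigned char *dst, const unsigned char *src, size_t n)
{
    size_t off = 0;                  /* mirrors edx in the listing */

    if (n >= 4) {                    /* cmp / jb */
        size_t head = (n >> 2) * 4;  /* bytes moved by rep movsd */
        memcpy(dst, src, head);      /* stands in for rep movsd */
        dst += head;                 /* rep movsd advances rdi and rsi */
        src += head;
    }
    if (n & 2) {                     /* test al, 2 */
        memcpy(dst, src, 2);
        off = 2;
    }
    if (n & 1)                       /* test al, 1 */
        dst[off] = src[off];
}

/* Model of the expansion when n is known to be a multiple of 4:
   both fix-ups disappear. */
static void copy_multiple_of_4(unsigned char *dst, const unsigned char *src, size_t n)
{
    assert((n & 3) == 0);
    if (n >= 4)
        memcpy(dst, src, (n >> 2) * 4);
}

/* Hypothetical check: copy_general copies exactly n bytes (n < 32). */
static int copies_exactly(size_t n)
{
    unsigned char src[32], dst[32];
    size_t i;

    assert(n < sizeof dst);
    for (i = 0; i < sizeof src; ++i) {
        src[i] = (unsigned char)(i + 1);   /* never zero */
        dst[i] = 0;
    }
    copy_general(dst, src, n);
    return memcmp(dst, src, n) == 0 && dst[n] == 0;
}

/* Hypothetical check: both models agree when n is a multiple of 4 (n <= 32). */
static int agrees_on_multiple(size_t n)
{
    unsigned char src[32], d1[32] = {0}, d2[32] = {0};
    size_t i;

    assert(n <= sizeof src);
    for (i = 0; i < sizeof src; ++i)
        src[i] = (unsigned char)(i + 1);
    copy_general(d1, src, n);
    copy_multiple_of_4(d2, src, n);
    return memcmp(d1, d2, sizeof d1) == 0;
}
```

For lengths such as 0, 4, 8, ... the two models should produce the same
bytes, which matches the listing where only the cmp/jb guard and the dword
copy remain.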

I can see that the code which evaluates the length expression is generated
separately from the stringop expansion, which would explain the leftover
sal/shr pair, though it does do the right thing with a literal length.

Incidentally, for the second case, memcpy( b, a, argc ), the Visual Studio
compiler generates code like this:

        mov eax, ecx
        shr ecx, 2
        rep movsd
        mov ecx, eax
        and ecx, 3
        rep movsb

which seems cleaner (no jumps) than the GCC code, though knowing GCC there is
probably a good reason for its choice as it generally seems to have a far more
sophisticated optimizer.
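
For comparison, the Visual Studio sequence can be modelled in C roughly as
below. Again this is only a sketch: copy_two_reps and two_reps_ok are
hypothetical names, and the two loops stand in for "rep movsd" and
"rep movsb".

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Model of the two-"rep" sequence: n >> 2 dwords, then n & 3 bytes.
   There is no branch on the length apart from the loop counters. */
static void copy_two_reps(unsigned char *dst, const unsigned char *src, size_t n)
{
    size_t head = (n >> 2) * 4;  /* shr ecx, 2 ; rep movsd */
    size_t tail = n & 3;         /* and ecx, 3 ; rep movsb */
    size_t i;

    for (i = 0; i < head; ++i)   /* stands in for rep movsd */
        dst[i] = src[i];
    for (i = 0; i < tail; ++i)   /* stands in for rep movsb */
        dst[head + i] = src[head + i];
}

/* Hypothetical check: exactly n bytes are copied (n < 32). */
static int two_reps_ok(size_t n)
{
    unsigned char src[32], dst[32];
    size_t i;

    assert(n < sizeof dst);
    for (i = 0; i < sizeof src; ++i) {
        src[i] = (unsigned char)(i + 1);   /* never zero */
        dst[i] = 0;
    }
    copy_two_reps(dst, src, n);
    return memcmp(dst, src, n) == 0 && dst[n] == 0;
}
```

When n is a multiple of 4 the tail count is 0, so the second loop does
nothing; a compiler that knows this could presumably drop the second rep
altogether.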

Best regards,
Jeremy
