Hi Folks,

GCC 4.5.1 20100924 "-Os -minline-all-stringops" on Core i7

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    int main( int argc, char *argv[] )
    {
        int i, a[256], b[256];

        for( i = 0; i < 256; ++i )    // discourage optimization
            a[i] = rand();

        memcpy( b, a, argc * sizeof(int) );

        printf( "%d\n", b[rand()] );  // discourage optimization

        return 0;
    }

I wonder if it's possible to improve the code generation for inline
stringops when the length is known to be a multiple of 4 bytes?

That is, instead of:

    movsx   rcx, ebp        # argc
    sal     rcx, 2
    rep movsb

it would be nice to see:

    movsx   rcx, ebp        # argc
    rep movsd

Note that memcpy( b, a, 1024 ) generates:

    mov     ecx, 256
    rep movsd

The reason I think this might be possible is this:-

Use -mstringop-strategy=rep_4byte to force the use of movsd.

For memcpy( b, a, argc * sizeof(int) ) we get:

    movsx   rcx, ebp        # argc
    sal     rcx, 2
    cmp     rcx, 4
    jb      .L5     #,
    shr     rcx, 2
    rep movsd
.L5:

For memcpy( b, a, argc ) we get:

    movsx   rax, ebp        # argc, argc
    mov     rdi, rsp        # tmp76,
    lea     rsi, [rsp+1024] # tmp77,
    cmp     rax, 4  # argc,
    jb      .L3     #,
    mov     rcx, rax        # tmp78, argc
    shr     rcx, 2  # tmp78,
    rep movsd
.L3:
    xor     edx, edx        # tmp80
    test    al, 2   # argc,
    je      .L4     #,
    mov     dx, WORD PTR [rsi]      # tmp82,
    mov     WORD PTR [rdi], dx      #, tmp82
    mov     edx, 2  # tmp80,
.L4:
    test    al, 1   # argc,
    je      .L5     #,
    mov     al, BYTE PTR [rsi+rdx]  # tmp85,
    mov     BYTE PTR [rdi+rdx], al  #, tmp85
.L5:

In the former case (* sizeof(int)) gcc has omitted all the code to deal
with 1, 2, and 3 bytes, so the stringop code generation has apparently
spotted that the length is a multiple of 4 bytes.

I can see that the expression code for the length is separate from the
stringop stuff, though it does do the right thing with a literal.

Incidentally, for the second case, memcpy( b, a, argc ), the Visual
Studio compiler generates code like this:

    mov     eax, ecx
    shr     ecx, 2
    rep movsd
    mov     ecx, eax
    and     ecx, 3
    rep movsb

which seems cleaner (no jumps) than the GCC code, though knowing GCC
there is probably a good reason for its choice, as it generally seems to
have a far more sophisticated optimizer.

Best regards,
Jeremy
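
P.S. For anyone who prefers C to assembly, here is a rough sketch of the
split that the "shr 2 / rep movsd / and 3 / rep movsb" sequence performs,
and of why the byte tail is dead code when the length is count * 4. The
names copy_dwords_then_bytes and tail_bytes are made up for illustration;
this is not how GCC or Visual Studio implement memcpy internally.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: number of bytes left for the byte loop,
   i.e. the "and ecx, 3" step. */
static size_t tail_bytes(size_t n)
{
    return n & 3;
}

/* Hypothetical helper: copy n bytes as 4-byte units first, then
   finish with the 0..3 remaining bytes, one at a time. */
static void copy_dwords_then_bytes(void *dst, const void *src, size_t n)
{
    unsigned char *d = dst;
    const unsigned char *s = src;
    size_t dwords = n >> 2;        /* count for the "rep movsd" part */
    size_t tail = tail_bytes(n);   /* count for the "rep movsb" part */
    size_t i;

    assert(dst != NULL && src != NULL);

    for (i = 0; i < dwords; ++i) {
        memcpy(d, s, 4);           /* one 4-byte move */
        d += 4;
        s += 4;
    }
    for (i = 0; i < tail; ++i)     /* never runs when n is a multiple of 4 */
        *d++ = *s++;
}
```

When n is computed as count * 4, the low two bits of n are always zero,
so tail_bytes(n) is 0 and only the 4-byte loop does any work. That is
the property the request above hopes the stringop expansion can exploit
to drop the shift as well.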