https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88440
Bug ID: 88440
Summary: size optimization of memcpy-like code
Product: gcc
Version: 8.2.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: tree-optimization
Assignee: unassigned at gcc dot gnu.org
Reporter: hoganmeier at gmail dot com
Target Milestone: ---
https://godbolt.org/z/RTji7B
void foo(char* restrict dst, const char* buf) {
    for (int i=0; i<8; ++i)
        *dst++ = *buf++;
}
$ gcc -Os
$ gcc -O2
.L2:
mov dl, BYTE PTR [rsi+rax]
mov BYTE PTR [rdi+rax], dl
inc rax
cmp rax, 8
jne .L2
$ gcc -O3
mov rax, QWORD PTR [rsi]
mov QWORD PTR [rdi], rax
$ arm-none-eabi-gcc -O3 -mthumb -mcpu=cortex-m4
ldr r3, [r1] @ unaligned
ldr r2, [r1, #4] @ unaligned
str r2, [r0, #4] @ unaligned
str r3, [r0] @ unaligned
The -O3 code is both faster and smaller than the -Os/-O2 loop, on both ARM and x86-64. At -O3 GCC reports:
"note: Loop 1 distributed: split to 0 loops and 1 library calls."
This transformation should be considered for -O2 and -Os as well.
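
For reference, the "1 library calls" in the note suggests that the loop was recognized as a memcpy pattern. The following is only a sketch of what that rewrite would look like at source level, not GCC's internal representation. The names foo_loop, foo_memcpy and foo_variants_agree are made up for illustration; foo_loop is the function from this report under a different name.

```c
#include <string.h>

/* The loop from the report: copies 8 bytes one at a time. */
void foo_loop(char *restrict dst, const char *buf) {
    for (int i = 0; i < 8; ++i)
        *dst++ = *buf++;
}

/* Hand-written equivalent (illustrative): a fixed-size memcpy,
 * which the compiler is free to expand inline into wide
 * loads and stores. */
void foo_memcpy(char *restrict dst, const char *buf) {
    memcpy(dst, buf, 8);
}

/* Illustrative helper: returns 1 if both variants copy the
 * sample input identically, 0 otherwise. */
int foo_variants_agree(void) {
    const char src[8] = {'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h'};
    char out_loop[8] = {0};
    char out_memcpy[8] = {0};

    foo_loop(out_loop, src);
    foo_memcpy(out_memcpy, src);

    return memcmp(out_loop, src, 8) == 0 &&
           memcmp(out_memcpy, src, 8) == 0;
}
```

A possible workaround until the default changes might be to pass -ftree-loop-distribute-patterns explicitly together with -Os or -O2. Whether that yields the same code as -O3 for this case has not been checked here.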