https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92716
            Bug ID: 92716
           Summary: -Os doesn't inline byteswap function even though it's a
                    single instruction
           Product: gcc
           Version: 8.3.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: c
          Assignee: unassigned at gcc dot gnu.org
          Reporter: jwerner at chromium dot org
  Target Milestone: ---

I compiled the following test code for both x86_64 and aarch64 on gcc 8.3.0:

static inline unsigned int byteswap(unsigned int x)
{
        return (((x >> 24) & 0xff) <<  0) |
               (((x >> 16) & 0xff) <<  8) |
               (((x >>  8) & 0xff) << 16) |
               (((x >>  0) & 0xff) << 24);
}

unsigned int test(unsigned int a, unsigned int b, unsigned int c)
{
        return byteswap(a) + byteswap(b) + byteswap(c);
}

On x86_64 I get:

0000000000000000 <byteswap> (File Offset: 0x40):
   0:   89 f8                   mov    %edi,%eax
   2:   0f c8                   bswap  %eax
   4:   c3                      retq

0000000000000005 <test> (File Offset: 0x45):
   5:   e8 f6 ff ff ff          callq  0 <byteswap> (File Offset: 0x40)
   a:   89 f7                   mov    %esi,%edi
   c:   89 c1                   mov    %eax,%ecx
   e:   e8 ed ff ff ff          callq  0 <byteswap> (File Offset: 0x40)
  13:   89 d7                   mov    %edx,%edi
  15:   01 c1                   add    %eax,%ecx
  17:   e8 e4 ff ff ff          callq  0 <byteswap> (File Offset: 0x40)
  1c:   01 c8                   add    %ecx,%eax
  1e:   c3                      retq

And on aarch64 I get:

0000000000000000 <byteswap> (File Offset: 0x40):
   0:   5ac00800        rev     w0, w0
   4:   d65f03c0        ret

0000000000000008 <test> (File Offset: 0x48):
   8:   a9bf7bfd        stp     x29, x30, [sp,#-16]!
   c:   910003fd        mov     x29, sp
  10:   97fffffc        bl      0 <byteswap> (File Offset: 0x40)
  14:   2a0003e3        mov     w3, w0
  18:   2a0103e0        mov     w0, w1
  1c:   97fffff9        bl      0 <byteswap> (File Offset: 0x40)
  20:   0b000063        add     w3, w3, w0
  24:   2a0203e0        mov     w0, w2
  28:   97fffff6        bl      0 <byteswap> (File Offset: 0x40)
  2c:   0b000060        add     w0, w3, w0
  30:   a8c17bfd        ldp     x29, x30, [sp],#16
  34:   d65f03c0        ret

So the good news is that GCC recognized this code as a byteswap function that can be implemented with a single instruction on both of these platforms.
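The bad news is that it then doesn't seem to realize that inlining this single instruction produces smaller code than wrapping it in a function and calling it, no matter how many times it is called: each call instruction is at least as large as the instruction it stands in for (5 bytes for callq versus 2 for bswap on x86_64, 4 bytes each for bl and rev on aarch64), and on aarch64 test additionally has to save and restore x29/x30 because it now makes calls. If I instead compile with -O2, the function is inlined as expected. (I also tried with clang 8.0.1, which manages to inline correctly even with -Os.)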
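
As a possible workaround sketch (not part of the original test case, and no code-size results are claimed for it), the inlining decision can be taken away from the -Os heuristics in two ways: by marking the function with GCC's always_inline attribute, or by calling __builtin_bswap32 directly at the call site. The name test_builtin is hypothetical, and the code assumes unsigned int is 32 bits wide, as it is on both targets above.

```c
#include <stdint.h>

/* Same shift-and-mask byteswap as in the report, plus GCC's
 * always_inline attribute so the inliner's size heuristics are bypassed. */
static inline __attribute__((always_inline))
unsigned int byteswap(unsigned int x)
{
        return (((x >> 24) & 0xff) <<  0) |
               (((x >> 16) & 0xff) <<  8) |
               (((x >>  8) & 0xff) << 16) |
               (((x >>  0) & 0xff) << 24);
}

unsigned int test(unsigned int a, unsigned int b, unsigned int c)
{
        return byteswap(a) + byteswap(b) + byteswap(c);
}

/* Alternative (hypothetical name): skip the helper and use the builtin at
 * each call site. Assumes unsigned int and uint32_t have the same width. */
unsigned int test_builtin(unsigned int a, unsigned int b, unsigned int c)
{
        return __builtin_bswap32((uint32_t)a) +
               __builtin_bswap32((uint32_t)b) +
               __builtin_bswap32((uint32_t)c);
}
```

Either variant only sidesteps the problem in this one function; it does not address why the -Os inliner considers the out-of-line call cheaper in the first place.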