https://gcc.gnu.org/bugzilla/show_bug.cgi?id=118331
Bug ID: 118331
Summary: Poor code when passing small structs around on 32-bit ARM
Product: gcc
Version: 15.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: c
Assignee: unassigned at gcc dot gnu.org
Reporter: david at westcontrol dot com
Target Milestone: ---

I have recently been looking at how small structs are passed around for the 32-bit ARM port (in particular, for Cortex-M devices). This all applies to C and C++, though small structs are more common in C++.

The Godbolt link to my test code is here, comparing gcc to clang:

<https://godbolt.org/z/aeKrcMb64>

(I've tried to write the code in a way that works for C and C++, in case it is helpful.)

One common missed optimisation that I see repeatedly is that gcc is making a stack frame unnecessarily when it is returning a small struct. For example, given:

    #include <stdint.h>

    typedef struct A2 {
        uint16_t a;
        uint16_t b;
    } A2;

    A2 makeA2(void) {
        A2 x = { 1, 0 };
        return x;
    }

gcc generates:

    makeA2:
        sub     sp, sp, #8
        movs    r0, #1
        add     sp, sp, #8
        bx      lr

The stack pointer manipulation is superfluous.

Even worse stack manipulations can occur when passing structs as parameters:

    #include <stdint.h>

    typedef struct B2 {
        uint32_t a;
        uint32_t b;
    } B2;

    B2 makeB2(void) {
        B2 x = { 1, 0 };
        return x;
    }

    void sinkB2(B2 x);

    void callB2() {
        B2 x = makeB2();
        sinkB2(x);
    }

gives this with gcc:

    callB2:
        sub     sp, sp, #8
        movs    r2, #1
        movs    r3, #0
        strd    r2, [sp]
        ldrd    r0, r1, [sp]
        add     sp, sp, #8
        b       sinkB2

and this with clang:

    callB2:
        movs    r0, #1
        movs    r1, #0
        b       sinkB2

gcc is making a stack frame, putting the data in registers, storing that on the stack, then loading it back into registers again!
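As a side note for anyone reproducing this, the sizes involved can be checked on any host. The following is only a minimal sketch: `a2_as_word` and `check_small_structs` are hypothetical helpers that are not part of the test code in the report. It confirms that A2 occupies one 32-bit word and B2 two, and shows that on a little-endian target the whole of A2 { 1, 0 } is the word value 1, which is consistent with the single `movs r0, #1` in the output above.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

typedef struct A2 { uint16_t a; uint16_t b; } A2;
typedef struct B2 { uint32_t a; uint32_t b; } B2;

A2 makeA2(void) { A2 x = { 1, 0 }; return x; }
B2 makeB2(void) { B2 x = { 1, 0 }; return x; }

/* Hypothetical helper (not in the report): copy the bytes of an A2 into
   one 32-bit word, i.e. the value a single core register would hold. */
uint32_t a2_as_word(A2 x)
{
    uint32_t w;
    memcpy(&w, &x, sizeof w);   /* sizeof(A2) == sizeof w is checked below */
    return w;
}

/* Hypothetical helper: sanity checks on the layout assumptions. */
void check_small_structs(void)
{
    assert(sizeof(A2) == sizeof(uint32_t));      /* fits in one register  */
    assert(sizeof(B2) == 2 * sizeof(uint32_t));  /* fits in two registers */

    /* Little-endian targets see 1; big-endian ones see 0x00010000. */
    uint32_t w = a2_as_word(makeA2());
    assert(w == 1u || w == 0x00010000u);
}
```

Calling `check_small_structs()` from a test harness should pass silently on common hosts. It says nothing about code quality on ARM, only that no memory is inherently needed to hold these values.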
Another example of strange code pessimisations came when I was trying to use vectors to get return values in four gpr registers (instead of the usual one or two):

    typedef uint32_t C1 __attribute__((vector_size(16)));

    C1 makeC1(void) {
        C1 x = { 1 };
        return x;
    }

    __attribute__((pcs("aapcs")))
    C1 makeC1b(void) {
        C1 x = { 1 };
        return x;
    }

gcc gives:

    makeC1:
        movs    r1, #0
        movs    r0, #1
        mov     r2, r1
        mov     r3, r1
        vmov    d0, r0, r1  @ int
        vmov    d1, r2, r3  @ int
        bx      lr

I have the vector registers enabled (with "-mcpu=cortex-m7 -mfloat-abi=hard -mfpu=fpv5-d16" - needed to get good hardware floating point on that target), so I think it is correct that the SIMD registers d0 and d1 are used here. But then it is unnecessary to put the data in r0:r3 as well.

With the "pcs" attribute to disable returning in SIMD registers, gcc returns the data in r0:r3 as expected. However, the code to do so is far from expected:

    makeC1b:
        push    {r4, r5, r6, r7}
        movs    r7, #0
        movs    r0, #1
        movs    r1, #0
        movs    r2, #0
        mov     r3, r7
        pop     {r4, r5, r6, r7}
        bx      lr

clang just loads r0-r3 with immediate values in both cases. I suspect that is incorrect according to the ABI for makeC1, even though it is actually nicer code.
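For reference, the value that makeC1 is expected to return can be checked independently of the target. This is a minimal sketch that assumes the GCC/clang vector extension (including vector subscripting). `upper_lanes_or` is a hypothetical helper that is not in the report, and the "pcs" variant is left out because that attribute is ARM-specific. It shows that the `{ 1 }` initialiser leaves lanes 1 to 3 zero, so the ideal result is just four immediate loads.

```c
#include <assert.h>
#include <stdint.h>

/* Same vector type as in the report: four 32-bit lanes, 16 bytes. */
typedef uint32_t C1 __attribute__((vector_size(16)));

C1 makeC1(void) { C1 x = { 1 }; return x; }

/* Hypothetical helper (not in the report): OR together lanes 1..3,
   which the partial initialiser above should have left as zero. */
uint32_t upper_lanes_or(C1 v)
{
    return v[1] | v[2] | v[3];
}
```

With these definitions, `makeC1()[0]` is 1 and `upper_lanes_or(makeC1())` is 0, matching the register contents in both listings above.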