https://gcc.gnu.org/bugzilla/show_bug.cgi?id=118331

            Bug ID: 118331
           Summary: Poor code when passing small structs around on 32-bit
                    ARM
           Product: gcc
           Version: 15.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: c
          Assignee: unassigned at gcc dot gnu.org
          Reporter: david at westcontrol dot com
  Target Milestone: ---

I have recently been looking at how small structs are passed around for the
32-bit ARM port (in particular, for Cortex-M devices).  This all applies to C
and C++, though small structs are more common in C++.  The Godbolt link to my
test code is here, comparing gcc to clang:

<https://godbolt.org/z/aeKrcMb64>

(I've tried to write the code in a way that works for C and C++, in case it is
helpful.)

One common missed optimisation that I see repeatedly is that gcc is making a
stack frame unnecessarily when it is returning a small struct.  For example,
given:

    #include <stdint.h>

    typedef struct A2 { uint16_t a; uint16_t b; } A2;
    A2 makeA2(void) { A2 x = { 1, 0 }; return x; }

gcc generates:

    makeA2:
        sub     sp, sp, #8
        movs    r0, #1
        add     sp, sp, #8
        bx      lr

The stack pointer manipulation is superfluous.
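
As a minimal sketch of why the single "movs r0, #1" is already the whole
return value (this is not part of the Godbolt test code): AAPCS returns a
composite of up to 4 bytes in r0, and A2 is exactly 4 bytes.  The helper
a2_as_word below is a hypothetical name introduced only for illustration, and
the value 1 assumes a little-endian target such as Cortex-M.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

typedef struct A2 { uint16_t a; uint16_t b; } A2;

A2 makeA2(void) { A2 x = { 1, 0 }; return x; }

/* Hypothetical helper: copy the struct's bytes into one 32-bit word,
   which is the value r0 would carry on return. */
uint32_t a2_as_word(A2 x)
{
    uint32_t w;
    assert(sizeof x == sizeof w);   /* A2 occupies exactly one word */
    memcpy(&w, &x, sizeof w);
    return w;                       /* 1 on a little-endian target */
}
```

Nothing here needs memory: the struct is a compile-time constant that fits in
one register, so no stack adjustment should be required.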

Even worse stack manipulations can occur when passing structs as parameters:

    #include <stdint.h>

    typedef struct B2 { uint32_t a; uint32_t b; } B2;
    B2 makeB2(void) { B2 x = { 1, 0 }; return x; }
    void sinkB2(B2 x);
    void callB2(void) { B2 x = makeB2(); sinkB2(x); }

gives this with gcc:

    callB2:
        sub     sp, sp, #8
        movs    r2, #1
        movs    r3, #0
        strd    r2, [sp]
        ldrd    r0, r1, [sp]
        add     sp, sp, #8
        b       sinkB2

and this with clang:

    callB2:
        movs    r0, #1
        movs    r1, #0
        b       sinkB2


gcc is making a stack frame, putting the data in registers, storing that on the
stack, then loading it back into registers again!
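
A rough sketch of what sinkB2 should receive, assuming the AAPCS rule that an
8-byte composite argument is passed in r0 and r1 when those registers are
free.  The helper b2_as_words is a hypothetical name used only for
illustration; it is not in the Godbolt test code.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

typedef struct B2 { uint32_t a; uint32_t b; } B2;

B2 makeB2(void) { B2 x = { 1, 0 }; return x; }

/* Hypothetical helper: split the struct into the two 32-bit words that
   would be placed in r0 and r1 when it is passed by value. */
void b2_as_words(B2 x, uint32_t words[2])
{
    assert(sizeof x == 2 * sizeof words[0]);  /* no padding in B2 */
    memcpy(words, &x, sizeof x);
}
```

The two words are the constants 1 and 0, which is exactly what the clang
output loads directly into r0 and r1.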


Another example of strange code pessimisations came when I was trying to use
vectors to get return values in four general-purpose registers (instead of the
usual one or two):

    typedef uint32_t C1 __attribute__((vector_size(16)));
    C1 makeC1(void) { C1 x = { 1 }; return x; }
    __attribute__((pcs("aapcs"))) C1 makeC1b(void) { C1 x = { 1 }; return x; }

gcc gives:

    makeC1:
        movs    r1, #0
        movs    r0, #1
        mov     r2, r1
        mov     r3, r1
        vmov    d0, r0, r1      @ int
        vmov    d1, r2, r3      @ int
        bx      lr

I have the vector registers enabled (with "-mcpu=cortex-m7 -mfloat-abi=hard
-mfpu=fpv5-d16" - needed to get good hardware floating point on that target),
so I think it is correct that the SIMD registers d0 and d1 are used here.  But
then it is unnecessary to put the data in r0:r3 as well.  With the "pcs"
attribute to disable returning in SIMD registers, gcc returns the data in r0:r3
as expected.  However, the code to do so is far from expected, with a
pointless save and restore of r4-r7 and a detour through r7 just to zero r3:

    makeC1b:
        push    {r4, r5, r6, r7}
        movs    r7, #0
        movs    r0, #1
        movs    r1, #0
        movs    r2, #0
        mov     r3, r7
        pop     {r4, r5, r6, r7}
        bx      lr

clang just loads r0-r3 with immediate values in both cases.  For makeC1, I
suspect that is incorrect according to the ABI (if d0 and d1 are the right
return registers there), even though it is actually nicer code.
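
For reference, a small sketch of the values being returned, using the same
GCC/clang vector extension.  The "{ 1 }" initialiser sets the first lane and
zero-fills the rest, so the four words are 1, 0, 0, 0.  The helper c1_lanes is
a hypothetical name for illustration only, and the "pcs" attribute is left out
so that the sketch also builds on non-ARM hosts.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

typedef uint32_t C1 __attribute__((vector_size(16)));

C1 makeC1(void) { C1 x = { 1 }; return x; }

/* Hypothetical helper: copy the four 32-bit lanes out of the vector so
   they can be inspected as ordinary integers. */
void c1_lanes(C1 v, uint32_t out[4])
{
    assert(sizeof v == 4 * sizeof out[0]);  /* 16 bytes, four lanes */
    memcpy(out, &v, sizeof v);
}
```

All four lanes are compile-time constants, so whichever registers the calling
convention selects, loading them should not need any callee-saved registers.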
