https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116328

            Bug ID: 116328
           Summary: Sub-optimal code generated on Arm M0+ when accessing
                    fields using a proxy object
           Product: gcc
           Version: 13.3.1
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: c++
          Assignee: unassigned at gcc dot gnu.org
          Reporter: terrygreeniaus at gmail dot com
  Target Milestone: ---

Godbolt link: https://godbolt.org/z/sjWb56a4E

In our embedded product we use a proxy object to describe peripheral registers.
Simplified:

    struct le_reg32_t
    {
        volatile uint32_t v;
        constexpr void operator=(uint32_t _v) {v = _v;}
    };

    // Registers of the TIM14 peripheral.
    struct _tim14_regs
    {
        le_reg32_t      cr1;
        const uint8_t   rsrv0[8];
        le_reg32_t      dier;
        le_reg32_t      sr;
        le_reg32_t      egr;
        le_reg32_t      ccmr1;
        const uint8_t   rsrv1[4];
        le_reg32_t      ccer;
        le_reg32_t      cnt;
        le_reg32_t      psc;
        le_reg32_t      arr;
        const uint8_t   rsrv2[4];
        le_reg32_t      ccr1;
        const uint8_t   rsrv3[48];
        le_reg32_t      tisel;
    };

The idea is that we can add methods to the object for doing things like
extracting or setting fields in a peripheral register.  Elsewhere, we define
the set of peripherals that can be used:

    _tim14_regs* const TIM14 = (_tim14_regs*)0x40002000;
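
As a rough illustration of the kind of method meant above, the proxy could be extended along these lines. This is only a sketch: the names get_field and set_field are hypothetical and not taken from our code base, and the constexpr qualifier is left off here to keep the example portable across compilers.

```cpp
#include <cstdint>

struct le_reg32_t
{
    volatile uint32_t v;

    // Whole-register write, as in the simplified struct above.
    void operator=(uint32_t _v) { v = _v; }

    // Hypothetical helper: mask covering `width` bits starting at bit 0.
    static uint32_t low_mask(unsigned width)
    {
        return (width >= 32) ? 0xFFFFFFFFu : ((1u << width) - 1u);
    }

    // Hypothetical helper: read the bit field [lsb, lsb + width).
    uint32_t get_field(unsigned lsb, unsigned width) const
    {
        return (v >> lsb) & low_mask(width);
    }

    // Hypothetical helper: read-modify-write the bit field [lsb, lsb + width).
    void set_field(unsigned lsb, unsigned width, uint32_t value)
    {
        const uint32_t mask = low_mask(width) << lsb;
        v = (v & ~mask) | ((value << lsb) & mask);
    }
};
```

With this sketch, something like TIM14->cr1.set_field(0, 1, 1) would perform one volatile read and one volatile write of cr1.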

And then we use the registers like so:

    void
    arm_modbus_timer(uint32_t dcnt)
    {
        TIM14->cr1 = (0x0000);
        TIM14->sr  = (0x0000);
        TIM14->cnt = (-dcnt);
        TIM14->cr1 = (0x0001);
    }

The expectation, when compiled with optimizations enabled, is that the base
address for TIM14 gets loaded into a CPU register and then the values get
written to the peripheral registers using offsets from the base address.  When
compiled with the following options, this works great:

    -O2
    -mcpu=cortex-m4
    -mfloat-abi=soft

However, if we change the CPU to M0+:

    -O2
    -mcpu=cortex-m0plus
    -mfloat-abi=soft

the emitted code now loads the address of each peripheral register individually
into a CPU register instead of using a base address and offset.  This increases
the size of the code and adds extra load instructions and register pressure.

If we replace the le_reg32_t struct with a simple:

    typedef volatile uint32_t le_reg32_t;

the extra loads disappear.

This is the disassembly on M4, and also on M0+ when using the typedef instead
of the struct:

    arm_modbus_timer(unsigned int):
            movs    r2, #0
            ldr     r3, .L3
            rsbs    r0, r0, #0
            str     r2, [r3]
            str     r2, [r3, #16]
            adds    r2, r2, #1
            str     r0, [r3, #36]
            str     r2, [r3]
            bx      lr
    .L3:
            .word   1073750016


This is the disassembly on M0+ when using the struct:

    arm_modbus_timer(unsigned int):
            movs    r2, #0
            ldr     r3, .L3
            ldr     r1, .L3+4
            str     r2, [r3]
            str     r2, [r1]
            ldr     r2, .L3+8
            rsbs    r0, r0, #0
            str     r0, [r2]
            movs    r2, #1
            str     r2, [r3]
            bx      lr
    .L3:
            .word   1073750016
            .word   1073750032
            .word   1073750052

As you can see, each peripheral register address is now a separate constant
embedded in the code (0x40002000, 0x40002010 and 0x40002024) instead of the
registers being accessed using a single base address and offsets (#16 and #36).
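
For reference, the relationship between the three constants and the struct layout can be checked at compile time. This is a minimal sketch that can be built on any host; TIM14_BASE is a name introduced here for illustration, and the operator= is omitted from the proxy since it does not affect the layout.

```cpp
#include <cstddef>
#include <cstdint>

struct le_reg32_t
{
    volatile uint32_t v;
};

// Same layout as the TIM14 register block shown earlier.
struct _tim14_regs
{
    le_reg32_t      cr1;
    const uint8_t   rsrv0[8];
    le_reg32_t      dier;
    le_reg32_t      sr;
    le_reg32_t      egr;
    le_reg32_t      ccmr1;
    const uint8_t   rsrv1[4];
    le_reg32_t      ccer;
    le_reg32_t      cnt;
    le_reg32_t      psc;
    le_reg32_t      arr;
    const uint8_t   rsrv2[4];
    le_reg32_t      ccr1;
    const uint8_t   rsrv3[48];
    le_reg32_t      tisel;
};

// Illustrative name for the base address used in the TIM14 definition.
constexpr uint32_t TIM14_BASE = 0x40002000u;

// Offsets used by the base+offset code (str ..., [r3, #16] and [r3, #36]).
static_assert(offsetof(_tim14_regs, sr)  == 16, "sr offset");
static_assert(offsetof(_tim14_regs, cnt) == 36, "cnt offset");

// The three .word entries in the M0+ struct disassembly.
static_assert(TIM14_BASE == 1073750016u, "address of cr1");
static_assert(TIM14_BASE + offsetof(_tim14_regs, sr)  == 1073750032u, "address of sr");
static_assert(TIM14_BASE + offsetof(_tim14_regs, cnt) == 1073750052u, "address of cnt");
```

So the second and third literal-pool entries are exactly the base address plus the offsets that the M4 code encodes directly in the str instructions.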

We are building using arm-none-eabi-gcc 13.3.1, but in Godbolt I tested it
against various versions and it seems to still have the problem in ARM GCC
trunk.