https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116328
Bug ID: 116328
Summary: Sub-optimal code generated on Arm M0+ when accessing fields using a proxy object
Product: gcc
Version: 13.3.1
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: c++
Assignee: unassigned at gcc dot gnu.org
Reporter: terrygreeniaus at gmail dot com
Target Milestone: ---

Godbolt link: https://godbolt.org/z/sjWb56a4E

In our embedded product we use a proxy object to describe peripheral registers. Simplified:

    struct le_reg32_t
    {
        volatile uint32_t v;

        constexpr void operator=(uint32_t _v) {v = _v;}
    };

    // Registers of the TIM14 peripheral.
    struct _tim14_regs
    {
        le_reg32_t      cr1;
        const uint8_t   rsrv0[8];
        le_reg32_t      dier;
        le_reg32_t      sr;
        le_reg32_t      egr;
        le_reg32_t      ccmr1;
        const uint8_t   rsrv1[4];
        le_reg32_t      ccer;
        le_reg32_t      cnt;
        le_reg32_t      psc;
        le_reg32_t      arr;
        const uint8_t   rsrv2[4];
        le_reg32_t      ccr1;
        const uint8_t   rsrv3[48];
        le_reg32_t      tisel;
    };

The idea is that we can add methods to the object for doing things like extracting or setting fields in a peripheral register. Elsewhere, we define the set of peripherals that can be used:

    _tim14_regs* const TIM14 = (_tim14_regs*)0x40002000;

And then we use the registers as so:

    void arm_modbus_timer(uint32_t dcnt)
    {
        TIM14->cr1 = (0x0000);
        TIM14->sr  = (0x0000);
        TIM14->cnt = (-dcnt);
        TIM14->cr1 = (0x0001);
    }

The expectation, when compiled with optimizations enabled, is that the base address for TIM14 gets loaded into a CPU register and then the values get written to the peripheral registers using offsets from the base address. When compiled with the following options, this works great:

    -O2 -mcpu=cortex-m4 -mfloat-abi=soft

However, if we change the CPU to M0+:

    -O2 -mcpu=cortex-m0plus -mfloat-abi=soft

the emitted code now loads the address of each peripheral register individually into a CPU register instead of using a base address and offset. This increases the size of the code and adds extra load instructions and register pressure.
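To make the base-plus-offset expectation concrete, here is a minimal host-side sketch that checks the field offsets of the struct from this report. It assumes the usual ABI where `uint32_t` is 4 bytes with 4-byte alignment. `fits_thumb1_str_imm` is a hypothetical helper name, and `constexpr` is dropped from `operator=` only because it has no effect on layout. The ARMv6-M `STR` immediate form encodes word offsets from 0 to 124 in multiples of 4, so every register in this block should be reachable from a single base register.

```cpp
#include <cstddef>
#include <cstdint>

// Proxy type from the report (constexpr omitted; it does not affect layout).
struct le_reg32_t {
    volatile uint32_t v;
    void operator=(uint32_t _v) { v = _v; }
};

// Registers of the TIM14 peripheral, as given in the report.
struct _tim14_regs {
    le_reg32_t    cr1;
    const uint8_t rsrv0[8];
    le_reg32_t    dier;
    le_reg32_t    sr;
    le_reg32_t    egr;
    le_reg32_t    ccmr1;
    const uint8_t rsrv1[4];
    le_reg32_t    ccer;
    le_reg32_t    cnt;
    le_reg32_t    psc;
    le_reg32_t    arr;
    const uint8_t rsrv2[4];
    le_reg32_t    ccr1;
    const uint8_t rsrv3[48];
    le_reg32_t    tisel;
};

// Hypothetical helper: true if a word store at this offset fits the
// ARMv6-M STR (immediate) encoding: a multiple of 4 in the range 0..124.
constexpr bool fits_thumb1_str_imm(std::size_t off) {
    return off % 4 == 0 && off <= 124;
}

// Offsets of the three registers written by arm_modbus_timer.
static_assert(offsetof(_tim14_regs, cr1) == 0,  "cr1 offset");
static_assert(offsetof(_tim14_regs, sr)  == 16, "sr offset");
static_assert(offsetof(_tim14_regs, cnt) == 36, "cnt offset");

// The last register in the block is still within immediate range.
static_assert(offsetof(_tim14_regs, tisel) == 104, "tisel offset");
static_assert(fits_thumb1_str_imm(offsetof(_tim14_regs, tisel)),
              "tisel reachable from base");
```

If the checks hold, the individual address loads on M0+ are not forced by the instruction encoding.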
If we replace the le_reg32_t struct with a simple:

    typedef volatile uint32_t le_reg32_t;

the extra loads disappear. This is the disassembly on M4, and also on M0+ when using the typedef instead of the struct:

    arm_modbus_timer(unsigned int):
            movs    r2, #0
            ldr     r3, .L3
            rsbs    r0, r0, #0
            str     r2, [r3]
            str     r2, [r3, #16]
            adds    r2, r2, #1
            str     r0, [r3, #36]
            str     r2, [r3]
            bx      lr
    .L3:
            .word   1073750016

This is the disassembly on M0+ when using the struct:

    arm_modbus_timer(unsigned int):
            movs    r2, #0
            ldr     r3, .L3
            ldr     r1, .L3+4
            str     r2, [r3]
            str     r2, [r1]
            ldr     r2, .L3+8
            rsbs    r0, r0, #0
            str     r0, [r2]
            movs    r2, #1
            str     r2, [r3]
            bx      lr
    .L3:
            .word   1073750016
            .word   1073750032
            .word   1073750052

As you can see, the addresses of the individual peripheral registers are now emitted as separate constants in the literal pool instead of the registers being accessed through a single base address and offset.

We are building using arm-none-eabi-gcc 13.3.1, but in Godbolt I tested it against various versions and the problem still seems to be present in ARM GCC trunk.
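For comparing the two variants from a single file, a sketch along the following lines may help. The `USE_TYPEDEF` macro, the `TIM14_BASE` constant, and the `arm_modbus_timer_at`, `reg_value` and `host_check` helpers are hypothetical names introduced here, not part of the report. The register block is passed as a parameter so the store sequence can be exercised on a host, and `constexpr` is dropped from `operator=` for this sketch. The static assertions check that the three literal-pool words in the M0+ listing are the base address, base + 16 (`sr`) and base + 36 (`cnt`).

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical switch: build with -DUSE_TYPEDEF for the plain-typedef variant.
#ifdef USE_TYPEDEF
typedef volatile uint32_t le_reg32_t;
#else
struct le_reg32_t {
    volatile uint32_t v;
    void operator=(uint32_t _v) { v = _v; }
};
#endif

// Registers of the TIM14 peripheral, as given in the report.
struct _tim14_regs {
    le_reg32_t    cr1;
    const uint8_t rsrv0[8];
    le_reg32_t    dier;
    le_reg32_t    sr;
    le_reg32_t    egr;
    le_reg32_t    ccmr1;
    const uint8_t rsrv1[4];
    le_reg32_t    ccer;
    le_reg32_t    cnt;
    le_reg32_t    psc;
    le_reg32_t    arr;
    const uint8_t rsrv2[4];
    le_reg32_t    ccr1;
    const uint8_t rsrv3[48];
    le_reg32_t    tisel;
};

// Hypothetical name for the TIM14 address used in the report.
constexpr uint32_t TIM14_BASE = 0x40002000u;
static_assert(TIM14_BASE == 1073750016u, "base");
static_assert(TIM14_BASE + offsetof(_tim14_regs, sr)  == 1073750032u, "sr");
static_assert(TIM14_BASE + offsetof(_tim14_regs, cnt) == 1073750052u, "cnt");

// Same store sequence as arm_modbus_timer, with the block as a parameter.
inline void arm_modbus_timer_at(_tim14_regs* tim, uint32_t dcnt) {
    tim->cr1 = 0x0000;
    tim->sr  = 0x0000;
    tim->cnt = -dcnt;
    tim->cr1 = 0x0001;
}

// Read a register back in either variant.
inline uint32_t reg_value(const le_reg32_t& r) {
#ifdef USE_TYPEDEF
    return r;
#else
    return r.v;
#endif
}

// Run the sequence against a zero-initialized block in ordinary memory.
inline void host_check() {
    _tim14_regs regs = {};
    arm_modbus_timer_at(&regs, 5);
    assert(reg_value(regs.cr1) == 1u);
    assert(reg_value(regs.sr)  == 0u);
    assert(reg_value(regs.cnt) == 0xFFFFFFFBu);  // -5 as uint32_t
}
```

This only checks that both variants have the same layout and store the same values. It says nothing about the code generated for the target, which still has to be inspected with the options listed in the report.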