I'm writing code for an ATtiny (tiny84 specifically). I currently have an array and a function defined thus:
static struct { uint8_t state; uint8_t v1; } leds[3]; uint8_t led_state(uint8_t i) { return leds[i].state; } Elements of this array are 2 bytes long, and this function compiles to some fairly sensible looking code: 000003b8 <led_state>: 3b8: e8 2f mov r30, r24 3ba: f0 e0 ldi r31, 0x00 ; 0 3bc: ee 0f add r30, r30 3be: ff 1f adc r31, r31 3c0: e1 57 subi r30, 0x71 ; 113 3c2: ff 4f sbci r31, 0xFF ; 255 3c4: 80 81 ld r24, Z 3c6: 08 95 ret The instruction at 3bc adds r30 to itself, effectively causing it to contain 2*i, which is the offset in bytes required to be added to the base address of the `leds` array to find the required field. If instead I define my array thus: static struct { uint8_t state; uint16_t v1; } leds[3]; I get the most horrible: 000003be <led_state>: 3be: 90 e0 ldi r25, 0x00 ; 0 3c0: 63 e0 ldi r22, 0x03 ; 3 3c2: 70 e0 ldi r23, 0x00 ; 0 3c4: 2d d3 rcall .+1626 ; 0xa20 <__mulhi3> 3c6: 81 57 subi r24,0x71 ; 113 3c8: 9f 4f sbci r25, 0xFF ; 255 3ca: fc 01 movw r30, r24 3cc: 80 81 ld r24, Z 3ce: 08 95 ret Here, gcc has given up trying to perform a non power-of-two multiplication and just inlined a call to the generic __mulhi3 function instead. This is somewhat disappointing, as if I further expand my array with a dummy field for padding, I now get: 000003be <led_state>: 3be: e8 2f mov r30, r24 3c0: f0 e0 ldi r31, 0x00 ; 0 3c2: ee 0f add r30, r30 3c4: ff 1f adc r31, r31 3c6: ee 0f add r30, r30 3c8: ff 1f adc r31, r31 3ca: e1 57 subi r30, 0x71 ; 113 3cc: ff 4f sbci r31, 0xFF ; 255 3ce: 80 81 ld r24, Z 3d0: 08 95 ret because it has neatly managed to perform that power-of-two multiplication, this time of 4*i. I feel this is a situation that could be resolved better; for instance, a multiplication by 3 in the middle case could have been performed by mov r30, r24 ldi r31, 0x00 ; Z == i add r30, r30 adc r31, r31 ; Z == 2*i add r30, r24 adc r31, r1 ; Z == 2*i + i == 3*i in just as many instructions as the 4*i case, without my having to waste extra bytes of precious RAM just to provide that dummy padding. Furthermore, I care a lot about instruction counts to access because this LED management code runs inside a timer match interrupt to implement software PWM, so it has to be fast and tight. As a workaround for now, I can restructure my code into two separate arrays of integers, instead of one array of two-int structs, and thus ensure that all arrays contain only a power-of-two number of bytes, so all accesses will be efficient. However, it would be nice if gcc could manage these multiplications a little nicer. ----- As another side note, consider the massive difference between uint8_t mul10(uint8_t x) { return x * 10; } 3be: 6a e0 ldi r22, 0x0A ; 10 3c0: 29 d3 rcall .+1618 ; 0xa14 <__mulqi3> 3c2: 08 95 ret vs. something like mov r0, r24 ; r0 = x add r24, r24 ; r24 = 2*x add r24, r24 ; r24 = 4*x add r24, r0 ; r24 = 5*x add r24, r24 ; r24 = 10*x ret ----- Is there currently underway an attempt to provide better optimisation of these sorts of cases? If not, where/how could I contribute such? -- Paul "LeoNerd" Evans leon...@leonerd.org.uk http://www.leonerd.org.uk/ | https://metacpan.org/author/PEVANS
pgp4uvgiC4pMd.pgp
Description: OpenPGP digital signature
_______________________________________________ AVR-GCC-list mailing list AVR-GCC-list@nongnu.org https://lists.nongnu.org/mailman/listinfo/avr-gcc-list