On 30.08.24 at 14:46, Richard Biener wrote:
On Fri, Aug 30, 2024 at 2:10 PM Georg-Johann Lay <a...@gjlay.de> wrote:
There are cases where opportunities to use POST_INC addressing
only occur very late in the compilation process. Take for example
the following function from AVR-LibC's qsort:
void swapfunc (char *a, char *b, int n)
{
do
{
char t = *a;
*a++ = *b;
*b++ = t;
} while (--n > 0);
}
which -mmcu=avrtiny -S -Os -dp compiles to:
swapfunc:
push r28 ; 72 [c=4 l=1] pushqi1/0
push r29 ; 73 [c=4 l=1] pushqi1/0
/* prologue: function */
/* frame size = 0 */
/* stack size = 2 */
mov r26,r24 ; 66 [c=4 l=1] movqi_insn/0
mov r27,r25 ; 67 [c=4 l=1] movqi_insn/0
mov r30,r22 ; 68 [c=4 l=1] movqi_insn/0
mov r31,r23 ; 69 [c=4 l=1] movqi_insn/0
mov r22,r20 ; 70 [c=4 l=1] movqi_insn/0
mov r23,r21 ; 71 [c=4 l=1] movqi_insn/0
.L2:
ld r20,X ; 55 [c=4 l=1] movqi_insn/3
ld r21,Z ; 57 [c=4 l=1] movqi_insn/3
st X,r21 ; 58 [c=4 l=1] movqi_insn/2
subi r26,-1 ; 59 [c=4 l=2] *addhi3_clobber/0
sbci r27,-1
st Z,r20 ; 61 [c=4 l=1] movqi_insn/2
subi r30,-1 ; 62 [c=4 l=2] *addhi3_clobber/0
sbci r31,-1
subi r22, 1 ; 81 [c=4 l=2] *add.for.cczn.hi/0
sbci r23, 0
breq .+2 ; 82 [c=8 l=2] branch_ZN
brpl .L2
/* epilogue start */
pop r29 ; 76 [c=4 l=1] popqi
pop r28 ; 77 [c=4 l=1] popqi
ret ; 78 [c=0 l=1] return_from_epilogue
Insns 58+59 and insns 61+62 are post-inc stores. They are not recognized
because the code prior to cprop_hardreg is a bit of a mess that moves
addresses back and forth (including Y). Only after cprop_hardreg is the
code simple enough for post-inc to be detected by avr-fuse-add.
Hence this patch runs a 2nd instance of that pass late after
cprop_hardreg (the 1st instance runs prior to RTL peephole).
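For readers less familiar with the term, here is a source-level sketch of
what folding an address add into a post-inc access means. It is only an
illustration under assumptions: the two function names are made up for this
example, and the real pass operates on RTL insns, not on C. The first variant
keeps the store and the pointer add as separate steps, as in the
st / subi / sbci sequences of the listing above; the second folds the add
into the access.

```c
/* Store and address increment as two separate steps, similar to
   "st X,r21" followed by "subi r26,-1 / sbci r27,-1" in the listing.
   Hypothetical name, for illustration only.  */
void swap_separate_add (char *a, char *b, int n)
{
  do
    {
      char t = *a;
      *a = *b;      /* plain store through a */
      a = a + 1;    /* separate 16-bit address add */
      *b = t;       /* plain store through b */
      b = b + 1;    /* separate 16-bit address add */
    } while (--n > 0);
}

/* The same loop with each increment folded into the store, which is
   the shape a post-inc store (e.g. "st X+,r21") can implement.
   Hypothetical name, for illustration only.  */
void swap_post_inc (char *a, char *b, int n)
{
  do
    {
      char t = *a;
      *a++ = *b;    /* store and increment in one step */
      *b++ = t;     /* store and increment in one step */
    } while (--n > 0);
}
```

Both variants have the same effect on the buffers; the only difference is
whether access and increment can be expressed as one instruction.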
It renames avr_split_tiny_move to avr_split_fake_addressing_move
because that function also splits some insns on non-avrtiny.
The patch removes a define_split that's no longer needed because
such splits are performed by avr_split_fake_addressing_move.
Passed without new regressions. Ok for trunk?
Johann
p.s. post-inc etc. optimizations are basically non-existent in GCC.
One reason seems to be SSA, because each SSA variable gets its own
register, which results in a jungle of required registers that
even pass auto-inc-dec cannot penetrate...
Take for example the following code, which would require 12 instructions
as indicated by the comments:
void add4 (uint8_t *aa, const __flash uint8_t *bb, uint8_t nn)
{
// Set Z (R30) to bb (1 MOVW or 2 MOVs)
// Set X (R26) to aa (1 MOVW or 2 MOVs)
do
{
uint8_t sum = 0;
sum += *bb++; // 1 instruction: POST_INC load
sum += *bb++; // 2 instructions: POST_INC load + add
sum += *bb++; // 2 instructions: POST_INC load + add
sum += *bb++; // 2 instructions: POST_INC load + add
*aa++ = sum; // 1 instruction: POST_INC store
} while (--nn); // 2 instructions: dec + branch
}
I think the reason here is weird behavior with __flash and how IVOPTS
tries to enable auto-incdec:
_30 = bb_2 + 1;
_10 = MEM[(const <address-space-1> uint8_t *)_30];
_29 = bb_2 + 2;
_12 = MEM[(const <address-space-1> uint8_t *)_29];
_19 = _10 + _12;
sum_13 = _19 + _9;
bb_14 = bb_2 + 4;
_28 = bb_14 + 65535;
_15 = MEM[(const <address-space-1> uint8_t *)_28];
without __flash you get
_10 = MEM[(const uint8_t *)_27 + 1B];
sum_11 = _9 + _10;
_12 = MEM[(const uint8_t *)_27 + 2B];
sum_13 = sum_11 + _12;
_15 = MEM[(const uint8_t *)_27 + 3B];
which of course is a problem for pre/post-inc addressing as well,
but one that I think is solved (well, you can hope...).
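A side note on the "bb_14 + 65535" in the __flash dump: with 16-bit
addresses, adding 65535 is the same as subtracting 1 modulo 2^16, so _28 is
simply bb_2 + 3. The sketch below models this, assuming uint16_t as a
stand-in for a 16-bit AVR address; the helper name is hypothetical.

```c
#include <stdint.h>

/* Model a 16-bit address computation: base plus an unsigned offset,
   wrapping modulo 2^16.  Hypothetical helper, for illustration only.  */
uint16_t addr_plus (uint16_t base, uint16_t offset)
{
  return (uint16_t) (base + offset);
}
```

With this model, addr_plus (addr_plus (bb, 4), 65535) and addr_plus (bb, 3)
give the same address for any bb.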
Can you open a bug report with this testcase?
https://gcc.gnu.org/PR116542
Without __flash, the code uses Z, Z+1, Z+2 and Z+3 as addresses,
but still uses one register for access and one register to increment
the address.
Then there are these crazy computations like:
start_address - current_address + nn == 0
as condition for loop termination instead of just dec + branch
of nn.
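To spell out why the two exit tests are interchangeable (and why the second
one is more work), here is a small C sketch. The function names are
hypothetical and the byte counter mirrors nn from add4; it models the
arithmetic only, not the code GCC actually emits.

```c
#include <assert.h>
#include <stdint.h>

/* Exit test by counting nn down: one decrement plus one branch.  */
uint8_t count_dec_branch (uint8_t nn)
{
  uint8_t iterations = 0;
  do
    iterations++;
  while (--nn);
  return iterations;
}

/* Exit test expressed through the moving pointer: nn is left unchanged
   and the loop ends once start - current + nn == 0.  This needs the
   start address, the current address and nn all at the same time.  */
uint8_t count_address_compare (const uint8_t *start, uint8_t nn)
{
  /* Assumption of this sketch: nn != 0, so that the pointer stays
     within (or one past) the caller's buffer of at least nn bytes.  */
  assert (nn != 0);

  const uint8_t *current = start;
  uint8_t iterations = 0;
  do
    {
      current++;      /* stands in for the post-incremented access */
      iterations++;
    } while ((uint8_t) (start - current + nn) != 0);
  return iterations;
}
```

For a nonzero nn both functions run the loop body nn times; the second form
just keeps three values around where the first needs one.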
Johann
But -mmcu=avr4 -Os -S -dp compiles this to madness that
requires more than a dozen registers and computes each
intermediate in its own 16-bit register, coming out with
code that requires 34 instructions instead of 12.
--
AVR: Run pass avr-fuse-add a second time after pass_cprop_hardreg.
gcc/
* config/avr/avr-protos.h (avr_split_tiny_move): Rename to
avr_split_fake_addressing_move.
* config/avr/avr-passes.cc: Same.
(avr_pass_data_fuse_add) <tv_id>: Set to TV_MACH_DEP.
(avr_pass_fuse_add) <clone>: Override.
* config/avr/avr-passes.def (avr_pass_fuse_add): Run again
after pass_cprop_hardreg.
* config/avr/avr.md (split-lpmx): Remove a define_split. Such
splits are performed by avr_split_fake_addressing_move.