The shrinkwrap optimization added late in GCC 7 allows each callee-save to be delayed and done only across the blocks which need that particular callee-save. Although this reduces unnecessary memory traffic on code paths that need few callee-saves, it typically uses LDR/STR rather than LDP/STP: the number of LDP/STP instructions is reduced by ~7%. This means more memory accesses and increased codesize, ~1.0% on average.

To improve this, if a particular callee-save must be saved/restored, also add the adjacent callee-save to allow use of LDP/STP. This significantly reduces codesize (for example gcc_r, povray_r, parest_r, xalancbmk_r are 1% smaller). This is a simple fix which can be backported. A more advanced approach would scan blocks for pairs of callee-saves, but that requires a rewrite of all the callee-save code which is too late at this stage.

An example epilog in a shrinkwrapped function before:

    ldp    x21, x22, [sp,#16]
    ldr    x23, [sp,#32]
    ldr    x24, [sp,#40]
    ldp    x25, x26, [sp,#48]
    ldr    x27, [sp,#64]
    ldr    x28, [sp,#72]
    ldr    x30, [sp,#80]
    ldr    d8, [sp,#88]
    ldp    x19, x20, [sp],#96
    ret

And after this patch:

    ldr    d8, [sp,#88]
    ldp    x21, x22, [sp,#16]
    ldp    x23, x24, [sp,#32]
    ldp    x25, x26, [sp,#48]
    ldp    x27, x28, [sp,#64]
    ldr    x30, [sp,#80]
    ldp    x19, x20, [sp],#96
    ret

Passes bootstrap, OK for commit (and backport to GCC7)?

ChangeLog:
2018-01-05  Wilco Dijkstra  <wdijk...@arm.com>

    * config/aarch64/aarch64.c (aarch64_components_for_bb):
    Increase LDP/STP opportunities by adding adjacent callee-saves.

--
diff --git a/gcc/config/aarch64/aarch64.c b/gcc/config/aarch64/aarch64.c
index 9735fc18402dd8fe2fa4022eef4c0522814a0552..da21032b19413d0361b8d30b51a31124eaaa31a1 100644
--- a/gcc/config/aarch64/aarch64.c
+++ b/gcc/config/aarch64/aarch64.c
@@ -3503,7 +3503,22 @@ aarch64_components_for_bb (basic_block bb)
         && (bitmap_bit_p (in, regno)
             || bitmap_bit_p (gen, regno)
             || bitmap_bit_p (kill, regno)))
-      bitmap_set_bit (components, regno);
+      {
+        unsigned regno2, offset, offset2;
+        bitmap_set_bit (components, regno);
+
+        /* If there is a callee-save at an adjacent offset, add it too
+           to increase the use of LDP/STP.  */
+        offset = cfun->machine->frame.reg_offset[regno];
+        regno2 = ((offset & 8) == 0) ? regno + 1 : regno - 1;
+
+        if (regno2 <= LAST_SAVED_REGNUM)
+          {
+            offset2 = cfun->machine->frame.reg_offset[regno2];
+            if ((offset & ~8) == (offset2 & ~8))
+              bitmap_set_bit (components, regno2);
+          }
+      }
 
   return components;
 }
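
For anyone who wants to experiment with the pairing rule outside the compiler, here is a minimal standalone sketch of the same offset test. It is not GCC code: `LAST_REG`, `NOT_SAVED`, `adjacent_callee_save` and `fill_example_frame` are hypothetical names made up for illustration, registers are modelled as plain indices 0..31, and save-slot offsets are assumed to be multiples of 8. The explicit range and "not saved" checks are additions in this sketch. The frame layout is taken from the example epilog above (d8 is left out because the model only covers the integer registers).

```c
/* Hypothetical simplified model, not GCC code: registers are numbered
   0..LAST_REG and reg_offset[r] is the frame slot offset in bytes, or
   NOT_SAVED if register r has no save slot.  */
#define LAST_REG 31
#define NOT_SAVED (-1)

/* Return the register whose save slot shares a 16-byte aligned pair with
   REGNO's slot, or -1 if there is no such saved neighbour.  Offsets are
   assumed to be multiples of 8.  */
int
adjacent_callee_save (const int reg_offset[], int regno)
{
  int offset = reg_offset[regno];
  if (offset == NOT_SAVED)
    return -1;

  /* Bit 3 clear: REGNO sits in the low slot, so try the next register.
     Bit 3 set: REGNO sits in the high slot, so try the previous one.  */
  int regno2 = ((offset & 8) == 0) ? regno + 1 : regno - 1;
  if (regno2 < 0 || regno2 > LAST_REG)
    return -1;

  int offset2 = reg_offset[regno2];
  if (offset2 == NOT_SAVED)
    return -1;

  /* Clearing bit 3 yields the start of the 16-byte block; both slots
     must lie in the same block to be usable by one LDP/STP.  */
  return ((offset & ~8) == (offset2 & ~8)) ? regno2 : -1;
}

/* Fill REG_OFFSET (LAST_REG + 1 entries) with the layout from the example
   epilog: x19..x28 at offsets 0..72 and x30 at offset 80.  */
void
fill_example_frame (int reg_offset[])
{
  for (int r = 0; r <= LAST_REG; r++)
    reg_offset[r] = NOT_SAVED;
  for (int r = 19; r <= 28; r++)
    reg_offset[r] = (r - 19) * 8;
  reg_offset[30] = 80;
}
```

With that layout, asking for the neighbour of x23 yields x24 and vice versa, while x30 has no partner because the slot at offset 88 belongs to a register outside this model, which is consistent with the lone `ldr x30` in the "after" epilog.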