Ping
On Wed, Nov 19, 2014 at 10:43 AM, Joey Ye <joey...@arm.com> wrote:
> Current thumb2 -Os generates suboptimal code for the following tail call
> case:
>
>   int f4(int b, int a, int c, int d);
>   int g(int a, int b, int c, int d)
>   { return f4(b, a, c, d); }
>
>   arm-none-eabi-gcc -Os -mthumb -mcpu=cortex-m3 test.c
>
>   push	{r4, lr}
>   mov	r4, r1
>   mov	r1, r0
>   mov	r0, r4
>   pop	{r4, lr}
>   b	f4
>
> There are two issues: the first is that saving/restoring lr is not
> necessary, as there is no return via pop pc. The second is that even if
> we managed to avoid the lr push/pop, ldmia.w sp!, {r4} would still be
> emitted, as there is a missing pattern for popping a single register, so
> code size is not optimal.
>
> This patch fixes these two issues and introduces a shared test case.
> CSiBE thumb2 -Os shows cross-board code size reduction, except for one
> case with a 4-byte regression. The case looks like:
>
>   void f ()
>   {
>     if ()
>       ...
>     else if ()
>       ...
>     else g();
>   }
>
> There are N=2 non-sibcall returns and S=1 sibcall return. Originally the
> non-sibcall returns are just pop {r4, r5, pc}; now they become
>
>   b.n	.Lreturn
>
>   .Lreturn:
>   pop	{r4, r5}
>   bx	lr
>
> The one-byte saving from the sibcall return does not win back the
> regression on the non-sibcall returns. In the general scenario, the N
> non-sibcall returns use b.n to branch to the merged tail, and the S
> sibcalls each save 2 bytes by avoiding popping lr, resulting in a
> regression of 4-2*S bytes. In the worst scenario, each non-sibcall
> return has to use b.w to branch to the merged tail, resulting in a
> regression of (N-S)*2 bytes. The worst scenario is rare, according to
> CSiBE, and the general scenario can only regress 2 bytes at most. So I
> would not introduce additional complexity to handle the regression case.
>
> Make check cortex-m3: pass
> thumb2 bootstrap (O2/Os): pass
>
> 	* config/arm/arm.c (arm_compute_save_reg_mask):
> 	Do not save lr in case of tail call.
> 	* config/arm/thumb2.md (*thumb2_pop_single): New pattern.
>
> 	* gcc.target/arm/thumb2-pop-single.c: New test.
>
> diff --git a/gcc/config/arm/arm.c b/gcc/config/arm/arm.c
> index 4f04707..20d0b9e 100644
> --- a/gcc/config/arm/arm.c
> +++ b/gcc/config/arm/arm.c
> @@ -19190,6 +19190,7 @@ arm_compute_save_reg_mask (void)
>        || (save_reg_mask
>  	  && optimize_size
>  	  && ARM_FUNC_TYPE (func_type) == ARM_FT_NORMAL
> +	  && !crtl->tail_call_emit
>  	  && !crtl->calls_eh_return))
>      save_reg_mask |= 1 << LR_REGNUM;
>
> diff --git a/gcc/config/arm/thumb2.md b/gcc/config/arm/thumb2.md
> index 64acfea..29cfb17 100644
> --- a/gcc/config/arm/thumb2.md
> +++ b/gcc/config/arm/thumb2.md
> @@ -267,6 +267,17 @@
>     (set_attr "type" "multiple")]
>  )
>
> +;; Pop a single register as its size is preferred over a post-incremental load
> +(define_insn "*thumb2_pop_single"
> +  [(set (match_operand:SI 0 "low_register_operand" "=r")
> +        (mem:SI (post_inc:SI (reg:SI SP_REGNUM))))]
> +  "TARGET_THUMB2 && (reload_in_progress || reload_completed)"
> +  "pop\t{%0}"
> +  [(set_attr "type" "load1")
> +   (set_attr "length" "2")
> +   (set_attr "predicable" "yes")]
> +)
> +
>  ;; We have two alternatives here for memory loads (and similarly for stores)
>  ;; to reflect the fact that the permissible constant pool ranges differ
>  ;; between ldr instructions taking low regs and ldr instructions taking high