On 08/12/2023 at 17:30, Naveen N Rao wrote:
> Function profile sequence on powerpc includes two instructions at the
> beginning of each function:
> 
> 	mflr	r0
> 	bl	ftrace_caller
> 
> The call to ftrace_caller() gets nop'ed out during kernel boot and is
> patched in when ftrace is enabled.
> 
> There are two issues with this:
> 1. The 'mflr r0' instruction at the beginning of each function remains
>    even though ftrace is not being used.
> 2. When ftrace is activated, we return from ftrace_caller() with a bctr
>    instruction to preserve r0 and LR, resulting in the link stack
>    becoming unbalanced.
> 
> To address (1), we have tried to nop out the 'mflr r0' instruction when
> nop'ing out the call to ftrace_caller() and restoring it when enabling
> ftrace. But that required additional synchronization, slowing down
> ftrace activation. It also left an additional nop instruction at the
> beginning of each function, which wasn't desirable on 32-bit powerpc.
> 
> Instead of that, move the function profile sequence out of line,
> leaving a single nop at function entry. On ftrace activation, the nop
> is changed to an unconditional branch to the out-of-line sequence that
> in turn calls ftrace_caller(). This removes the need for complex
> synchronization during ftrace activation and simplifies the code. More
> importantly, this improves performance of the kernel when ftrace is
> not in use.
> 
> To address (2), change the ftrace trampoline to return with a 'blr'
> instruction with the original return address in r0 intact. Then, an
> additional 'mtlr r0' instruction in the function profile sequence can
> move the correct return address back to LR.
> 
> With the above two changes, the function profile sequence now looks
> like the following:
> 
> 	[func:			# GEP -- 64-bit powerpc, optional
> 		addis	r2,r12,imm1
> 		addi	r2,r2,imm2]
> 	tramp:
> 		mflr	r0
> 		bl	ftrace_caller
> 		mtlr	r0
> 		b	func
> 		nop
> 		[nop]		# 64-bit powerpc only
> 	func:			# LEP
> 		nop
> 
> On 32-bit powerpc, the ftrace mcount trampoline is now completely
> outside the function. This is also the case on 64-bit powerpc for
> functions that do not need a GEP. However, for functions that need a
> GEP, the additional instructions are inserted between the GEP and the
> LEP. Since we can only have a fixed number of instructions between GEP
> and LEP, we choose to emit 6 instructions. Four of those instructions
> are used for the function profile sequence and two instruction slots
> are reserved for implementing support for
> DYNAMIC_FTRACE_WITH_CALL_OPS. On 32-bit powerpc, we emit one
> additional nop for this purpose, resulting in a total of 5 nops before
> function entry.
> 
> To enable ftrace, the nop at function entry is changed to an
> unconditional branch to 'tramp'. The call to ftrace_caller() may be
> updated to ftrace_regs_caller() depending on the registered ftrace
> ops. On 64-bit powerpc, we additionally change the instruction at
> 'tramp' to 'mflr r0' from an unconditional branch back to func+4. This
> is so that functions entered through the GEP can skip the function
> profile sequence unless ftrace is enabled.
> 
> With the context_switch microbenchmark on a P9 machine, there is a
> performance improvement of ~6% with this patch applied, going from
> 650k context switches to 690k context switches without ftrace enabled.
> With ftrace enabled, the performance was similar at 86k context
> switches.
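
If I am reading this right, enabling ftrace then boils down to patching
a single instruction at the LEP. Roughly the following, as a sketch of
my understanding (not code from the patch; ftrace_modify_code() is the
existing helper in arch/powerpc/kernel/trace/ftrace.c, and
sketch_activate() is a made-up name):

	/* Sketch: flip the nop at the LEP into 'b tramp' on activation */
	static int sketch_activate(unsigned long ip, unsigned long tramp)
	{
		ppc_inst_t old = ppc_inst(PPC_RAW_NOP());
		/* Branch offset is relative to the patched instruction */
		ppc_inst_t new = ppc_inst(PPC_RAW_BRANCH((long)(tramp - ip)));

		/* Verifies the old instruction, then patches it */
		return ftrace_modify_code(ip, old, new);
	}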
Wondering how significant that context_switch microbenchmark is. I ran
it on both mpc885 and mpc8321 and I'm a bit puzzled by some of the
results:

	# ./context_switch --no-fp
	Using threads with yield on cpus 0/0 touching FP:no altivec:no vector:no vdso:no

On 885, I get the following results before and after your patch:

	CONFIG_FTRACE not selected     : 44.9k
	CONFIG_FTRACE selected, before : 32.8k
	CONFIG_FTRACE selected, after  : 33.6k

All this is with CONFIG_INIT_STACK_ALL_ZERO, which is the default. But
when I select CONFIG_INIT_STACK_NONE, the "CONFIG_FTRACE not selected"
result is only 34.4k.

On 8321:

	CONFIG_FTRACE not selected     : 100.3k
	CONFIG_FTRACE selected, before : 72.5k
	CONFIG_FTRACE selected, after  : 116k

So the results look odd to me.

> 
> The downside of this approach is the increase in vmlinux size,
> especially on 32-bit powerpc. We now emit 3 additional instructions for
> each function (excluding the one or two instructions for supporting
> DYNAMIC_FTRACE_WITH_CALL_OPS). On 64-bit powerpc with the current
> implementation of -fpatchable-function-entry though, this is not
> avoidable since we are forced to emit 6 instructions between the GEP and
> the LEP even if we are to only support DYNAMIC_FTRACE_WITH_CALL_OPS.

The increase is almost 5% on the few 32-bit defconfigs I have tested.
That's a lot.

On 32-bit powerpc, only the e500 has a link stack that could end up
being unbalanced. Could we keep the bctr and avoid the mtlr and the
jump in the trampoline?

On the 8xx, a NOP is one cycle and a taken branch is 2 cycles, but the
second cycle is a bubble that most of the time gets filled by the
following operations. On the 83xx, branches and NOPs are supposed to be
seamless. So is that out-of-line trampoline really worth it?

Maybe keeping the ftrace instructions at the beginning of the function
and just replacing the mflr with a jump when ftrace is off would help
reduce the ftrace insns by one more instruction.
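
Something like the following, where enabling/disabling only ever
patches the first instruction of the function (a rough sketch of my
suggestion, untested; sketch_toggle_inline() is a made-up name and the
offset assumes the 'bl ftrace_caller' sits immediately after, at
func+4):

	/*
	 * Sketch of the inline alternative: keep 'mflr r0; bl ftrace_caller'
	 * at function entry and patch only the first instruction.
	 */
	static int sketch_toggle_inline(unsigned long ip, bool enable)
	{
		u32 insn = enable ? PPC_RAW_MFLR(_R0) :		  /* mflr r0 */
			   PPC_RAW_BRANCH(2 * MCOUNT_INSN_SIZE);  /* b func+8, over the bl */

		return patch_instruction((u32 *)ip, ppc_inst(insn));
	}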

> 
> Signed-off-by: Naveen N Rao <nav...@kernel.org>
> ---
> diff --git a/arch/powerpc/include/asm/code-patching.h b/arch/powerpc/include/asm/code-patching.h
> index 84f6ccd7de7a..9a54bb9e0dde 100644
> --- a/arch/powerpc/include/asm/code-patching.h
> +++ b/arch/powerpc/include/asm/code-patching.h
> @@ -185,10 +185,21 @@ static inline unsigned long ppc_function_entry(void *func)
>  	 */
>  	if ((((*insn & OP_RT_RA_MASK) == ADDIS_R2_R12) ||
>  	     ((*insn & OP_RT_RA_MASK) == LIS_R2)) &&
> -	    ((*(insn+1) & OP_RT_RA_MASK) == ADDI_R2_R2))
> +	    ((*(insn+1) & OP_RT_RA_MASK) == ADDI_R2_R2)) {
> +#ifdef CONFIG_ARCH_USING_PATCHABLE_FUNCTION_ENTRY

Can you replace that with IS_ENABLED()?

> +		/*
> +		 * Heuristic: look for the 'mtlr r0' instruction assuming ftrace init is done.
> +		 * If it is not found, look for two consecutive nops after the GEP.
> +		 * Longer term, we really should be parsing the symbol table to determine LEP.
> +		 */
> +		if ((*(insn+4) == PPC_RAW_MTLR(_R0)) ||
> +		    ((*(insn+2) == PPC_RAW_NOP() && *(insn+3) == PPC_RAW_NOP())))
> +			return (unsigned long)(insn + 8);
> +#endif
>  		return (unsigned long)(insn + 2);
> -	else
> +	} else {
>  		return (unsigned long)func;
> +	}
>  #elif defined(CONFIG_PPC64_ELF_ABI_V1)
>  	/*
>  	 * On PPC64 ABIv1 the function pointer actually points to the
> diff --git a/arch/powerpc/kernel/trace/ftrace.c b/arch/powerpc/kernel/trace/ftrace.c
> index 2956196c98ff..d3b4949142a8 100644
> --- a/arch/powerpc/kernel/trace/ftrace.c
> +++ b/arch/powerpc/kernel/trace/ftrace.c
> @@ -217,15 +274,62 @@ int ftrace_init_nop(struct module *mod, struct dyn_ftrace *rec)
>  {
>  	unsigned long addr, ip = rec->ip;
>  	ppc_inst_t old, new;
> -	int ret = 0;
> +	int i, ret = 0;
> +	u32 ftrace_mcount_tramp_insns[] = {
> +#ifdef CONFIG_PPC64
> +		PPC_RAW_BRANCH(FTRACE_MCOUNT_TRAMP_OFFSET + MCOUNT_INSN_SIZE),
> +#else
> +		PPC_RAW_MFLR(_R0),
> +#endif

Can it be based on IS_ENABLED(CONFIG_PPC64) instead?

> +		PPC_RAW_BL(0),		/* bl ftrace_caller */
> +		PPC_RAW_MTLR(_R0),	/* also see update ppc_function_entry() */
> +		PPC_RAW_BRANCH(FTRACE_MCOUNT_TRAMP_OFFSET - MCOUNT_INSN_SIZE * 2)
> +	};
> +
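
i.e. something like the following (untested; both arms are constant
expressions, so the ternary should fold at compile time):

	u32 ftrace_mcount_tramp_insns[] = {
		/* On 64-bit, entry via the GEP initially skips the sequence */
		IS_ENABLED(CONFIG_PPC64) ?
			PPC_RAW_BRANCH(FTRACE_MCOUNT_TRAMP_OFFSET + MCOUNT_INSN_SIZE) :
			PPC_RAW_MFLR(_R0),
		PPC_RAW_BL(0),		/* bl ftrace_caller */
		PPC_RAW_MTLR(_R0),	/* also see update ppc_function_entry() */
		PPC_RAW_BRANCH(FTRACE_MCOUNT_TRAMP_OFFSET - MCOUNT_INSN_SIZE * 2)
	};

Same remark for the code-patching.h hunk above; the #ifdef could become
something like (untested):

	if (IS_ENABLED(CONFIG_ARCH_USING_PATCHABLE_FUNCTION_ENTRY) &&
	    ((*(insn+4) == PPC_RAW_MTLR(_R0)) ||
	     (*(insn+2) == PPC_RAW_NOP() && *(insn+3) == PPC_RAW_NOP())))
		return (unsigned long)(insn + 8);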