Hi Björn, On Thu, Mar 7, 2024 at 8:27 PM Björn Töpel <bj...@kernel.org> wrote: > > Puranjay! > > Puranjay Mohan <puranja...@gmail.com> writes: > > > This patch enables support for DYNAMIC_FTRACE_WITH_CALL_OPS on RISC-V. > > This allows each ftrace callsite to provide an ftrace_ops to the common > > ftrace trampoline, allowing each callsite to invoke distinct tracer > > functions without the need to fall back to list processing or to > > allocate custom trampolines for each callsite. This significantly speeds > > up cases where multiple distinct trace functions are used and callsites > > are mostly traced by a single tracer. > > > > The idea and most of the implementation is taken from the ARM64's > > implementation of the same feature. The idea is to place a pointer to > > the ftrace_ops as a literal at a fixed offset from the function entry > > point, which can be recovered by the common ftrace trampoline. > > Not really a review, but some more background; Another rationale (on-top > of the improved per-call performance!) for CALL_OPS was to use it to > build ftrace direct call support (which BPF uses a lot!). Mark, please > correct me if I'm lying here! > > On Arm64, CALL_OPS makes it possible to implement direct calls, while > only patching one BL instruction -- nice! > > On RISC-V we cannot use use the same ideas as Arm64 straight off, > because the range of jal (compare to BL) is simply too short (+/-1M). > So, on RISC-V we need to use a full auipc/jal pair (the text patching > story is another chapter, but let's leave that aside for now). Since we > have to patch multiple instructions, the cmodx situation doesn't really > improve with CALL_OPS. > > Let's say that we continue building on your patch and implement direct > calls on CALL_OPS for RISC-V as well. > > From Florent's commit message for direct calls: > > | There are a few cases to distinguish: > | - If a direct call ops is the only one tracing a function: > | - If the direct called trampoline is within the reach of a BL > | instruction > | -> the ftrace patchsite jumps to the trampoline > | - Else > | -> the ftrace patchsite jumps to the ftrace_caller trampoline > which > | reads the ops pointer in the patchsite and jumps to the direct > | call address stored in the ops > | - Else > | -> the ftrace patchsite jumps to the ftrace_caller trampoline and its > | ops literal points to ftrace_list_ops so it iterates over all > | registered ftrace ops, including the direct call ops and calls its > | call_direct_funcs handler which stores the direct called > | trampoline's address in the ftrace_regs and the ftrace_caller > | trampoline will return to that address instead of returning to the > | traced function > > On RISC-V, where auipc/jalr is used, the direct called trampoline would > always be reachable, and then first Else-clause would never be entered. > This means the the performance for direct calls would be the same as the > one we have today (i.e. no regression!). > > RISC-V does like x86 does (-ish) -- patch multiple instructions, long > reach. > > Arm64 uses CALL_OPS and patch one instruction BL. > > Now, with this background in mind, compared to what we have today, > CALL_OPS would give us (again assuming we're using it for direct calls): > > * Better performance for tracer per-call (faster ops lookup) GOOD
^ this was the only motivation for me to implement this patch. I don't think implementing direct calls over call ops is fruitful for RISC-V because once the auipc/jalr can be patched atomically, the direct call trampoline is always reachable. Solving the atomic text patching problem would be fun!! I am eager to see how it will be solved. > * Larger text size (function alignment + extra nops) BAD > * Same direct call performance NEUTRAL > * Same complicated text patching required NEUTRAL > > It would be interesting to see how the per-call performance would > improve on x86 with CALL_OPS! ;-) If I remember from Steven's talk, x86 uses dynamically allocated trampolines for per callsite tracers, would CALL_OPS provide better performance than that? > > I'm trying to wrap my head if it makes sense to have it on RISC-V, given > that we're a bit different from Arm64. Does the scale tip to the GOOD > side? > > Oh, and we really need to see performance numbers on real HW! I have a > VF2 that I could try this series on. It would be great if you can do it :D. Thanks, Puranjay