This RFC introduces indirect call promotion at runtime, which, for the sake of simplicity (and branding), is called here "relpolines" (relative call + trampoline). Relpolines are mainly intended as a way of reducing the retpoline overhead incurred by the Spectre v2 mitigation.
Unlike indirect call promotion through profile-guided optimization, the proposed approach does not require a profiling stage, works well with modules whose address is unknown, and can adapt to changing workloads.

The main idea is simple: for every indirect call, we inject a piece of code with fast-path and slow-path calls. The fast path is used if the target matches the expected (hot) target. The slow path uses a retpoline. During training, the slow path is set to call a function that saves the call source and target in a hash table and keeps a count of the call frequency. The most common target is then patched into the hot path.

The patching is done on the fly by patching the conditional branch (opcode and offset) that is used to compare the target to the hot target. This makes it possible to direct all cores to the fast path while patching the slow path, and vice versa. Patching follows two more rules:

(1) Only patch a single byte when the code might be executed by any core.

(2) When patching more than one byte, ensure that no core is running the to-be-patched code, by preventing this code from being preempted and by using synchronize_sched() after patching the branch that jumps over this code.

Changing all the indirect calls to use relpolines is done using assembly macro magic. There are alternative solutions, but this one is relatively simple and transparent. There is also logic to retrain the software predictor, but the policy it uses may need to be refined.

In the end, the results are not bad (2 VCPU VM, throughput reported):

		base		relpoline
		----		---------
nginx		22898		25178 (+10%)
redis-ycsb	24523		25486 (+4%)
dbench		2144		2103 (+2%)

When retpolines are disabled and retraining is off, there is still a performance benefit of up to 2% (nginx), but it is much less impressive.

There are several open issues: retraining should be done when modules are removed; CPU hotplug is not supported; x86-32 is probably broken; and the Makefile does not rebuild when the relpoline code is changed.
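To make the fast-path/slow-path dispatch and the training step described above more concrete, here is a minimal user-space sketch in plain C. It is not the code from the patches: every name in it (struct call_site, relpoline_call, slow_path, promote_hottest, the two sample targets) is hypothetical, a small linear table stands in for the hash table, an ordinary indirect call stands in for the retpoline, and storing a pointer stands in for patching the branch.

```c
#include <assert.h>
#include <stddef.h>

/* All names here are hypothetical and exist only for illustration. */
typedef int (*handler_t)(int);

#define MAX_TARGETS 8

struct call_site {
	handler_t hot;                   /* promoted target, NULL while untrained */
	handler_t seen[MAX_TARGETS];     /* targets observed on the slow path */
	unsigned long hits[MAX_TARGETS]; /* call count per observed target */
	size_t nseen;
};

/* Two sample call targets. */
static int add_one(int x) { return x + 1; }
static int twice(int x) { return 2 * x; }

/* Slow path during training: record the target, then call it indirectly. */
static int slow_path(struct call_site *cs, handler_t target, int arg)
{
	size_t i;

	for (i = 0; i < cs->nseen; i++)
		if (cs->seen[i] == target)
			break;
	if (i == cs->nseen && cs->nseen < MAX_TARGETS) {
		cs->seen[i] = target;	/* first time this target is seen */
		cs->hits[i] = 0;
		cs->nseen++;
	}
	if (i < cs->nseen)
		cs->hits[i]++;		/* a full table just skips counting */
	return target(arg);		/* stand-in for the retpoline */
}

/*
 * Dispatch: if the target equals the hot target, take the fast path.
 * In the kernel the fast path would be a direct (relative) call; plain C
 * can only model the comparison, not the patched call instruction.
 */
static int relpoline_call(struct call_site *cs, handler_t target, int arg)
{
	assert(target != NULL);
	if (cs->hot && target == cs->hot)
		return cs->hot(arg);
	return slow_path(cs, target, arg);
}

/* Model of "patching": promote the most frequently seen target. */
static void promote_hottest(struct call_site *cs)
{
	size_t i, best = 0;

	if (cs->nseen == 0)
		return;
	for (i = 1; i < cs->nseen; i++)
		if (cs->hits[i] > cs->hits[best])
			best = i;
	cs->hot = cs->seen[best];
}
```

A caller would zero-initialize one struct call_site per call site, route calls through relpoline_call() while training, and then call promote_hottest() once enough samples have been collected. Retraining and the cross-core patching rules are deliberately left out of this sketch.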
Having said that, I am worried that some of the approaches I took would challenge the new code-of-conduct, so I thought of getting some feedback before putting more effort into it.

Nadav Amit (5):
  x86: introduce preemption disable prefix
  x86: patch indirect branch promotion
  x86: interface for accessing indirect branch locations
  x86: learning and patching indirect branch targets
  x86: relpoline: disabling interface

 arch/x86/entry/entry_64.S            |  10 +
 arch/x86/include/asm/nospec-branch.h | 158 +++++
 arch/x86/include/asm/sections.h      |   2 +
 arch/x86/kernel/Makefile             |   1 +
 arch/x86/kernel/asm-offsets.c        |   6 +
 arch/x86/kernel/macros.S             |   1 +
 arch/x86/kernel/nospec-branch.c      | 899 +++++++++++++++++++++++++++
 arch/x86/kernel/vmlinux.lds.S        |   7 +
 arch/x86/lib/retpoline.S             |  75 +++
 include/linux/module.h               |   5 +
 kernel/module.c                      |   8 +
 kernel/seccomp.c                     |   2 +
 12 files changed, 1174 insertions(+)
 create mode 100644 arch/x86/kernel/nospec-branch.c

-- 
2.17.1