On 01/16/2019 08:23 AM, Björn Töpel wrote:
> On Wed, 16 Jan 2019 at 00:50, Daniel Borkmann <dan...@iogearbox.net> wrote:
>>
>> On 01/15/2019 09:35 AM, Björn Töpel wrote:
>>> This commit adds eBPF JIT for RV64G.
>>>
>>> Codewise, it needs some refactoring. Currently there's a bit too much
>>> copy-and-paste going on, and I know some places where I could optimize
>>> the code generation a bit (mostly BPF_K type of instructions, dealing
>>> with immediates).
>>
>> Nice work! :)
>>
>>> From a features perspective, two things are missing:
>>>
>>> * tail calls
>>> * "far-branches", i.e. conditional branches that reach beyond 13b.
>>>
>>> The test_bpf.ko passes all tests.
>>
>> Did you also check test_verifier under jit with/without jit hardening
>> enabled? That one contains lots of runtime tests as well. Probably makes
>> sense to check under CONFIG_BPF_JIT_ALWAYS_ON to see what fails the JIT;
>> the test_verifier also contains various tail call tests targeted at JITs,
>> for example.
>>
>
> Good point! I will do that. The only selftests/bpf program that I ran
> (and passed) was "test_progs". I'll make sure that the complete bpf
> selftests suite passes as well!
>
>> Nit: please definitely also add a MAINTAINERS entry with at least yourself
>> under BPF JIT section, and update Documentation/sysctl/net.txt with riscv64.
>>
>
> Ah! Yes, I'll fix that.
>
>>> Signed-off-by: Björn Töpel <bjorn.to...@gmail.com>
>>> ---
>>>  arch/riscv/net/bpf_jit_comp.c | 1608 +++++++++++++++++++++++++++++++++
>>>  1 file changed, 1608 insertions(+)
>>>
>>> diff --git a/arch/riscv/net/bpf_jit_comp.c b/arch/riscv/net/bpf_jit_comp.c
>>> index 7e359d3249ee..562d56eb8d23 100644
>>> --- a/arch/riscv/net/bpf_jit_comp.c
>>> +++ b/arch/riscv/net/bpf_jit_comp.c
>>> @@ -1,4 +1,1612 @@
>>> +// SPDX-License-Identifier: GPL-2.0
>>> +/*
>>> + * BPF JIT compiler for RV64G
>>> + *
>>> + * Copyright(c) 2019 Björn Töpel <bjorn.to...@gmail.com>
>>> + *
>>> + */
>>> +
>>> +#include <linux/bpf.h>
>>> +#include <linux/filter.h>
>>> +#include <asm/cacheflush.h>
>>> +
>>> +#define TMP_REG_0 (MAX_BPF_JIT_REG + 0)
>>> +#define TMP_REG_1 (MAX_BPF_JIT_REG + 1)
>>
>> Not used?
>>
>
> Correct! I'll get rid of them.
>
>>> +#define TAIL_CALL_REG (MAX_BPF_JIT_REG + 2)
>>> +
>>> +enum rv_register {
>>> +	RV_REG_ZERO =	0,	/* The constant value 0 */
>>> +	RV_REG_RA =	1,	/* Return address */
>>> +	RV_REG_SP =	2,	/* Stack pointer */
>>> +	RV_REG_GP =	3,	/* Global pointer */
>>> +	RV_REG_TP =	4,	/* Thread pointer */
>>> +	RV_REG_T0 =	5,	/* Temporaries */
>>> +	RV_REG_T1 =	6,
>>> +	RV_REG_T2 =	7,
>>> +	RV_REG_FP =	8,
>>> +	RV_REG_S1 =	9,	/* Saved registers */
>>> +	RV_REG_A0 =	10,	/* Function argument/return values */
>>> +	RV_REG_A1 =	11,	/* Function arguments */
>>> +	RV_REG_A2 =	12,
>>> +	RV_REG_A3 =	13,
>>> +	RV_REG_A4 =	14,
>>> +	RV_REG_A5 =	15,
>>> +	RV_REG_A6 =	16,
>>> +	RV_REG_A7 =	17,
>>> +	RV_REG_S2 =	18,	/* Saved registers */
>>> +	RV_REG_S3 =	19,
>>> +	RV_REG_S4 =	20,
>>> +	RV_REG_S5 =	21,
>>> +	RV_REG_S6 =	22,
>>> +	RV_REG_S7 =	23,
>>> +	RV_REG_S8 =	24,
>>> +	RV_REG_S9 =	25,
>>> +	RV_REG_S10 =	26,
>>> +	RV_REG_S11 =	27,
>>> +	RV_REG_T3 =	28,	/* Temporaries */
>>> +	RV_REG_T4 =	29,
>>> +	RV_REG_T5 =	30,
>>> +	RV_REG_T6 =	31,
>>> +};
>>> +
>>> +struct rv_jit_context {
>>> +	struct bpf_prog *prog;
>>> +	u32 *insns;		/* RV insns */
>>> +	int ninsns;
>>> +	int epilogue_offset;
>>> +	int *offset;		/* BPF to RV */
>>> +	unsigned long seen_reg_bits;
>>> +	int stack_size;
>>> +};
>>> +
>>> +struct rv_jit_data {
>>> +	struct bpf_binary_header *header;
>>> +	u8 *image;
>>> +	struct rv_jit_context ctx;
>>> +};
>>> +
>>> +static u8 bpf_to_rv_reg(int bpf_reg, struct rv_jit_context *ctx)
>>> +{
>>
>> This one can also be simplified by having a simple mapping as in
>> other JITs and then mark __set_bit(<reg>) in the small bpf_to_rv_reg()
>> helper.
>>
>
> Yeah, I agree. Much better. I'll take that route.
>
>>> +	switch (bpf_reg) {
>>> +	/* Return value */
>>> +	case BPF_REG_0:
>>> +		__set_bit(RV_REG_A5, &ctx->seen_reg_bits);
>>> +		return RV_REG_A5;
>>> +	/* Function arguments */
>>> +	case BPF_REG_1:
>>> +		__set_bit(RV_REG_A0, &ctx->seen_reg_bits);
>>> +		return RV_REG_A0;
>>> +	case BPF_REG_2:
>>> +		__set_bit(RV_REG_A1, &ctx->seen_reg_bits);
>>> +		return RV_REG_A1;
>>> +	case BPF_REG_3:
>>> +		__set_bit(RV_REG_A2, &ctx->seen_reg_bits);
>>> +		return RV_REG_A2;
>>> +	case BPF_REG_4:
>>> +		__set_bit(RV_REG_A3, &ctx->seen_reg_bits);
>>> +		return RV_REG_A3;
>>> +	case BPF_REG_5:
>>> +		__set_bit(RV_REG_A4, &ctx->seen_reg_bits);
>>> +		return RV_REG_A4;
>>> +	/* Callee saved registers */
>>> +	case BPF_REG_6:
>>> +		__set_bit(RV_REG_S1, &ctx->seen_reg_bits);
>>> +		return RV_REG_S1;
>>> +	case BPF_REG_7:
>>> +		__set_bit(RV_REG_S2, &ctx->seen_reg_bits);
>>> +		return RV_REG_S2;
>>> +	case BPF_REG_8:
>>> +		__set_bit(RV_REG_S3, &ctx->seen_reg_bits);
>>> +		return RV_REG_S3;
>>> +	case BPF_REG_9:
>>> +		__set_bit(RV_REG_S4, &ctx->seen_reg_bits);
>>> +		return RV_REG_S4;
>>> +	/* Stack read-only frame pointer to access stack */
>>> +	case BPF_REG_FP:
>>> +		__set_bit(RV_REG_S5, &ctx->seen_reg_bits);
>>> +		return RV_REG_S5;
>>> +	/* Temporary register */
>>> +	case BPF_REG_AX:
>>> +		__set_bit(RV_REG_T0, &ctx->seen_reg_bits);
>>> +		return RV_REG_T0;
>>> +	/* Tail call counter */
>>> +	case TAIL_CALL_REG:
>>> +		__set_bit(RV_REG_S6, &ctx->seen_reg_bits);
>>> +		return RV_REG_S6;
>>> +	default:
>>> +		return 0;
>>> +	}
>>> +};
>> [...]
>>> +	/* tail call */
>>> +	case BPF_JMP | BPF_TAIL_CALL:
>>> +		rd = bpf_to_rv_reg(TAIL_CALL_REG, ctx);
>>> +		pr_err("bpf-jit: tail call not supported yet!\n");
>>> +		return -1;
>>
>> There are two options here, either fixed size prologue where you can
>> then jump over it in tail call case, or dynamic one which would make
>> it slower due to reg restore but shrinks image for non-tail calls.
>
> So, it would be the latter then, which is pretty much like a more
> expensive (due to the tail call depth checks) function call.
Right.

> For the fixed prologue: how does, say x86, deal with BPF stack usage
> in the tail call case? If the caller doesn't use the bpf stack, but
> the callee does. From a quick glance in the code, the x86 prologue
> still uses aux->stack_depth. If the callee has a different stack usage
> than the caller, and then the callee does a function call, wouldn't
> this mess up the frame? (Yeah, obviously missing something! :-))

Basically in this case the verifier sets the stack size to MAX_BPF_STACK
when it finds a tail call in the prog, meaning the callee will be reusing
a frame that is >= its own stack usage, and then upon exit unwinds it via
leave+ret.

Cheers,
Daniel