Hi All,

V1 patches: http://thread.gmane.org/gmane.linux.kernel/1605783
V2 patches: http://thread.gmane.org/gmane.linux.kernel/1642325
V3 summary:
- as suggested by Daniel, added an on-the-fly converter from old BPF (aka BPF32)
  into extended BPF (aka BPF64)
- as suggested by Peter Anvin, added 32-bit subregisters; they don't add much
  to interpreter speed, but they simplify the bpf32->bpf64 mapping
- added the sysctl net.core.bpf64_enable flag; if enabled, old BPF filters are
  converted to BPF64 and used by tcpdump/cls/xtables. Safety of the filters is
  verified by the old BPF sk_chk_filter(). BPF64's bpf_check() is dropped from
  this patch to simplify review.

Addition of 32-bit subregs requires some work on the BPF64 x86_64 JIT, so it is
not included in this patch set. The LLVM BPF64 backend also needs to be taught
to take advantage of 32-bit subregs.

Initially the BPF64 instruction set was designed for maximum performance after
JIT; now it has been tweaked for good interpreter speed as well. Eventually
BPF64 can completely replace the existing BPF on all architectures.

Two key reasons why the BPF64 interpreter is noticeably faster than the
existing BPF32 interpreter:

1. fall-through jumps
   In BPF32, jump instructions are forced to go to either the 'true' or the
   'false' branch, which causes branch-miss penalties. BPF64 jump instructions
   have one branch and a fall-through, which fits CPU branch-predictor logic
   better. 'perf stat' shows a drastic difference in branch-misses.
   (A sketch of the two encodings follows at the end of this mail.)

2. jump-threaded implementation of the interpreter vs a switch statement
   Instead of a single tablejump at the top of the 'switch' statement, GCC
   generates multiple tablejump instructions, which helps the CPU branch
   predictor. (A computed-goto sketch follows at the end of this mail.)

Performance of two BPF filters generated by libpcap was measured on x86_64,
i386 and arm32.

fprog #1 is taken from Documentation/networking/filter.txt:
tcpdump -i eth0 port 22 -dd

fprog #2 is taken from 'man tcpdump':
tcpdump -i eth0 'tcp port 22 and (((ip[2:2] - ((ip[0]&0xf)<<2)) -
   ((tcp[12]&0xf0)>>2)) != 0)' -dd

Other libpcap programs show similar performance differences.

Raw performance data from the BPF micro-benchmark:
SK_RUN_FILTER on the same SKB (cache-hit) or on 10k SKBs (cache-miss);
time in nsec per call, smaller is better.
(A sketch of the harness shape follows at the end of this mail.)

--x86_64--
           fprog #1   fprog #1   fprog #2   fprog #2
           cache-hit  cache-miss cache-hit  cache-miss
BPF32          90         98        207        220
BPF64          28         85         60        108
BPF32_JIT      12         33         17         44
BPF64_JIT     TBD

--i386--
           fprog #1   fprog #1   fprog #2   fprog #2
           cache-hit  cache-miss cache-hit  cache-miss
BPF32         107        136        227        252
BPF64          40        119         69        172

--arm32--
           fprog #1   fprog #1   fprog #2   fprog #2
           cache-hit  cache-miss cache-hit  cache-miss
BPF32         202        300        475        540
BPF64         139        270        296        470
BPF32_JIT      26        182         37        202
BPF64_JIT     TBD

On Intel CPUs the BPF64 interpreter is significantly faster than the old BPF
interpreter. The existing BPF32_JIT is obviously even faster; BPF64_JIT has
similar performance.

Tested with Daniel's 'trinify BPF fuzzer'.

TODO:
- the bpf32->bpf64 converter doesn't recognize seccomp and negative offsets
  yet; fix that
- add 32-bit subregs to the BPF64 x86_64 JIT and the LLVM backend
- add the bpf64 verifier, so that tcpdump/cls/xt and others can insert both
  bpf32 and bpf64 programs through the same interface
- add bpf tables, complete 'dropmonitor' and get back to systemtap-like probes
  with bpf64

Please review.
Thanks!
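For reference on the fall-through point above, here is the classic BPF
instruction layout from include/uapi/linux/filter.h next to a sketch of a
BPF64-style instruction. The BPF64 field names and widths below are an
assumption based on this cover letter, not a quote of the patch:

/* classic BPF: every conditional jump carries both targets */
struct sock_filter {
	__u16	code;	/* opcode */
	__u8	jt;	/* offset taken when the condition is true */
	__u8	jf;	/* offset taken when the condition is false */
	__u32	k;	/* generic immediate field */
};

/*
 * BPF64-style instruction (sketch, field names assumed): a conditional
 * jump has a single signed offset and simply falls through when not
 * taken, which matches CPU branch-predictor behaviour much better.
 */
struct bpf64_insn_sketch {
	__u8	code;		/* opcode */
	__u8	a_reg:4;	/* destination register */
	__u8	x_reg:4;	/* source register */
	__s16	off;		/* jump offset; not-taken path falls through */
	__s32	imm;		/* signed immediate */
};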
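The 'jump-threaded vs switch' point can be illustrated with GCC's
computed-goto extension. This is a minimal userspace sketch of the technique
with made-up opcodes, not the interpreter from the patch; every handler ends
with its own indirect jump, so the branch predictor gets one entry per opcode
instead of one shared tablejump:

#include <stdio.h>

enum { OP_ADD, OP_SUB, OP_EXIT };

struct insn { int op; int imm; };

static long run(const struct insn *pc)
{
	static const void *jumptable[] = {
		[OP_ADD]  = &&do_add,
		[OP_SUB]  = &&do_sub,
		[OP_EXIT] = &&do_exit,
	};
	long acc = 0;

	/* dispatch to the handler of the next instruction */
#define NEXT() goto *jumptable[(++pc)->op]

	goto *jumptable[pc->op];
do_add:
	acc += pc->imm;
	NEXT();
do_sub:
	acc -= pc->imm;
	NEXT();
do_exit:
	return acc;
}

int main(void)
{
	struct insn prog[] = {
		{ OP_ADD, 5 }, { OP_SUB, 2 }, { OP_EXIT, 0 },
	};

	printf("%ld\n", run(prog));	/* prints 3 */
	return 0;
}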
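For the micro-benchmark numbers, this is a rough sketch of how such a harness
could look; it is an assumption about the setup, not the harness actually
used. 'fp' and the skb pool are assumed to be built elsewhere:

#include <linux/filter.h>
#include <linux/ktime.h>
#include <linux/math64.h>
#include <linux/printk.h>
#include <linux/skbuff.h>

/*
 * Average cost of one SK_RUN_FILTER() call in nanoseconds.
 * nr_skbs == 1 corresponds to the cache-hit column,
 * nr_skbs == 10000 to the cache-miss column.
 */
static u64 bench_ns_per_call(struct sk_filter *fp, struct sk_buff **skbs,
			     int nr_skbs, int loops)
{
	unsigned int sum = 0;	/* keep the calls from being optimized away */
	ktime_t start, end;
	int i;

	start = ktime_get();
	for (i = 0; i < loops; i++)
		sum += SK_RUN_FILTER(fp, skbs[i % nr_skbs]);
	end = ktime_get();

	pr_info("filter result sum %u\n", sum);
	return div_u64(ktime_to_ns(ktime_sub(end, start)), loops);
}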
Alexei Starovoitov (1):
  bpf32->bpf64 mapper and bpf64 interpreter

 include/linux/filter.h      |    9 +-
 include/linux/netdevice.h   |    1 +
 include/uapi/linux/filter.h |   37 ++-
 net/core/Makefile           |    2 +-
 net/core/bpf_run.c          |  766 +++++++++++++++++++++++++++++++++++++++++++
 net/core/filter.c           |  114 ++++++-
 net/core/sysctl_net_core.c  |    7 +
 7 files changed, 913 insertions(+), 23 deletions(-)
 create mode 100644 net/core/bpf_run.c

--
1.7.9.5