On Sun, Mar 9, 2014 at 3:00 PM, Daniel Borkmann <borkm...@iogearbox.net> wrote:
> On 03/09/2014 06:08 PM, Alexei Starovoitov wrote:
>>
>> On Sun, Mar 9, 2014 at 5:29 AM, Daniel Borkmann <borkm...@iogearbox.net>
>> wrote:
>>>
>>> On 03/09/2014 12:15 AM, Alexei Starovoitov wrote:
>>>>
>>>> Extended BPF extends old BPF in the following ways:
>>>> - from 2 to 10 registers
>>>>   Original BPF has two registers (A and X) and hidden frame pointer.
>>>>   Extended BPF has ten registers and read-only frame pointer.
>>>> - from 32-bit registers to 64-bit registers
>>>>   semantics of old 32-bit ALU operations are preserved via 32-bit
>>>>   subregisters
>>>> - if (cond) jump_true; else jump_false;
>>>>   old BPF insns are replaced with:
>>>>   if (cond) jump_true; /* else fallthrough */
>>>> - adds signed > and >= insns
>>>> - 16 4-byte stack slots for register spill-fill replaced with
>>>>   up to 512 bytes of multi-use stack space
>>>> - introduces bpf_call insn and register passing convention for zero
>>>>   overhead calls from/to other kernel functions (not part of this
>>>>   patch)
>>>> - adds arithmetic right shift insn
>>>> - adds swab32/swab64 insns
>>>> - adds atomic_add insn
>>>> - old tax/txa insns are replaced with 'mov dst,src' insn
>>>>
>>>> Extended BPF is designed to be JITed with one to one mapping, which
>>>> allows GCC/LLVM backends to generate optimized BPF code that performs
>>>> almost as fast as natively compiled code
>>>>
>>>> sk_convert_filter() remaps old style insns into extended:
>>>> 'sock_filter' instructions are remapped on the fly to
>>>> 'sock_filter_ext' extended instructions when
>>>> sysctl net.core.bpf_ext_enable=1
>>>>
>>>> Old filter comes through sk_attach_filter() or
>>>> sk_unattached_filter_create()
>>>>   if (bpf_ext_enable) {
>>>>     convert to new
>>>>     sk_chk_filter() - check old bpf
>>>>     use sk_run_filter_ext() - new interpreter
>>>>   } else {
>>>>     sk_chk_filter() - check old bpf
>>>>     if (bpf_jit_enable)
>>>>       use old jit
>>>>     else
>>>>       use sk_run_filter() - old interpreter
>>>>   }
>>>>
>>>> sk_run_filter_ext() interpreter is noticeably faster
>>>> than sk_run_filter() for two reasons:
>>>>
>>>> 1.fall-through jumps
>>>>   Old BPF jump instructions are forced to go either 'true' or 'false'
>>>>   branch which causes branch-miss penalty.
>>>>   Extended BPF jump instructions have one branch and fall-through,
>>>>   which fit CPU branch predictor logic better.
>>>>   'perf stat' shows drastic difference for branch-misses.
>>>>
>>>> 2.jump-threaded implementation of interpreter vs switch statement
>>>>   Instead of single tablejump at the top of 'switch' statement, GCC will
>>>>   generate multiple tablejump instructions, which helps CPU branch
>>>>   predictor
>>>>
>>>> Performance of two BPF filters generated by libpcap was measured
>>>> on x86_64, i386 and arm32.
>>>>
>>>> fprog #1 is taken from Documentation/networking/filter.txt:
>>>> tcpdump -i eth0 port 22 -dd
>>>>
>>>> fprog #2 is taken from 'man tcpdump':
>>>> tcpdump -i eth0 'tcp port 22 and (((ip[2:2] - ((ip[0]&0xf)<<2)) -
>>>>    ((tcp[12]&0xf0)>>2)) != 0)' -dd
>>>>
>>>> Other libpcap programs have similar performance differences.
>>>>
>>>> Raw performance data from BPF micro-benchmark:
>>>> SK_RUN_FILTER on same SKB (cache-hit) or 10k SKBs (cache-miss)
>>>> time in nsec per call, smaller is better
>>>> --x86_64--
>>>>              fprog #1   fprog #1    fprog #2   fprog #2
>>>>              cache-hit  cache-miss  cache-hit  cache-miss
>>>> old BPF         90        101         192        202
>>>> ext BPF         31         71          47         97
>>>> old BPF jit     12         34          17         44
>>>> ext BPF jit    TBD
>>>>
>>>> --i386--
>>>>              fprog #1   fprog #1    fprog #2   fprog #2
>>>>              cache-hit  cache-miss  cache-hit  cache-miss
>>>> old BPF        107        136         227        252
>>>> ext BPF         40        119          69        172
>>>>
>>>> --arm32--
>>>>              fprog #1   fprog #1    fprog #2   fprog #2
>>>>              cache-hit  cache-miss  cache-hit  cache-miss
>>>> old BPF        202        300         475        540
>>>> ext BPF        180        270         330        470
>>>> old BPF jit     26        182          37        202
>>>> new BPF jit    TBD
>>>>
>>>> Tested with trinity BPF fuzzer
>>>>
>>>> Future work:
>>>>
>>>> 0. add bpf/ebpf testsuite to tools/testing/selftests/net/bpf
>>>>
>>>> 1. add extended BPF JIT for x86_64
>>>>
>>>> 2. add inband old/new demux and extended BPF verifier, so that new
>>>>    programs can be loaded through old sk_attach_filter() and
>>>>    sk_unattached_filter_create() interfaces
>>>>
>>>> 3. tracing filters systemtap-like with extended BPF
>>>>
>>>> 4. OVS with extended BPF
>>>>
>>>> 5. nftables with extended BPF
>>>>
>>>> Signed-off-by: Alexei Starovoitov <a...@plumgrid.com>
>>>> Acked-by: Hagen Paul Pfeifer <ha...@jauu.net>
>>>> Reviewed-by: Daniel Borkmann <dbork...@redhat.com>
>>>
>>>
>>> One more question or possible issue that came through my mind: When
>>> someone attaches a socket filter from user space, and bpf_ext_enable=1
>>> then the old filter will transparently be converted to the new
>>> representation. If then user space (e.g. through checkpoint restore)
>>> will issue a sk_get_filter() and thus we're calling sk_decode_filter()
>>> on sk->sk_filter and, therefore, try to decode what we stored in
>>> insns_ext[] with the assumption we still have the old code. Would that
>>> actually crash (or leak memory, or just return garbage), as we access
>>> decodes[] array with filt->code? Would be great if you could
>>> double-check.
>>
>>
>> ohh. yes. missed that.
>> when bpf_ext_enable=1 I think it's cleaner to return ebpf filter.
>> This way the user space can see how old bpf filter was converted.
>>
>> Of course we can allocate extra memory and keep original bpf code there
>> just to return it via sk_get_filter(), but that seems overkill.
>
>
> Cc'ing Pavel for a8fc92778080 ("sk-filter: Add ability to get socket
> filter program (v2)").
>
> I think the issue can be that when applications could get migrated
> from one machine to another and their kernel won't support ebpf yet,
> then filter could not get loaded this way as it's expected to return
> what the user loaded. The trade-off, however, is that the original
> BPF code needs to be stored as well. :(

I see.
...even on one machine: bpf_ext=1, attach, get_filter, bpf_ext=0, re-attach...
So we need to save the original. At least we don't need to keep it for
'unattached' filters.

Should the memory come from the sk_optmem budget, or is plain kmalloc enough?
The latter would be simpler to implement, but the former is probably cleaner?

Thanks
Alexei
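
To make the "save the original" idea concrete, here is a rough user-space sketch, not kernel code. It assumes the attached filter simply carries a private copy of the classic program next to the converted one, and that get_filter hands back that copy instead of decoding the extended insns. All names (old_insn, filter_model, filter_model_attach/get/release) are made up for illustration, and malloc stands in for the plain kmalloc variant; sk_optmem accounting is not modelled.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Classic BPF instruction, same field layout as struct sock_filter. */
struct old_insn {
	uint16_t code;
	uint8_t  jt;
	uint8_t  jf;
	uint32_t k;
};

/* Hypothetical stand-in for an attached filter (not the kernel struct). */
struct filter_model {
	unsigned int     orig_len;  /* number of classic insns saved */
	struct old_insn *orig_prog; /* NULL when nothing was saved */
	/* the converted extended program would be stored alongside */
};

/*
 * Attach: optionally keep a private copy of the classic program.
 * keep_orig == 0 models the 'unattached' case, which never needs it.
 * Returns 0 on success, -1 on allocation failure.
 */
int filter_model_attach(struct filter_model *fp, const struct old_insn *prog,
			unsigned int len, int keep_orig)
{
	assert(fp != NULL && prog != NULL);

	fp->orig_len = 0;
	fp->orig_prog = NULL;
	if (!keep_orig || len == 0)
		return 0;

	/* Plain allocation; a budgeted variant would charge the socket here. */
	fp->orig_prog = malloc(len * sizeof(*prog));
	if (fp->orig_prog == NULL)
		return -1;

	memcpy(fp->orig_prog, prog, len * sizeof(*prog));
	fp->orig_len = len;
	return 0;
}

/*
 * Get: copy the saved classic program back out, untouched.
 * Returns the number of insns copied, or -1 if no original was saved
 * or the caller's buffer is too small.
 */
int filter_model_get(const struct filter_model *fp, struct old_insn *buf,
		     unsigned int buf_len)
{
	assert(fp != NULL && buf != NULL);

	if (fp->orig_prog == NULL || buf_len < fp->orig_len)
		return -1;

	memcpy(buf, fp->orig_prog, fp->orig_len * sizeof(*buf));
	return (int)fp->orig_len;
}

/* Release: free the saved copy together with the filter. */
void filter_model_release(struct filter_model *fp)
{
	free(fp->orig_prog);
	fp->orig_prog = NULL;
	fp->orig_len = 0;
}
```

With this shape, attach followed by get returns exactly what user space loaded regardless of how the program is executed internally, so the bpf_ext=1 / get_filter / bpf_ext=0 / re-attach sequence keeps working. The cost is one extra copy per attached filter.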