On Wed, 04 Dec 2013 09:48:44 +0900
Masami Hiramatsu <masami.hiramatsu...@hitachi.com> wrote:
> (2013/12/03 13:28), Alexei Starovoitov wrote:
> > Such filters can be written in C and allow safe read-only access to
> > any kernel data structure. Like systemtap but with safety guaranteed
> > by kernel.
> >
> > The user can do:
> > cat bpf_program > /sys/kernel/debug/tracing/.../filter
> > if tracing event is either static or dynamic via kprobe_events.
> >
> > The program can be anything as long as bpf_check() can verify its
> > safety. For example, the user can create a kprobe_event on
> > dst_discard() and use logically following code inside the BPF filter:
> > skb = (struct sk_buff *)ctx->regs.di;
> > dev = bpf_load_pointer(&skb->dev);
> > to access 'struct net_device'. Since its prototype is
> > 'int dst_discard(struct sk_buff *skb);', the 'skb' pointer is in the
> > 'rdi' register on x86_64. bpf_load_pointer() will try to fetch the
> > 'dev' field of the 'sk_buff' structure and will suppress the
> > page-fault if the pointer is incorrect.
>
> Hmm, I doubt it is a good way to integrate with ftrace.
> I prefer to use this for replacing current ftrace filter,

I'm not sure how we can do that. Especially since BPF is very
arch-specific, and the current filters work for all archs.

> fetch functions and actions. In that case, we can continue
> to use current interface but much faster to trace.
> Also, we can see what filter/arguments/actions are set
> on each event.

There's also the problem that the current filters work with the results
of what is written to the buffer, not what is passed in by the trace
point, as that isn't even displayed to the user.

For example, sched_switch gets passed struct task_struct *prev and
*next, and from those we save prev_comm, prev_pid, prev_prio,
prev_state, next_comm, next_pid, next_prio and next_state.
These are expressed to the user by the format file of the event:

	field:char prev_comm[32];	offset:16;	size:16;	signed:1;
	field:pid_t prev_pid;	offset:32;	size:4;	signed:1;
	field:int prev_prio;	offset:36;	size:4;	signed:1;
	field:long prev_state;	offset:40;	size:8;	signed:1;
	field:char next_comm[32];	offset:48;	size:16;	signed:1;
	field:pid_t next_pid;	offset:64;	size:4;	signed:1;
	field:int next_prio;	offset:68;	size:4;	signed:1;

And the filters can check "next_prio > 10" and what not. The bpf
program needs to access next->prio. There's nothing that shows the
user what is passed to the tracepoint, and from that, what structure
member to use from there. The user would be required to look at the
source code of the given kernel. A requirement not needed by the
current implementation.

Also, there are results that can not be trivially converted. Taking a
quick look at some TRACE_EVENT() structures, I found bcache_bio, which
has this:

	TP_fast_assign(
		__entry->dev		= bio->bi_bdev->bd_dev;
		__entry->sector		= bio->bi_sector;
		__entry->nr_sector	= bio->bi_size >> 9;
		blk_fill_rwbs(__entry->rwbs, bio->bi_rw, bio->bi_size);
	),

where blk_fill_rwbs() updates the entry->rwbs string based on the
bi_rw field. A filter must remain backward compatible with something
like:

	rwbs == "w"

or

	rwbs =~ '*w*'

Now maybe we can make the filter code use some of the bpf if possible,
but to get the result, it still needs to write to the ring buffer, and
discard it if it is incorrect. That will not make it any faster than
the original trace, but perhaps faster than the trace + current
filter.

The speed up that was shown was because we were processing the
parameters of the trace point and not the result. That currently
requires the user to have full access to the source of the kernel they
are tracing.
-- Steve