On Fri, Sep 21, 2018 at 3:15 PM Alexei Starovoitov <alexei.starovoi...@gmail.com> wrote: > > On Fri, Sep 21, 2018 at 09:25:00AM -0300, Arnaldo Carvalho de Melo wrote: > > > > > I have considered adding MUNMAP to match existing MMAP, but went > > > without it because I didn't want to introduce new bit in perf_event_attr > > > and emit these new events in a misbalanced conditional way for prog > > > load/unload. > > > Like old perf is asking kernel for mmap events via mmap bit, so prog load > > > events > > > > By prog load events you mean that old perf, having perf_event_attr.mmap = 1 > > || > > perf_event_attr.mmap2 = 1 will cause the new kernel to emit > > PERF_RECORD_MMAP records for the range of addresses that a BPF program > > is being loaded on, right? > > right. it would be weird when prog load events are there, but not unload. > > > > will be in perf.data, but old perf report won't recognize them anyway. > > > > Why not? It should lookup the symbol and find it in the rb_tree of maps, > > with a DSO name equal to what was in the PERF_RECORD_MMAP emitted by the > > BPF core, no? It'll be an unresolved symbol, but a resolved map. > > > > > Whereas new perf would certainly want to catch bpf events and will set > > > both mmap and mumap bits. > > > > new perf with your code will find a symbol, not a map, because your code > > catches a special case PERF_RECORD_MMAP and instead of creating a > > 'struct map' will create a 'struct symbol' and insert it in the kallsyms > > 'struct map', right? > > right. > bpf progs are more similar to kernel functions than to modules. > For modules it makes sense to create a new map and insert symbols into it. > For bpf JITed images there is no DSO to parse. > Single bpf elf file may contain multiple bpf progs and each prog may contain > multiple bpf functions. They will be loaded at different time and > will have different life time. > > > In theory the old perf should catch the PERF_RECORD_MMAP with a string > > in the filename part and insert a new map into the kernel mmap rb_tree, > > and then samples would be resolved to this map, but since there is no > > backing DSO with a symtab, it would stop at that, just stating that the > > map is called NAME-OF-BPF-PROGRAM. This is all from memory, possibly > > there is something in there that makes it ignore this PERF_RECORD_MMAP > > emitted by the BPF kernel code when loading a new program. > > In /proc/kcore there is already a section for module range. > Hence when perf processes bpf load/unload events the map is already created. > Therefore the patch 3 only searches for it and inserts new symbol into it. > > In that sense the reuse of RECORD_MMAP event for bpf progs is indeed > not exactly clean, since no new map is created. > It's probably better to introduce PERF_RECORD_[INSERT|ERASE]_KSYM events ? > > Such event potentially can be used for offline ksym resolution. > perf could process /proc/kallsyms during perf record and emit all of them > as synthetic PERF_RECORD_INSERT_KSYM into perf.data, so perf report can run > on a different server and still find the right symbols. > > I guess, we can do bpf specific events too and keep RECORD_MMAP as-is. > How about single PERF_RECORD_BPF event with internal flag for load/unload ? > > > Right, that is another unfortunate state of affairs, kernel module > > load/unload should already be supported, reported by the kernel via a > > proper PERF_RECORD_MODULE_LOAD/UNLOAD > > I agree with Peter here. It would nice, but low priority. > modules are mostly static. Loaded once and stay there. > > > There is another longstanding TODO list entry: PERF_RECORD_MMAP records > > should include a build-id, to avoid either userspace getting confused > > when there is an update of some mmap DSO, for long running sessions, for > > instance, or to have to scan the just recorded perf.data file for DSOs > > with samples to then read it from the file system (more races). > > > > Have you ever considered having a build-id for bpf objects that could be > > used here? > > build-id concept is not applicable to bpf. > bpf elf files on the disc don't have good correlation with what is > running in the kernel. bpf bytestream is converted and optimized > by the verifier. Then JITed. > So debug info left in .o file and original bpf bytestream in .o are > mostly useless. > For bpf programs we have 'program tag'. It is computed over original > bpf bytestream, so both kernel and user space can compute it. > In libbcc we use /var/tmp/bcc/bpf_prog_TAG/ directory to store original > source code of the program, so users looking at kernel stack traces > with bpf_prog_TAG can find the source. > It's similar to build-id, but not going to help perf to annotate > actual x86 instructions inside JITed image and show src code. > Since JIT runs in the kernel this problem cannot be solved by user space only. > It's a difficult problem and we have a plan to tackle that, > but it's step 2. A bunch of infra is needed on bpf side to > preserve the association during src_in_C -> original bpf insns -> > translated bpf insns -> JITed asm. > Then report it back to user space and then teach perf to properly annotate > progs. > > > Will the JITed code from some BPF bytecode be different if the same > > bytecode is JITed in several machines, all having the exact same > > hardware? > > Yes. JITed code will be different depending on sysctl knobs (like const > blinding) > So the same original bpf byte stream loaded at different times > may have different JITed image. > > Even without security features like blinding the JIT can be different. > the bpf maps are separate from bpf progs. The bpf map is created first. > Then the same bpf instruction stream (depending on map type that it uses) > may be optimized by the verifier differently causing difference in JIT. > > > > (instead of passing kallsyms's bpf prog name in event->mmap.filename) > > > but bpf functions don't have their own prog_id. Multiple bpf funcs > > > with different JITed blobs are considered to be a part of single prog_id. > > > So as a step one I'm only extending RECORD_MMAP with addr and kallsym > > > name of bpf function/prog. > > > As a step two the plan is to add notification mechanism for prog > > > load/unload > > > that will include prog_id and design new synthetic RECORD_* events in > > > perf user space that will contain JITed code, line info, BTF, etc. > > > > So, will the kernel JIT a bytecode, load it somewhere and run it, then, > > when unloading it, keep it somewhere (a filesystem with some limits, > > etc) so that at some later time (with some timeouts) tooling can, using > > its id/buildid cookie get the contents and symbol table to have a better > > annotation experience? > > yes. The plan is to let perf fetch JITed image via BPF_OBJ_GET_INFO_BY_FD cmd > during perf record run and store it inside perf.data in synthetic records. > Then perf report can annotate it later.
Hi Peter and Arnaldo, I am working with Alexei on the idea of fetching BPF program information via BPF_OBJ_GET_INFO_BY_FD cmd. I added PERF_RECORD_BPF_EVENT to perf_event_type, and dumped these events to perf event ring buffer. I found that perf will not process event until the end of perf-record: root@virt-test:~# ~/perf record -ag -- sleep 10 ...... 10 seconds later [ perf record: Woken up 34 times to write data ] machine__process_bpf_event: prog_id 6 loaded machine__process_bpf_event: prog_id 6 unloaded [ perf record: Captured and wrote 9.337 MB perf.data (93178 samples) ] In this example, the bpf program was loaded and then unloaded in another terminal. When machine__process_bpf_event() processes the load event, the bpf program is already unloaded. Therefore, machine__process_bpf_event() will not be able to get information about the program via BPF_OBJ_GET_INFO_BY_FD cmd. To solve this problem, we will need to run BPF_OBJ_GET_INFO_BY_FD as soon as perf get the event from kernel. I looked around the perf code for a while. But I haven't found a good example where some events are processed before the end of perf-record. Could you please help me with this? Thanks, Song