On Sat, Mar 13, 2021 at 12:38 AM Song Liu <songliubrav...@fb.com> wrote:
>
> >
> > On Mar 12, 2021, at 12:36 AM, Namhyung Kim <namhy...@kernel.org> wrote:
> >
> > Hi,
> >
> > On Fri, Mar 12, 2021 at 11:03 AM Song Liu <songliubrav...@fb.com> wrote:
> >>
> >> perf uses performance monitoring counters (PMCs) to monitor system
> >> performance. The PMCs are limited hardware resources. For example,
> >> Intel CPUs have 3x fixed PMCs and 4x programmable PMCs per cpu.
> >>
> >> Modern data center systems use these PMCs in many different ways:
> >> system level monitoring, (maybe nested) container level monitoring, per
> >> process monitoring, profiling (in sample mode), etc. In some cases,
> >> there are more active perf_events than available hardware PMCs. To allow
> >> all perf_events to have a chance to run, it is necessary to do expensive
> >> time multiplexing of events.
> >>
> >> On the other hand, many monitoring tools count the common metrics (cycles,
> >> instructions). It is a waste to have multiple tools create multiple
> >> perf_events of "cycles" and occupy multiple PMCs.
> >>
> >> bperf tries to reduce such wastes by allowing multiple perf_events of
> >> "cycles" or "instructions" (at different scopes) to share PMUs. Instead
> >> of having each perf-stat session to read its own perf_events, bperf uses
> >> BPF programs to read the perf_events and aggregate readings to BPF maps.
> >> Then, the perf-stat session(s) reads the values from these BPF maps.
> >>
> >> Please refer to the comment before the definition of bperf_ops for the
> >> description of bperf architecture.
> >
> > Interesting! Actually I thought about something similar before,
> > but my BPF knowledge is outdated. So I need to catch up but
> > failed to have some time for it so far. ;-)
> >
> >>
> >> bperf is off by default. To enable it, pass --use-bpf option to perf-stat.
> >> bperf uses a BPF hashmap to share information about BPF programs and maps
> >> used by bperf.
> >> This map is pinned to bpffs. The default address is
> >> /sys/fs/bpf/bperf_attr_map. The user could change the address with option
> >> --attr-map.
> >>
> >> ---
> >> Known limitations:
> >> 1. Do not support per cgroup events;
> >> 2. Do not support monitoring of BPF program (perf-stat -b);
> >> 3. Do not support event groups.
> >
> > In my case, per cgroup event counting is very important.
> > And I'd like to do that with lots of cpus and cgroups.
>
> We can easily extend this approach to support cgroups events. I didn't
> implement it to keep the first version simple.
OK.

>
> > So I'm working on an in-kernel solution (without BPF),
> > I hope to share it soon.
>
> This is interesting! I cannot wait to see how it looks like. I spent
> quite some time try to enable in kernel sharing (not just cgroup
> events), but finally decided to try BPF approach.

Well I found it hard to support generic event sharing that works
for all use cases. So I'm focusing on the per cgroup case only.

>
> >
> > And for event groups, it seems the current implementation
> > cannot handle more than one event (not even in a group).
> > That could be a serious limitation..
>
> It supports multiple events. Multiple events are independent, i.e.,
> "cycles" and "instructions" would use two independent leader programs.

OK, then do you need multiple bperf_attr_maps? Does it work for an
arbitrary number of events?

>
> >
> >>
> >> The following commands have been tested:
> >>
> >>   perf stat --use-bpf -e cycles -a
> >>   perf stat --use-bpf -e cycles -C 1,3,4
> >>   perf stat --use-bpf -e cycles -p 123
> >>   perf stat --use-bpf -e cycles -t 100,101
> >
> > Hmm... so it loads both leader and follower programs if needed, right?
> > Does it support multiple followers with different targets at the same time?
>
> Yes, the whole idea is to have one leader program and multiple follower
> programs. If we only run one of these commands at a time, it will load
> one leader and one follower. If we run multiple of them in parallel,
> they will share the same leader program and load multiple follower
> programs.
>
> I actually tested more than the commands above. The list actually means
> we support -a, -C, -p, and -t.
>
> Currently, this works for multiple events, and different parallel
> perf-stat. The two commands below will work well in parallel:
>
>   perf stat --use-bpf -e ref-cycles,instructions -a
>   perf stat --use-bpf -e ref-cycles,cycles -C 1,3,5
>
> Note the use of ref-cycles, which can only use one counter on Intel CPUs.
> With this approach, the above two commands will not do time multiplexing
> on ref-cycles.

Awesome!

Thanks,
Namhyung