Hi Peter, > On Oct 15, 2018, at 3:09 PM, Song Liu <songliubrav...@fb.com> wrote: > > > >> On Oct 15, 2018, at 1:34 AM, Peter Zijlstra <pet...@infradead.org> wrote: >> >> On Mon, Oct 15, 2018 at 10:26:06AM +0300, Alexey Budankov wrote: >>> Hi, >>> >>> On 10.10.2018 13:45, Peter Zijlstra wrote: >>>> Hi all, >>>> >>>> There have been various issues and limitations with the way perf uses >>>> (task) contexts to track events. Most notable is the single hardware PMU >>>> task context, which has resulted in a number of yucky things (both >>>> proposed and merged). >>>> >>>> Notably: >>>> >>>> - HW breakpoint PMU >>>> - ARM big.little PMU >>>> - Intel Branch Monitoring PMU >>>> >>>> Since we now track the events in RB trees, we can 'simply' add a pmu >>>> order to them and have them grouped that way, reducing to a single >>>> context. Of course, reality never quite works out that simple, and below >>>> ends up adding an intermediate data structure to bridge the context -> >>>> pmu mapping. >>>> >>>> Something a little like: >>>> >>>> ,------------------------[1:n]---------------------. >>>> V V >>>> perf_event_context <-[1:n]-> perf_event_pmu_context <--- perf_event >>>> ^ ^ | | >>>> `--------[1:n]---------' `-[n:1]-> pmu <-[1:n]-' >>>> >>>> This patch builds (provided you disable CGROUP_PERF), boots and survives >>>> perf-top without the machine catching fire. >>>> >>>> There's still a fair bit of loose ends (look for XXX), but I think this >>>> is the direction we should be going. >>>> >>>> Comments? >>>> >>>> Not-Quite-Signed-off-by: Peter Zijlstra (Intel) <pet...@infradead.org> >>>> --- >>>> arch/powerpc/perf/core-book3s.c | 4 >>>> arch/x86/events/core.c | 4 >>>> arch/x86/events/intel/core.c | 6 >>>> arch/x86/events/intel/ds.c | 6 >>>> arch/x86/events/intel/lbr.c | 16 >>>> arch/x86/events/perf_event.h | 6 >>>> include/linux/perf_event.h | 80 +- >>>> include/linux/sched.h | 2 >>>> kernel/events/core.c | 1412 >>>> ++++++++++++++++++++-------------------- >>>> 9 files changed, 815 insertions(+), 721 deletions(-) >>> >>> Rewrite is impressive however it doesn't result in code base reduction as >>> it is. >> >> Yeah.. that seems to be nature of these things .. >> >>> Nonetheless there is a clear demand for per pmu events groups tracking and >>> rotation >>> in single cpu context (HW breakpoints, ARM big.little, Intel LBRs) and >>> there is >>> a supply thru groups ordering on RB-tree. >>> >>> This might be driven into the kernel by some new Perf features that would >>> base on >>> that RB-tree groups ordering or by refactoring of existing code but in the >>> way it >>> would result in overall code base reduction thus lowering support cost. >> >> If you have a concrete suggestion on how to reduce complexity? I tried, >> but couldn't find any (without breaking something). >> >> The active lists and pmu_ctx_list could arguably be replaced with >> (slower) iteratons over the RB tree, but you'll still need the per pmu >> nr_events/nr_active counts to determine if rotation is required at all. >> >> And like you know, performance is quite important here too. I'd love to >> reduce complexity while maintaining or improve performance, but that >> rarely if ever happens :/ > > How about this: > > 1. Keep multiple perf_cpu_context per CPU, just like before this patch. > > 2. For perf_event_context, add PMU as an order for the RB tree. > > 3. (hw) pmu->perf_cpu_context->ctx only has events for this PMU (and sw > events moved to this context). > > 4. task->perf_event_ctxp has events for all PMUs. > > With this path, we keep the existing perf_cpu_context/perf_event_context > logic as-is, which I think is simpler than the new logic (with extra > *_pmu_context). And it should also solve the problem. > > Does this make sense? If this doesn't look too broken, I am happy to > draft RFC for it. >
I am not sure whether you missed this one, or found it totally insane. Could you please share your comments on it? My gut feeling is that this would be a simpler patch to solve the problem (two hw PMUs). (It might be less efficient though). Thanks, Song