Commit 66eb579e66ecfea5 ("perf: allow for PMU-specific event filtering") added pmu::filter_match. This was intended to avoid HW constraints on events from resulting in extremely pessimistic scheduling.
However, pmu::filter_match is only called for the leader of each event group. When the leader is a SW event, we do not filter the groups, and may fail at pmu::add time, and when this happens we'll give up on scheduling any event groups later in the list until they are rotated ahead of the failing group. This can result in extremely sub-optimal scheduling behaviour, e.g. if running the following on a big.LITTLE platform: $ taskset -c 0 ./perf stat \ -e 'a57{context-switches,armv8_cortex_a57/config=0x11/}' \ -e 'a53{context-switches,armv8_cortex_a53/config=0x11/}' \ ls <not counted> context-switches (0.00%) <not counted> armv8_cortex_a57/config=0x11/ (0.00%) 24 context-switches (37.36%) 57589154 armv8_cortex_a53/config=0x11/ (37.36%) Here the 'a53' event group was always eligible to be scheduled, but the a57 group never eligible to be scheduled, as the task was always affine to a Cortex-A53 CPU. The SW (group leader) event in the 'a57' group was eligible, but the HW event failed at pmu::add time, resulting in ctx_flexible_sched_in giving up on scheduling further groups with HW events. One way of avoiding this is to check pmu::filter_match on siblings as well as the group leader. If any of these fail their pmu::filter_match, we must skip the entire group before attempting to add any events. Cc: Alexander Shishkin <alexander.shish...@linux.intel.com> Cc: Arnaldo Carvalho de Melo <a...@kernel.org> Cc: Ingo Molnar <mi...@redhat.com> Cc: Peter Zijlstra <pet...@infradead.org> Cc: Will Deacon <will.dea...@arm.com> Cc: linux-kernel@vger.kernel.org Signed-off-by: Mark Rutland <mark.rutl...@arm.com> --- kernel/events/core.c | 22 +++++++++++++++++++++- 1 file changed, 21 insertions(+), 1 deletion(-) I've tried to find a better way of handling this (without needing to walk the siblings list), but so far I'm at a loss. At least it's "only" O(n) in the size of the sibling list we were going to walk anyway. I suspect that at a more fundamental level, I need to stop sharing a perf_hw_context between HW PMUs (i.e. replace task_struct::perf_event_ctxp with something that can handle multiple HW PMUs). From previous attempts I'm not sure if that's going to be possible. Any ideas appreciated! Mark. diff --git a/kernel/events/core.c b/kernel/events/core.c index 9c51ec3..c0b6db0 100644 --- a/kernel/events/core.c +++ b/kernel/events/core.c @@ -1678,12 +1678,32 @@ static bool is_orphaned_event(struct perf_event *event) return event->state == PERF_EVENT_STATE_DEAD; } -static inline int pmu_filter_match(struct perf_event *event) +static inline int __pmu_filter_match(struct perf_event *event) { struct pmu *pmu = event->pmu; return pmu->filter_match ? pmu->filter_match(event) : 1; } +/* + * Check whether we should attempt to schedule an event group based on + * PMU-specific filtering. An event group can consist of HW and SW events, + * potentially with a SW leader, so we must check all the filters to determine + * whether a group is schedulable. + */ +static inline int pmu_filter_match(struct perf_event *event) +{ + struct perf_event *child; + + if (!__pmu_filter_match(event)) + return 0; + + list_for_each_entry(child, &event->sibling_list, group_entry) + if (!__pmu_filter_match(child)) + return 0; + + return 1; +} + static inline int event_filter_match(struct perf_event *event) { -- 1.9.1