> On Mar 3, 2018, at 5:39 AM, Jiri Olsa <jo...@redhat.com> wrote:
> 
> On Thu, Mar 01, 2018 at 11:53:21AM -0800, Song Liu wrote:
>> When there are more perf_events than hardware PMCs, perf rotates events
>> so that all events get a chance to run. Currently, the rotation works as:
>>  sched_out flexible_groups in cpuctx->ctx and cpuctx->task_ctx;
>>  rotate_left flexible_groups in cpuctx->ctx and cpuctx->task_ctx;
>>  try sched_in flexible_groups in cpuctx->ctx;
>>  try sched_in flexible_groups in cpuctx->task_ctx.
>> 
>> This approach has some potential issues:
>>  1. if different rotations of flexible_groups in cpuctx->ctx occupy
>>     all hardware PMCs, flexible_groups in cpuctx->task_ctx cannot run
>>     at all.
>>  2. if pinned_groups occupy all hardware PMCs, the rotation still triggers
>>     every perf_event_mux_interval_ms, but it cannot schedule any events.
>>  3. since flexible_groups in cpuctx->ctx and cpuctx->task_ctx are
>>     rotated separately, there are N x M possible combinations, which
>>     makes it hard to remember and reuse previously computed schedules.
>>     As a result, every rotation has to try sched_in the flexible_groups
>>     again.
>> 
>> This patch tries to do the rotation differently. Each perf_event in the
>> cpuctx (ctx and task_ctx) is assigned a rotation_id. The rotation_ids
>> are assigned during the first few rotations after any change in the
>> perf_events attached to the cpuctx. Once rotation_ids are assigned to
>> all events in the cpuctx, perf_rotate_context() simply picks the next
>> picks the next rotation to use, so there is no more "try to sched_in"
>> for future rotations.
>> 
>> Special rotation_ids are introduced to handle the issues above.
>> flexible_groups that conflict with pinned_groups are marked as
>> ALWAYS_OFF, so they are not rotated (fixes issue 2). flexible_groups
>> in cpuctx->ctx and cpuctx->task_ctx are rotated together, so they all get
>> an equal chance to run (improves issue 1).
> 
> hum, so the improvement is that cpuctx could eventually give
> up some space for task_ctx events, but both ctxs still rotate
> separately no? you keep rotation ID per single context..

With this approach, both ctxs are rotated together. It is possible that
cpuctx->ctx only has events for rotations 0 and 1, while cpuctx->task_ctx
only has events for rotations 2 and 3, but both of them will rotate among
0, 1, 2, 3.

num_rotations and curr_rotation could be part of cpuctx, as they are
eventually shared between the two contexts.
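To make that concrete, here is roughly what I have in mind (sketch only,
not what the RFC does today; num_rotations and curr_rotation are the fields
from the RFC, advance_rotation() is made up for illustration):

/*
 * Sketch: move the rotation state from perf_event_context into
 * perf_cpu_context, so that cpuctx->ctx and cpuctx->task_ctx step
 * through one shared sequence of rotation ids.
 */
struct perf_cpu_context {
        /* ... existing fields ... */
        int     num_rotations;  /* rotations created so far, both ctxs */
        int     curr_rotation;  /* rotation currently scheduled in */
};

/* perf_rotate_context() would advance one shared counter ... */
static void advance_rotation(struct perf_cpu_context *cpuctx)
{
        if (cpuctx->num_rotations)
                cpuctx->curr_rotation = (cpuctx->curr_rotation + 1) %
                                        cpuctx->num_rotations;
}

/* ... and both contexts then switch to cpuctx->curr_rotation. */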

>> 
>> With this approach, we only do the complex scheduling of flexible_groups
>> once. This enables us to do more complex scheduling, for example, sharing
>> PMU counters across compatible events:
>>   https://lkml.org/lkml/2017/12/1/410.
> 
> how could this code do that?

We still need a lot more work to get PMU sharing working. I think one of the
problems with Tejun's RFC is the more expensive scheduling. This RFC tries to
pre-compute all rotations, so we only need to do this expensive scheduling
once.

> 
>> 
>> There are also some potential downsides of this approach.
>> 
>> First, it gives all flexible_groups exactly the same chance to run, so it
>> may waste some PMC cycles. For example, if 5 groups, ABCDE, are assigned
>> to two rotations, rotation-0: ABCD and rotation-1: E, this approach will
>> NOT try any of ABCD in rotation-1.
>> 
>> Second, flexible_groups in cpuctx->ctx and cpuctx->task_ctx now have
>> exactly the same priority and an equal chance to run. I am not sure whether
>> this will change the behavior in some use cases.
>> 
>> Please kindly let me know whether this approach makes sense.
> 
> SNIP
> 
>> +/*
>> + * identify always_on and always_off groups in flexible_groups, call
>> + * group_sched_in() for always_on groups
>> + */
>> +static void ctx_pick_always_on_off_groups(struct perf_event_context *ctx,
>> +                                      struct perf_cpu_context *cpuctx)
>> +{
>> +    struct perf_event *event;
>> +
>> +    list_for_each_entry(event, &ctx->flexible_groups, group_entry) {
>> +            if (event->group_caps & PERF_EV_CAP_SOFTWARE) {
>> +                    event->rotation_id = PERF_ROTATION_ID_ALWAYS_ON;
>> +                    ctx->nr_sched++;
>> +                    WARN_ON(group_sched_in(event, cpuctx, ctx));
>> +                    continue;
>> +            }
>> +            if (group_sched_in(event, cpuctx, ctx)) {
>> +                    event->rotation_id = PERF_ROTATION_ID_ALWAYS_OFF;
>> +                    ctx->nr_sched++;
> 
> should ctx->nr_sched be incremented in the 'else' leg?

The else leg means the event does not conflict with pinned groups, so it 
will be scheduled later in ctx_add_rotation(). This function only handles
always_on and always_off events. 
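
For reference, this is how I think of the rotation_id states (the actual
definitions are not in the hunks quoted here, so take this as a summary
rather than the exact values):

/*
 * rotation_id life cycle (summary only):
 *
 *   PERF_ROTATION_ID_NOT_ASSGINED - not classified yet; ctx_add_rotation()
 *                                   will place the group in a numbered
 *                                   rotation later
 *   PERF_ROTATION_ID_ALWAYS_ON    - software-only group, scheduled in on
 *                                   every rotation
 *   PERF_ROTATION_ID_ALWAYS_OFF   - conflicts with pinned_groups, never
 *                                   scheduled in
 *   0, 1, 2, ...                  - the numbered rotation the group runs in
 */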

> 
>> +            }
>> +            group_sched_out(event, cpuctx, ctx);
>> +    }
>> +}
>> +
>> +/* add unassigned flexible_groups to new rotation_id */
>> +static void ctx_add_rotation(struct perf_event_context *ctx,
>> +                         struct perf_cpu_context *cpuctx)
>> {
>>      struct perf_event *event;
>> +    int group_added = 0;
>>      int can_add_hw = 1;
>> 
>> +    ctx->curr_rotation = ctx->num_rotations;
>> +    ctx->num_rotations++;
>> +
>>      list_for_each_entry(event, &ctx->flexible_groups, group_entry) {
>>              /* Ignore events in OFF or ERROR state */
>>              if (event->state <= PERF_EVENT_STATE_OFF)
>> @@ -3034,13 +3099,77 @@ ctx_flexible_sched_in(struct perf_event_context *ctx,
>>              if (!event_filter_match(event))
>>                      continue;
>> 
>> +            if (event->rotation_id != PERF_ROTATION_ID_NOT_ASSGINED)
>> +                    continue;
>> +
>>              if (group_can_go_on(event, cpuctx, can_add_hw)) {
>>                      if (group_sched_in(event, cpuctx, ctx))
>>                              can_add_hw = 0;
>> +                    else {
>> +                            event->rotation_id = ctx->curr_rotation;
>> +                            ctx->nr_sched++;
>> +                            group_added++;
> 
> group_added is not used

I should have removed it. Thanks!

> 
> SNIP
> 
>> static int perf_rotate_context(struct perf_cpu_context *cpuctx)
>> {
>> -    struct perf_event_context *ctx = NULL;
>> +    struct perf_event_context *ctx = cpuctx->task_ctx;
>>      int rotate = 0;
>> +    u64 now;
>> 
>> -    if (cpuctx->ctx.nr_events) {
>> -            if (cpuctx->ctx.nr_events != cpuctx->ctx.nr_active)
>> -                    rotate = 1;
>> -    }
>> -
>> -    ctx = cpuctx->task_ctx;
>> -    if (ctx && ctx->nr_events) {
>> -            if (ctx->nr_events != ctx->nr_active)
>> -                    rotate = 1;
>> -    }
>> +    if (!flexible_sched_done(cpuctx) ||
>> +        cpuctx->ctx.num_rotations > 1)
>> +            rotate = 1;
>> 
>>      if (!rotate)
>>              goto done;
>> @@ -3382,15 +3492,35 @@ static int perf_rotate_context(struct perf_cpu_context *cpuctx)
>>      perf_ctx_lock(cpuctx, cpuctx->task_ctx);
>>      perf_pmu_disable(cpuctx->ctx.pmu);
>> 
>> -    cpu_ctx_sched_out(cpuctx, EVENT_FLEXIBLE);
>> +    update_context_time(&cpuctx->ctx);
>>      if (ctx)
>> -            ctx_sched_out(ctx, cpuctx, EVENT_FLEXIBLE);
>> +            update_context_time(ctx);
>> +    update_cgrp_time_from_cpuctx(cpuctx);
>> 
>> -    rotate_ctx(&cpuctx->ctx);
>> +    ctx_switch_rotation_out(&cpuctx->ctx, cpuctx);
>>      if (ctx)
>> -            rotate_ctx(ctx);
>> +            ctx_switch_rotation_out(ctx, cpuctx);
>> 
>> -    perf_event_sched_in(cpuctx, ctx, current);
>> +    if (flexible_sched_done(cpuctx)) {
>> +            /* simply repeat previous calculated rotations */
>> +            ctx_switch_rotation_in(&cpuctx->ctx, cpuctx);
>> +            if (ctx)
>> +                    ctx_switch_rotation_in(ctx, cpuctx);
>> +    } else {
>> +            /* create new rotation */
>> +            ctx_add_rotation(&cpuctx->ctx, cpuctx);
>> +            if (ctx)
>> +                    ctx_add_rotation(ctx, cpuctx);
>> +    }
> 
> that seems messy.. could this be done just by setting
> the rotation ID and let perf_event_sched_in skip over
> different IDs and sched-in/set un-assigned events?

Yeah, that could be a cleaner implementation. 
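Something like the following is what I understand you are suggesting (sketch
only, not tested; it reuses the helpers from the RFC and assumes the special
rotation_ids can be distinguished from the numbered ones):

static void ctx_flexible_sched_in(struct perf_event_context *ctx,
                                  struct perf_cpu_context *cpuctx)
{
        struct perf_event *event;
        int can_add_hw = 1;

        list_for_each_entry(event, &ctx->flexible_groups, group_entry) {
                /* Ignore events in OFF or ERROR state */
                if (event->state <= PERF_EVENT_STATE_OFF)
                        continue;
                if (!event_filter_match(event))
                        continue;

                /* never schedule groups that conflict with pinned_groups */
                if (event->rotation_id == PERF_ROTATION_ID_ALWAYS_OFF)
                        continue;

                /* skip groups that belong to a different rotation */
                if (event->rotation_id != PERF_ROTATION_ID_NOT_ASSGINED &&
                    event->rotation_id != PERF_ROTATION_ID_ALWAYS_ON &&
                    event->rotation_id != ctx->curr_rotation)
                        continue;

                if (!group_can_go_on(event, cpuctx, can_add_hw))
                        continue;

                if (group_sched_in(event, cpuctx, ctx)) {
                        can_add_hw = 0;
                } else if (event->rotation_id == PERF_ROTATION_ID_NOT_ASSGINED) {
                        /* first successful sched-in assigns the rotation */
                        event->rotation_id = ctx->curr_rotation;
                        ctx->nr_sched++;
                }
        }
}

Then perf_rotate_context() would only need to advance curr_rotation and
sched_out/sched_in the flexible groups, without the explicit
ctx_switch_rotation_in()/ctx_add_rotation() split.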

Thanks,
Song

