On Wed, May 17, 2017 at 12:40:10PM +0200, Peter Zijlstra wrote:
> On Tue, May 16, 2017 at 07:27:42AM -0700, Paul E. McKenney wrote:
> > On Tue, May 16, 2017 at 05:46:06AM -0700, Paul E. McKenney wrote:
> > > Something like this, yes.  Maybe even exactly like this.  ;-)
> > 
> > Ah, one thing I forgot...  If you are avoiding use of get_online_cpus(),
> > you usually also have to be very careful about how you use things like
> > cpu_online() and cpu_is_offline().
> 
> OK, so I think I got it wrong there.  This hunk should close any race
> between perf_pmu_register() and hotplug by tracking a global state
> protected by pmus_lock, thereby ensuring the cpuctx->online state gets
> initialized right.
Looks much better!

> --- a/kernel/events/core.c
> +++ b/kernel/events/core.c
> @@ -8993,6 +8993,8 @@ static int pmu_dev_alloc(struct pmu *pmu
>  static struct lock_class_key cpuctx_mutex;
>  static struct lock_class_key cpuctx_lock;
>  
> +static cpumask_var_t perf_online_mask;
> +
>  int perf_pmu_register(struct pmu *pmu, const char *name, int type)
>  {
>          int cpu, ret;
> @@ -9056,7 +9058,7 @@ int perf_pmu_register(struct pmu *pmu, c
>                  lockdep_set_class(&cpuctx->ctx.mutex, &cpuctx_mutex);
>                  lockdep_set_class(&cpuctx->ctx.lock, &cpuctx_lock);
>                  cpuctx->ctx.pmu = pmu;
> -                cpuctx->online = cpu_online(cpu);
> +                cpuctx->online = cpumask_test_cpu(cpu, perf_online_mask);
>  
>                  __perf_mux_hrtimer_init(cpuctx, cpu);
>          }
> @@ -10952,6 +10954,8 @@ static void __init perf_event_init_all_c
>          struct swevent_htable *swhash;
>          int cpu;
>  
> +        zalloc_cpumask_var(&perf_online_mask, GFP_KERNEL);
> +
>          for_each_possible_cpu(cpu) {
>                  swhash = &per_cpu(swevent_htable, cpu);
>                  mutex_init(&swhash->hlist_mutex);
> @@ -11011,6 +11015,7 @@ static void perf_event_exit_cpu_context(
>                  cpuctx->online = 0;
>                  mutex_unlock(&ctx->mutex);
>          }
> +        cpumask_clear_cpu(cpu, perf_online_mask);
>          mutex_unlock(&pmus_lock);
>  }
>  #else
> @@ -11028,6 +11033,7 @@ int perf_event_init_cpu(unsigned int cpu
>          perf_swevent_init_cpu(cpu);
>  
>          mutex_lock(&pmus_lock);
> +        cpumask_set_cpu(cpu, perf_online_mask);
>          list_for_each_entry(pmu, &pmus, entry) {
>                  cpuctx = per_cpu_ptr(pmu->pmu_cpu_context, cpu);
>                  ctx = &cpuctx->ctx;

> > > > ---
> > > > --- a/kernel/events/core.c
> > > > +++ b/kernel/events/core.c
> > > > @@ -8997,7 +8997,6 @@ int perf_pmu_register(struct pmu *pmu, c
> > > >  {
> > > >          int cpu, ret;
> > > >  
> > > > -        get_online_cpus();
> > > >          mutex_lock(&pmus_lock);
> > > >          ret = -ENOMEM;
> > > >          pmu->pmu_disable_count = alloc_percpu(int);
> > > 
> > > There is usually also some state check in here somewhere for the CPU
> > > being offline from a perf perspective.  Such a check might already
> > > exist, but I must plead ignorance of perf.
> 
> This just allocates per-cpu storage, which is by definition on the
> possible mask and unrelated to the online mask.

Got it!

> > > > @@ -9093,7 +9092,6 @@ int perf_pmu_register(struct pmu *pmu, c
> > > >          ret = 0;
> > > >  unlock:
> > > >          mutex_unlock(&pmus_lock);
> > > > -        put_online_cpus();
> > > >  
> > > >          return ret;
> > > > 
> > > > @@ -11002,10 +11000,9 @@ static void perf_event_exit_cpu_context(
> > > >          struct perf_cpu_context *cpuctx;
> > > >          struct perf_event_context *ctx;
> > > >          struct pmu *pmu;
> > > > -        int idx;
> > > > 
> > > > -        idx = srcu_read_lock(&pmus_srcu);
> > > > -        list_for_each_entry_rcu(pmu, &pmus, entry) {
> > > > +        mutex_lock(&pmus_lock);
> > > 
> > > If the state change checked for by perf_pmu_register() also needs to
> > > be guarded by ctx->mutex, this looks right to me.
> 
> Right, so we have two locks: pmus_lock, which serializes hotplug vs.
> perf_pmu_register(), and ctx->lock, which serializes find_get_context()
> vs. hotplug.
> 
> Together they ensure that if a PMU is observed, the PMU's cpuctx's have
> the correct ->online state.

Sounds good, and now the added pmus_lock allows dropping get_online_cpus().
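To make the two-lock pattern concrete, here is a minimal sketch of its
shape.  All names below (my_lock, my_online_mask, my_units, my_unit,
my_register(), and the hotplug callbacks) are made up for illustration,
so this shows only the shape of the scheme, not the actual perf code:

        /* Each registered unit keeps an ->online flag per CPU. */
        struct my_state {
                int online;
        };

        struct my_unit {
                struct list_head entry;
                struct my_state __percpu *state;
        };

        /*
         * Subsystem-private view of which CPUs are online, allocated
         * at boot with zalloc_cpumask_var(), as in the patch above.
         */
        static cpumask_var_t my_online_mask;
        static DEFINE_MUTEX(my_lock);   /* Plays the role of pmus_lock. */
        static LIST_HEAD(my_units);

        /* Hotplug-online callback, e.g. hooked up via cpuhp_setup_state(). */
        static int my_cpu_online(unsigned int cpu)
        {
                mutex_lock(&my_lock);
                cpumask_set_cpu(cpu, my_online_mask);
                /* ... bring this subsystem's per-CPU state online ... */
                mutex_unlock(&my_lock);
                return 0;
        }

        /* Hotplug-offline callback: clear the bit under the same lock. */
        static int my_cpu_offline(unsigned int cpu)
        {
                mutex_lock(&my_lock);
                /* ... take this subsystem's per-CPU state offline ... */
                cpumask_clear_cpu(cpu, my_online_mask);
                mutex_unlock(&my_lock);
                return 0;
        }

        /* Registration consults only the private mask, never cpu_online(). */
        static int my_register(struct my_unit *unit)
        {
                int cpu;

                mutex_lock(&my_lock);
                for_each_possible_cpu(cpu)
                        per_cpu_ptr(unit->state, cpu)->online =
                                cpumask_test_cpu(cpu, my_online_mask);
                list_add(&unit->entry, &my_units);
                mutex_unlock(&my_lock);
                return 0;
        }

Because the hotplug callbacks and my_register() serialize on my_lock, a
registration runs entirely before or entirely after any given hotplug
transition, so its per-CPU ->online flags always agree with the
subsystem's own notion of which CPUs are online, which is exactly the
property the perf_online_mask hunk establishes with pmus_lock.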
> > > Just for completeness, the other style is to maintain separate per-CPU
> > > state, in which case you would instead acquire pmus_lock, mark this
> > > CPU off-limits to more perf_pmu_register() usage, release pmus_lock,
> > > then clean up any old usage.
> 
> I'm not immediately seeing the other style, but per the above, I need
> that too.  Because the previous could race against hotplug if
> perf_pmu_register() were to observe cpu_online() as set after
> perf_event_exit_cpu() or something.
> 
> With the above change any chance of a race is gone and we don't need to
> worry about hotplug ordering too much.

Nice!!!

> Now I just need to do something about the swevent hash, because that's
> got a hole in it now.

On the styles, here is a more coherent list:

1.	Use get_online_cpus(), but no CPU-hotplug notifiers.  You can use
	cpu_online() and cpu_is_offline() straightforwardly while
	get_online_cpus() is in effect, but these two may only be used
	for statistical/heuristic purposes otherwise.

2a.	Use CPU-hotplug notifiers, but not get_online_cpus().  The
	notifiers maintain some sort of bitmask tracking which CPUs are
	present from the viewpoint of this subsystem.  This bitmask
	provides exact CPU presence/absence indications, at least
	assuming the appropriate lock is held.  The cpu_online() and
	cpu_is_offline() macros should be avoided, except possibly for
	statistical/heuristic purposes.

	For your perf patch, the bitmask is a cpumask_var_t.  For RCU,
	it is the combination of the ->qsmaskinitnext fields of the
	leaf rcu_node structures.

2b.	Like 2a, except that instead of a bitmask, the CPU online/offline
	information is implicit in other data structures.  For example,
	per-CPU structures might be allocated only when the corresponding
	CPU is online, so that a given per-CPU pointer is non-NULL iff
	the corresponding CPU is online.  (See the sketch in the P.S.
	below.)

So I was (confusingly) initially using "style" to distinguish #1 on the
one hand from #2a and #2b on the other.  Later on, I was using "style"
to distinguish #2a from #2b.

At this point, I am not sure that it makes all that much sense to
distinguish 2a from 2b.  Or you could argue that use of a cpumask_var_t
is its own substyle, with hand-crafted bitmasks (such as RCU's) being
separate substyles.  Can't say that I care all that much about such
fine-grained gradations.  ;-)

							Thanx, Paul
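P.S.  To make the list above concrete: style #1 is simply

	get_online_cpus();
	if (cpu_online(cpu))
		do_something_to(cpu);	/* CPU cannot go offline here. */
	put_online_cpus();

(do_something_to() being a stand-in for whatever the subsystem does),
and a minimal sketch of style 2b, again with invented names (my_lock,
my_state, my_state_ptr) rather than code quoted from any real
subsystem, might look like this:

	struct my_state {
		int whatever;	/* Whatever per-CPU state is needed. */
	};

	static DEFINE_MUTEX(my_lock);
	static DEFINE_PER_CPU(struct my_state *, my_state_ptr);

	/* CPU-online callback: allocate, then publish under my_lock. */
	static int my_cpu_online(unsigned int cpu)
	{
		struct my_state *s = kzalloc(sizeof(*s), GFP_KERNEL);

		if (!s)
			return -ENOMEM;
		mutex_lock(&my_lock);
		per_cpu(my_state_ptr, cpu) = s;
		mutex_unlock(&my_lock);
		return 0;
	}

	/* CPU-offline callback: unpublish under my_lock, then free. */
	static int my_cpu_offline(unsigned int cpu)
	{
		struct my_state *s;

		mutex_lock(&my_lock);
		s = per_cpu(my_state_ptr, cpu);
		per_cpu(my_state_ptr, cpu) = NULL;
		mutex_unlock(&my_lock);
		kfree(s);
		return 0;
	}

	/* Anywhere my_lock is held, non-NULL means online. */
	static bool my_cpu_is_online(int cpu)
	{
		lockdep_assert_held(&my_lock);
		return per_cpu(my_state_ptr, cpu) != NULL;
	}

There is no separate bitmask here: the NULL-ness of the per-CPU pointer
is itself the subsystem's online/offline indication, which is what
distinguishes 2b from 2a.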