* Frederic Weisbecker <frede...@kernel.org> wrote: > When a CPU runs in full dynticks mode, a 1Hz tick remains in order to > keep the scheduler stats alive. However this residual tick is a burden > for bare metal tasks that can't stand any interruption at all, or want > to minimize them. > > The usual boot parameters "nohz_full=" or "isolcpus=nohz" will now > outsource these scheduler ticks to the global workqueue so that a > housekeeping CPU handles those remotely. The sched_class::task_tick() > implementations have been audited and look safe to be called remotely > as the target runqueue and its current task are passed in parameter > and don't seem to be accessed locally. > > Note that in the case of using isolcpus, it's still up to the user to > affine the global workqueues to the housekeeping CPUs through > /sys/devices/virtual/workqueue/cpumask or domains isolation > "isolcpus=nohz,domain". > > Signed-off-by: Frederic Weisbecker <frede...@kernel.org> > Cc: Chris Metcalf <cmetc...@mellanox.com> > Cc: Christoph Lameter <c...@linux.com> > Cc: Luiz Capitulino <lcapitul...@redhat.com> > Cc: Mike Galbraith <efa...@gmx.de> > Cc: Paul E. McKenney <paul...@linux.vnet.ibm.com> > Cc: Peter Zijlstra <pet...@infradead.org> > Cc: Rik van Riel <r...@redhat.com> > Cc: Thomas Gleixner <t...@linutronix.de> > Cc: Wanpeng Li <kernel...@gmail.com> > Cc: Ingo Molnar <mi...@kernel.org> > --- > kernel/sched/core.c | 91 > +++++++++++++++++++++++++++++++++++++++++++++++- > kernel/sched/isolation.c | 4 +++ > kernel/sched/sched.h | 2 ++ > 3 files changed, 96 insertions(+), 1 deletion(-) > > diff --git a/kernel/sched/core.c b/kernel/sched/core.c > index fc9fa25..5c0e8b6 100644 > --- a/kernel/sched/core.c > +++ b/kernel/sched/core.c > @@ -3120,7 +3120,94 @@ u64 scheduler_tick_max_deferment(void) > > return jiffies_to_nsecs(next - now); > } > -#endif > + > +struct tick_work { > + int cpu; > + struct delayed_work work; > +}; > + > +static struct tick_work __percpu *tick_work_cpu; > + > +static void sched_tick_remote(struct work_struct *work) > +{ > + struct delayed_work *dwork = to_delayed_work(work); > + struct tick_work *twork = container_of(dwork, struct tick_work, work); > + int cpu = twork->cpu; > + struct rq *rq = cpu_rq(cpu); > + struct rq_flags rf; > + > + /* > + * Handle the tick only if it appears the remote CPU is running > + * in full dynticks mode. The check is racy by nature, but > + * missing a tick or having one too much is no big deal.
I'd suggest pointing out why it's no big deal: * missing a tick or having one too much is no big deal, * because the scheduler tick updates statistics and checks * timeslices in a time-independent way, regardless of when * exactly it is running. > + */ > + if (!idle_cpu(cpu) && tick_nohz_tick_stopped_cpu(cpu)) { > + struct task_struct *curr; > + u64 delta; > + > + rq_lock_irq(rq, &rf); > + update_rq_clock(rq); > + curr = rq->curr; > + delta = rq_clock_task(rq) - curr->se.exec_start; > + /* Make sure we tick in a reasonable amount of time */ > + WARN_ON_ONCE(delta > (u64)NSEC_PER_SEC * 3); Please add a newline before the comment, and I'd also suggest this wording: /* Make sure the next tick runs within a reasonable amount of time: */ > + /* > + * Perform remote tick every second. The arbitrary frequence is > + * large enough to avoid overload and short enough to keep sched > + * internal stats alive. > + */ > + queue_delayed_work(system_unbound_wq, dwork, HZ); > +} Typo. I'd also suggest somewhat clearer wording: /* * Run the remote tick once per second (1Hz). This arbitrary * frequency is large enough to avoid overload but short enough * to keep scheduler internal stats reasonably up to date. */ > +#ifdef CONFIG_HOTPLUG_CPU > +static void sched_tick_stop(int cpu) > +{ > + struct tick_work *twork; > + > + if (housekeeping_cpu(cpu, HK_FLAG_TICK)) > + return; > + > + WARN_ON_ONCE(!tick_work_cpu); > + > + twork = per_cpu_ptr(tick_work_cpu, cpu); > + cancel_delayed_work_sync(&twork->work); > +} > +#endif /* CONFIG_HOTPLUG_CPU */ > + > +int __init sched_tick_offload_init(void) > +{ > + tick_work_cpu = alloc_percpu(struct tick_work); > + if (!tick_work_cpu) { > + pr_err("Can't allocate remote tick struct\n"); > + return -ENOMEM; Printing a warning is not enough. If tick_work_cpu ends up being NULL, then the tick will crash AFAICS, due to: > + twork = per_cpu_ptr(tick_work_cpu, cpu); > + cancel_delayed_work_sync(&twork->work); ... it's much better to crash straight away - i.e. we should use panic(). > +#else > +static void sched_tick_start(int cpu) { } > +static void sched_tick_stop(int cpu) { } > +#endif /* CONFIG_NO_HZ_FULL */ So if we are using #if/else/endif markers, please use them in the #else branch when it's so short, where they are actually useful: > +#else /* !CONFIG_NO_HZ_FULL: */ > +static void sched_tick_start(int cpu) { } > +static void sched_tick_stop(int cpu) { } > +#endif (also note the inversion) Thanks, Ingo