On Tue, Sep 01, 2020 at 12:46:41PM +0200, Frederic Weisbecker wrote:
> Hi,
>
> I'm currently working on making nohz_full/nohz_idle runtime-toggleable,
> and some other people seem to be interested as well. So I've dumped
> a few thoughts about some prerequisites to achieve that, for those
> interested.
>
> As you can see, there is a bit of hard work in the way. I'm iterating
> on that in https://pad.kernel.org/p/isolation, feel free to edit:
>
>
> == RCU nocb ==
>
> Currently controllable with the "rcu_nocbs=" boot parameter and/or
> through nohz_full=/isolcpus=nohz.
> We need to make it toggleable at runtime. Currently handling that:
> v1: https://lwn.net/Articles/820544/
> v2: coming soon
Looking forward to seeing it!

> == TIF_NOHZ ==
>
> We need to get rid of that in order not to trigger the syscall slowpath
> on CPUs that don't want nohz_full.
> Also we don't want to iterate over all threads and clear the flag when
> the last nohz_full CPU exits nohz_full mode. Prefer static keys to call
> context tracking on archs. x86 does that well.

Would it help if RCU were able to, on a per-CPU basis, distinguish
between nohz_full userspace execution on the one hand and idle-loop
execution on the other? Or do you have some other trick in mind?

							Thanx, Paul

> == Proper entry code ==
>
> We must make sure that a given arch never calls exception_enter() /
> exception_exit().
> These save the previous state of context tracking and switch to kernel
> mode (from the context tracking POV) temporarily. Since this state is
> saved on the stack, it prevents us from turning off context tracking
> entirely on a CPU: the tracking must be done on all CPUs, and that
> takes some cycles.
>
> This means that, considering early entry code (before the call to
> context tracking upon kernel entry, and after the call to context
> tracking upon kernel exit), we must take care of a few things:
>
> 1) Make sure early entry code can't trigger exceptions. Or, if it
> does, the given exception can't schedule or use RCU (unless it calls
> rcu_nmi_enter()). Otherwise the exception must call
> exception_enter()/exception_exit(), which we don't want.
>
> 2) No call to schedule_user().
>
> 3) Make sure early entry code is not interruptible, or
> preempt_schedule_irq() would rely on
> exception_enter()/exception_exit().
>
> 4) Make sure early entry code can't be traced (no call to
> preempt_schedule_notrace()), or if it is, it can't schedule.
>
> I believe x86 does most of that well. In the end we should remove the
> exception_enter()/exit() implementations in x86 and replace them with
> a check that makes sure the context_tracking state is not in USER.
> An arch meeting all the above conditions would earn a
> CONFIG_ARCH_HAS_SANE_CONTEXT_TRACKING. Being able to toggle nohz_full
> at runtime would depend on that.
>
>
> == Cputime accounting ==
>
> Both the write and read sides must switch to tick-based accounting and
> drop the use of the seqlock in task_cputime(), task_gtime(),
> kcpustat_field() and kcpustat_cpu_fetch(). A special ordering/state
> machine is required to do that without races.
>
> == Nohz ==
>
> Switch from nohz_full to nohz_idle. Mind a few details:
>
> 1) Turn off the 1Hz offloaded tick handled in housekeeping.
> 2) Handle tick dependencies; take care of racing CPUs setting/clearing
> a tick dependency. It's much trickier when we switch from nohz_idle to
> nohz_full.
>
> == Unbound affinity ==
>
> Restore wide affinity for kernel threads, workqueues, timers, etc. But
> take care of cpumasks that have been set through other interfaces:
> sysfs, procfs, etc.