On Wed, May 13, 2020 at 07:35:25AM +0300, Konstantin Khlebnikov wrote:
> On 13/05/2020 06.18, Paul E. McKenney wrote:
> > On Wed, May 13, 2020 at 11:32:38AM +1000, Dave Chinner wrote:
> > > On Sat, May 09, 2020 at 09:09:00AM -0700, Paul E. McKenney wrote:
> > > > On Sat, May 09, 2020 at 11:54:40AM +0300, Konstantin Khlebnikov wrote:
> > > > > On 08/05/2020 17.46, Paul E. McKenney wrote:
> > > > > > Easy for me to provide "start fast and inefficient mode" and
> > > > > > "stop fast and inefficient mode" APIs for MM to call!
> > > > > >
> > > > > > How about rcu_mempressure_start() and rcu_mempressure_end()?
> > > > > > I would expect them not to nest (as in if you need them to
> > > > > > nest, please let me know).  I would not expect these to be
> > > > > > invoked all that often (as in if you do need them to be fast
> > > > > > and scalable, please let me know).  RCU would then be in
> > > > > > fast/inefficient mode if either MM told it to be or if RCU
> > > > > > had detected callback overload on at least one CPU.
> > > > > >
> > > > > > Seem reasonable?
> > > > >
> > > > > Not exactly nested calls, but kswapd threads are per NUMA node.
> > > > > So, at some level, nodes under pressure must be counted.
> > > >
> > > > Easy enough, especially given that RCU already "counts" CPUs
> > > > having excessive numbers of callbacks.  But assuming that the
> > > > transitions to/from OOM are rare, I would start by just counting
> > > > them with a global counter.  If the counter is non-zero, RCU is
> > > > in fast and inefficient mode.
> > > >
> > > > > Also, forcing RCU callbacks only for CPUs in one NUMA node
> > > > > might be useful.
> > > >
> > > > Interesting.  RCU currently evaluates a given CPU by comparing
> > > > the number of callbacks against a fixed cutoff that can be set at
> > > > boot using rcutree.qhimark, which defaults to 10,000.  When this
> > > > cutoff is exceeded, RCU becomes more aggressive about invoking
> > > > callbacks on that CPU, for example, by sacrificing some degree of
> > > > real-time response.  I believe that this heuristic would also
> > > > serve the OOM use case well.
> > >
> > > So one of the things that I'm not sure people have connected here
> > > is that memory reclaim done by shrinkers is one of the things that
> > > drives huge numbers of call_rcu() callbacks to free memory via RCU.
> > > If we are reclaiming dentries and inodes, then we can be pushing
> > > thousands to hundreds of thousands of objects into kfree_rcu()
> > > and/or direct call_rcu() calls to free these objects in a single
> > > reclaim pass.
> >
> > Good point!
>
> Indeed
>
> > > Hence the trigger for RCU going into "excessive callback" mode
> > > might, in fact, be kswapd running a pass over the shrinkers, i.e.
> > > memory reclaim itself can be responsible for pushing RCU into this
> > > "OOM pressure" situation.
> > >
> > > So perhaps we've missed a trick here by not having the memory
> > > reclaim routines trigger RCU callbacks at the end of a priority
> > > scan.  The shrinkers have queued the objects for freeing, but they
> > > haven't actually been freed yet, and so things like slab pages
> > > haven't actually been returned to the free pool even though the
> > > shrinkers have said "freed this many objects"...
> > >
> > > i.e. perhaps the right solution here is a "rcu_run_callbacks()"
> > > function that memory reclaim calls before backing off and/or
> > > winding up reclaim priority.
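To keep my earlier proposal concrete before digging into that: below is
a minimal sketch of the start/end API with a global counter.  The names
and the mode check exist only in this email, not in current RCU, and
the counter (rather than a flag) is what handles the per-node kswapd
callers:

	/* Sketch only, not actual RCU internals. */
	static atomic_t rcu_mempressure_count = ATOMIC_INIT(0);

	void rcu_mempressure_start(void)	/* e.g., from each kswapd */
	{
		atomic_inc(&rcu_mempressure_count);
	}

	void rcu_mempressure_end(void)
	{
		atomic_dec(&rcu_mempressure_count);
	}

	/* Checked alongside the per-CPU rcutree.qhimark comparison. */
	static bool rcu_mempressure_active(void)
	{
		return atomic_read(&rcu_mempressure_count) > 0;
	}

RCU would then stay in fast/inefficient mode until the last node under
pressure invoked rcu_mempressure_end().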
> > It would not be hard to make something that put RCU into
> > fast/inefficient mode for a couple of grace periods.  I will also
> > look into the possibility of speeding up callback invocation.
> >
> > It might also make sense to put RCU grace periods into fast mode
> > while running the shrinkers that are freeing dentries and inodes.
> > However, kbuild test robot reports ugly regressions when putting RCU
> > into fast/inefficient mode too quickly and too often.  As in 78.5%
> > degradation on one of the benchmarks.
>
> I think fast/inefficient mode here is just an optimization for
> freeing memory faster.  It doesn't solve the problem itself.
>
> First we have to close the loop in the reclaimer and actually wait
> for, or run, the RCU callbacks which might free memory, before
> increasing the priority and invoking the OOM killer.
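> As a sketch of the shape I mean (the hook point and the loop details
> are illustrative only, not a real patch):
>
> 	/* Sketch: a do_try_to_free_pages()-style priority loop. */
> 	do {
> 		shrink_zones(zonelist, sc);	/* queues many RCU frees */
> 		if (sc->nr_reclaimed >= sc->nr_to_reclaim)
> 			break;
> 		/*
> 		 * Close the loop here: wait for, or run, the RCU
> 		 * callbacks queued by the shrinkers above before
> 		 * treating this priority level as a failure.
> 		 */
> 	} while (--sc->priority >= 0);
> 	/* Only after that should the OOM killer be considered. */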
That is easy, just invoke rcu_barrier(), which will block until all
prior call_rcu()/kfree_rcu() callbacks have been invoked.

> > > > > I wonder if direct-reclaim should at some stage simply wait
> > > > > for RCU QS.  I.e. call rcu_barrier() or similar somewhere
> > > > > before invoking OOM.
> > > >
> > > > The rcu_oom_count() function in the patch starting this thread
> > > > returns the total number of outstanding callbacks queued on all
> > > > CPUs.  So one approach would be to invoke this function, and if
> > > > the return value was truly huge (taking the size of memory and
> > > > who knows what all else into account), do the rcu_barrier() to
> > > > wait for RCU to clear its current backlog.
> > >
> > > The shrinker scan control structure has a node mask in it to
> > > indicate what node (and hence CPUs) it should be reclaiming from.
> > > This information comes from the main reclaim scan routine, so it
> > > would be trivial to feed straight into the RCU code to have it act
> > > on just the CPUs/node that we are reclaiming memory from...
> >
> > For the callbacks, RCU can operate on CPUs, in theory anyway.  The
> > grace period itself, however, is inherently global.
> >
> > > > On the NUMA point, it would be dead easy for me to supply a
> > > > function that returned the number of callbacks on a given CPU,
> > > > which would allow you to similarly evaluate a NUMA node, a
> > > > cgroup, or whatever.
> > >
> > > I'd think it runs the other way around - we optimistically call
> > > the RCU layer to do cleanup, and the RCU layer decides if there
> > > are enough queued callbacks on the cpus/node to run callbacks
> > > immediately.  It would even be provided with the scan priority to
> > > indicate the level of desperation memory reclaim is under....
> >
> > Easy for RCU to count the number of callbacks.  That said, it has
> > no idea which callbacks are which.  Perhaps kfree_rcu() could gather
> > that information from the slab allocator, though.
>
> It's simple to mark slab shrinkers that free objects through RCU and
> count the freed objects in the reclaimer:
>
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -536,6 +536,9 @@ static unsigned long do_shrink_slab(struct shrink_control *shrinkctl,
>  	else
>  		new_nr = atomic_long_read(&shrinker->nr_deferred[nid]);
>  
> +	if (shrinker->flags & SHRINKER_KFREE_RCU)
> +		shrinkctl->nr_kfree_rcu += freed;
> +
>  	trace_mm_shrink_slab_end(shrinker, nid, freed, nr, new_nr,
>  				 total_scan);
>  	return freed;
>  }
>
> And when enough has accumulated, do some synchronization.
>
> Probably it's better to sum freed objects in a per-cpu variable, and
> to accumulate size rather than count.

RCU currently has no notion of size outside of possibly kfree_rcu(),
so that would be new information to RCU.
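If size is what matters, the natural place to capture it is at
kfree_rcu() queuing time, for example via ksize().  A sketch, with the
per-CPU counter being purely hypothetical rather than current RCU code:

	/* Hypothetical bookkeeping, not present in current RCU. */
	static DEFINE_PER_CPU(unsigned long, kfree_rcu_pending_bytes);

	static void kfree_rcu_account(const void *ptr)
	{
		/* ksize() reports the slab allocator's actual object size. */
		this_cpu_add(kfree_rcu_pending_bytes, ksize(ptr));
	}

with a matching this_cpu_sub() when the callback is finally invoked.
Either way, that is new bookkeeping on the RCU side.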
							Thanx, Paul

> > > > > All GFP_NOFAIL users should allow direct-reclaim, thus this
> > > > > loop in page_alloc shouldn't block RCU and doesn't need
> > > > > special care.
> > > >
> > > > I must defer to you guys on this.  The main caution is the
> > > > duration of direct reclaim.  After all, if it is too long, the
> > > > kfree_rcu() instance would have been better off just invoking
> > > > synchronize_rcu().
> > >
> > > Individual callers of kfree_rcu() have no idea of the load on RCU,
> > > nor how long direct reclaim is taking.  Calling synchronize_rcu()
> > > incorrectly has pretty major downsides to it, so nobody should be
> > > trying to expedite kfree_rcu() unless there is a good reason to do
> > > so (e.g. at unmount to ensure everything allocated by a filesystem
> > > has actually been freed).  Hence I'd much prefer the decision to
> > > expedite callbacks is made by the RCU subsystem based on its known
> > > callback load and some indication of how close memory reclaim is
> > > to declaring OOM...
> >
> > Sorry, I was unclear.  There is a new single-argument kfree_rcu()
> > under way that does not require an rcu_head in the structure being
> > freed.  However, in this case, kfree_rcu() might either allocate the
> > memory that is needed to track the memory to be freed on the one
> > hand, or just invoke synchronize_rcu() on the other.  So this
> > decision would be taken inside kfree_rcu(), and would not be visible
> > to either core RCU or the caller of kfree_rcu().
> >
> > This decision is made based on whether or not the allocator provides
> > kfree_rcu() the memory needed.  The tradeoff is what GFP flags are
> > supplied.  So the question kfree_rcu() has to answer is "Would it be
> > better to give myself to reclaim as an additional task, or would it
> > instead be better to just invoke synchronize_rcu() and then
> > immediately free()?"
> >
> > I am probably still unclear, but hopefully at least one step in the
> > right direction.
> >
> > 							Thanx, Paul
> >
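To make the quoted tradeoff concrete, the decision inside the new
single-argument kfree_rcu() would be shaped something like the sketch
below.  The slot structure and helper names are placeholders for
illustration, not the actual implementation:

	/* Sketch of the choice inside single-argument kfree_rcu(). */
	struct tracking_slot {
		struct rcu_head rh;
		void *ptr;
	};

	void kfree_rcu_sketch(void *ptr)
	{
		struct tracking_slot *slot;

		/* Try for tracking memory without entering reclaim. */
		slot = kmalloc(sizeof(*slot), GFP_NOWAIT | __GFP_NOWARN);
		if (slot) {
			slot->ptr = ptr;
			queue_after_grace_period(slot);	/* placeholder */
			return;
		}

		/* No memory for tracking: wait for a grace period. */
		synchronize_rcu();
		kfree(ptr);
	}

Whether the allocator says yes is exactly the GFP-flag tradeoff above,
and neither path is visible to core RCU or to the caller.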