On Thu, Apr 24, 2025 at 11:30:39AM +0200, Benjamin Berg wrote:
> On Thu, 2025-04-24 at 11:05 +0200, Sebastian Andrzej Siewior wrote:
> > On 2025-04-23 11:16:49 [-0700], Paul E. McKenney wrote:
> > > On Wed, Apr 23, 2025 at 05:17:31PM +0200, Benjamin Berg wrote:
> > > > Hi,
> > > > 
> > > > On Wed, 2025-01-08 at 10:04 +0100, Sebastian Andrzej Siewior wrote:
> > > > > __module_text_address() can be invoked within a RCU section, there is 
> > > > > no
> > > > > requirement to have preemption disabled.
> > > > > 
> > > > > Replace the preempt_disable() section around __module_text_address()
> > > > > with RCU.
> > > > 
> > > > Unfortunately, this patch causes a performance regression for us. The
> > > > trouble is that we enable kmemleak and run trace-cmd so a lot of stack
> > > > traces need to be collected. Obviously, we also have lockdep enabled.
> > > > 
> > > > Now, combine this with the UML stack dumping code calling into
> > > > __kernel_text_address a lot[1] and it really has a relevant performance
> > > > impact. I saw the kernel spending 40% of its own CPU time just on the
> > > > lock in is_module_text_address.
> > > > 
> > > > Maybe kernel_text_address should leave the RCU handling to the caller
> > > > and assume that the RCU read lock is already taken?
> > > > 
> > > > Benjamin
> > > > 
> > > > [1] The UM arch dump_stack function reads every "unsigned long" on the
> > > > stack and tests it using __kernel_text_address.
> > > 
> > > Use of a single guard(rcu)() is regressing performance?  Interesting and
> > > quite unexpected.  That said, tiven the amount of debug you have enabled,
> > > I am not so sure that people are going to be all that excited about a
> > > further performance regression.
> > > 
> > > But is this regression due to the cleanup hook that guard(rcu)()
> > > registers?  If so, please feel free to try using rcu_read_lock()
> > > and rcu_read_unlock() instead.  I would be surprised if this makes a
> > > difference, but then again, your initial regression report also comes
> > > as a surprise, so...
> > > 
> > > Another way to reduce guard(rcu)() overhead is to build your kernel
> > > with CONFIG_PREEMPT_NONE=y.  Not so good for real-time response, but
> > > then again, neither are your debug options.
> > 
> > The guard notation is not regression I guess it is just the plenty of
> > rcu_read_lock()/ unlock(). We had one regression which was "fixed" by
> > commit ee57ab5a32129 ("locking/lockdep: Disable KASAN instrumentation of 
> > lockdep.c").
> 
> Yup, we really pretty much created a micro-benchmark for grabbing stack
> traces.
> 
> > My guess would be that this is a preemptible kernel and the preempt
> > disable/ enable is cheaper that the RCU version. So going back to a
> > non-preemtible kernel should "fix" it.
> 
> Yes, preempt_disable() is extremely cheap.
> 
> > Looking at kernel_text_address(), is_bpf_text_address() has also a
> > RCU read section so probably subject to the same trouble. And
> > is_ftrace_trampoline() could be also converted to RCU which would
> > increase the trouble. 
> > 
> > Improving the stack trace on UM or caching some of the most common one
> > might help. Not sure if disabling kmemleak for lockdep is possible/
> > makes a difference.
> 
> What does seem to help is to simply disable lockdep inside dump_trace.
> That should be good enough for us at least, bringing the overhead down
> to a manageable amount when running these tests.
> Some unscientific numbers:
> 
> config                               dump_trace     locking
> ----
> no locking (preempt_disable)            6 %         -
> guard(rcu)() + lockdep_off             15 %         58 % of that
> rcu_read_lock + lockdep_off            17 %         60 % of that
> guard(rcu)()                           48 %         91 % of that
> 
> That confirms that guard(rcu)() really is not a problem. There might be
> slight overhead, but it is probably within the margin of error. Turning
> lockdep off/on inside the UML dump_trace() function brings down the
> overhead a lot and I guess that should be an acceptable level for us.

Whew!!!  ;-)

> Not sure if something like that would be desirable upstream. This is
> happening for us when running the hostap "hwsim" tests inside UML (with
> time-travel). At least internally, we could carry a custom patch to add
> the lockdep_off()/lockdep_on() to dump_trace in order to work around
> it[1].

That makes sense to me, but I am not the maintainer of that code.  ;-)

                                                        Thanx, Paul

> Benjamin
> 
> [1] Actually, now I am reminded that we already have that for kmemleak
> as lockdep was considerably slowing down the scanning.
> 
> > 
> > >                                                   Thanx, Paul
> > 
> > Sebastian
> > 
> 

Reply via email to