On Thu, Apr 24, 2025 at 11:30:39AM +0200, Benjamin Berg wrote:
> On Thu, 2025-04-24 at 11:05 +0200, Sebastian Andrzej Siewior wrote:
> > On 2025-04-23 11:16:49 [-0700], Paul E. McKenney wrote:
> > > On Wed, Apr 23, 2025 at 05:17:31PM +0200, Benjamin Berg wrote:
> > > > Hi,
> > > >
> > > > On Wed, 2025-01-08 at 10:04 +0100, Sebastian Andrzej Siewior wrote:
> > > > > __module_text_address() can be invoked within a RCU section; there is
> > > > > no requirement to have preemption disabled.
> > > > >
> > > > > Replace the preempt_disable() section around __module_text_address()
> > > > > with RCU.
> > > >
> > > > Unfortunately, this patch causes a performance regression for us. The
> > > > trouble is that we enable kmemleak and run trace-cmd, so a lot of stack
> > > > traces need to be collected. Obviously, we also have lockdep enabled.
> > > >
> > > > Now, combine this with the UML stack dumping code calling into
> > > > __kernel_text_address a lot[1] and it really has a relevant performance
> > > > impact. I saw the kernel spending 40% of its own CPU time just on the
> > > > lock in is_module_text_address.
> > > >
> > > > Maybe kernel_text_address should leave the RCU handling to the caller
> > > > and assume that the RCU read lock is already taken?
> > > >
> > > > Benjamin
> > > >
> > > > [1] The UM arch dump_stack function reads every "unsigned long" on the
> > > > stack and tests it using __kernel_text_address.
> > >
> > > Use of a single guard(rcu)() is regressing performance? Interesting and
> > > quite unexpected. That said, given the amount of debug you have enabled,
> > > I am not so sure that people are going to be all that excited about a
> > > further performance regression.
> > >
> > > But is this regression due to the cleanup hook that guard(rcu)()
> > > registers? If so, please feel free to try using rcu_read_lock()
> > > and rcu_read_unlock() instead.
> > > I would be surprised if this makes a
> > > difference, but then again, your initial regression report also comes
> > > as a surprise, so...
> > >
> > > Another way to reduce guard(rcu)() overhead is to build your kernel
> > > with CONFIG_PREEMPT_NONE=y. Not so good for real-time response, but
> > > then again, neither are your debug options.
> >
> > The guard notation is not the regression, I guess; it is just the plenty
> > of rcu_read_lock()/unlock(). We had one regression which was "fixed" by
> > commit ee57ab5a32129 ("locking/lockdep: Disable KASAN instrumentation of
> > lockdep.c").
>
> Yup, we really pretty much created a micro-benchmark for grabbing stack
> traces.
>
> > My guess would be that this is a preemptible kernel and the preempt
> > disable/enable is cheaper than the RCU version. So going back to a
> > non-preemptible kernel should "fix" it.
>
> Yes, preempt_disable() is extremely cheap.
>
> > Looking at kernel_text_address(), is_bpf_text_address() also has an
> > RCU read section, so it is probably subject to the same trouble. And
> > is_ftrace_trampoline() could also be converted to RCU, which would
> > increase the trouble.
> >
> > Improving the stack trace on UM or caching some of the most common ones
> > might help. Not sure if disabling kmemleak for lockdep is possible/
> > makes a difference.
>
> What does seem to help is to simply disable lockdep inside dump_trace.
> That should be good enough for us at least, bringing the overhead down
> to a manageable amount when running these tests.
>
> Some unscientific numbers:
>
>   config                         dump_trace   locking
>   ---------------------------------------------------
>   no locking (preempt_disable)    6 %         -
>   guard(rcu)() + lockdep_off     15 %         58 % of that
>   rcu_read_lock + lockdep_off    17 %         60 % of that
>   guard(rcu)()                   48 %         91 % of that
>
> That confirms that guard(rcu)() really is not a problem. There might be
> slight overhead, but it is probably within the margin of error.
> Turning lockdep off/on inside the UML dump_trace() function brings down
> the overhead a lot and I guess that should be an acceptable level for us.
Whew!!!  ;-)

> Not sure if something like that would be desirable upstream. This is
> happening for us when running the hostap "hwsim" tests inside UML (with
> time-travel). At least internally, we could carry a custom patch to add
> the lockdep_off()/lockdep_on() to dump_trace in order to work around
> it[1].

That makes sense to me, but I am not the maintainer of that code.  ;-)

							Thanx, Paul

> Benjamin
>
> [1] Actually, now I am reminded that we already have that for kmemleak,
> as lockdep was considerably slowing down the scanning.
>
> > > 						Thanx, Paul
> >
> > Sebastian
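The workaround under discussion amounts to bracketing the UML stack walk with lockdep_off()/lockdep_on(), so that the many __kernel_text_address() probes skip lockdep's per-lock bookkeeping. A rough sketch of what such a patch might look like, assuming the dump_trace() shape in arch/um/kernel/stacktrace.c; the loop body is a simplified stand-in, and the exact helper names (get_stack_pointer(), kstack_end(), the ops->address() callback) follow the existing UML/stack_frame_ops conventions rather than being a tested change:

```c
/* Sketch only, not a tested patch: disable lockdep around the UML
 * stack walk, since every word on the stack is probed with
 * __kernel_text_address() below.
 */
void dump_trace(struct task_struct *tsk,
		const struct stack_frame_ops *ops, void *data)
{
	unsigned long *sp = get_stack_pointer(tsk, NULL);
	unsigned long addr;

	lockdep_off();	/* avoid lockdep cost on every probe below */
	while (!kstack_end(sp)) {
		addr = READ_ONCE_NOCHECK(*sp++);
		/* The hot call: with the RCU conversion, each probe
		 * takes rcu_read_lock() inside is_module_text_address()
		 * and friends, and lockdep otherwise tracks every one.
		 */
		if (__kernel_text_address(addr))
			ops->address(data, addr, 0);
	}
	lockdep_on();
}
```

Per the numbers above, this keeps the guard(rcu)()-based __module_text_address() while recovering most of the lost dump_trace() performance, since the lockdep accounting, not the RCU read-side critical section itself, dominates the cost.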