On Tue, 2025-01-21 at 09:08 -0800, Paul E. McKenney wrote:
> On Sat, Jan 18, 2025 at 12:24:33AM +0100, Frederic Weisbecker wrote:
> > hrtimers are migrated away from the dying CPU to any online target
> > at the CPUHP_AP_HRTIMERS_DYING stage, so that bandwidth timers
> > handling tasks involved in CPU hotplug forward progress are not
> > delayed.
> > 
> > However, wake-ups can still be performed by the outgoing CPU after
> > CPUHP_AP_HRTIMERS_DYING, and those can again result in bandwidth
> > timers being armed. Depending on several considerations (crystal
> > ball power-management-based election, earliest timer already
> > enqueued, timer migration enabled or not), the target may
> > eventually be the current CPU even if offline. If that happens,
> > the timer is eventually ignored.
> > 
> > The most notable example is RCU, which had to deal with each and
> > every one of those wake-ups by deferring them to an online CPU,
> > along with related workarounds:
> > 
> > _ e787644caf76 (rcu: Defer RCU kthreads wakeup when CPU is dying)
> > _ 9139f93209d1 (rcu/nocb: Fix RT throttling hrtimer armed from offline CPU)
> > _ f7345ccc62a4 (rcu/nocb: Fix rcuog wake-up from offline softirq)
> > 
> > The problem isn't confined to RCU, though: the stop machine kthread
> > (which runs CPUHP_AP_HRTIMERS_DYING) reports its completion at the
> > end of its work through cpu_stop_signal_done() and performs a wake
> > up that eventually arms the deadline server timer:
> > 
> >    WARNING: CPU: 94 PID: 588 at kernel/time/hrtimer.c:1086 hrtimer_start_range_ns+0x289/0x2d0
> >    CPU: 94 UID: 0 PID: 588 Comm: migration/94 Not tainted
> >    Stopper: multi_cpu_stop+0x0/0x120 <- stop_machine_cpuslocked+0x66/0xc0
> >    RIP: 0010:hrtimer_start_range_ns+0x289/0x2d0
> >    Call Trace:
> >    <TASK>
> >    ? hrtimer_start_range_ns
> >    start_dl_timer
> >    enqueue_dl_entity
> >    dl_server_start
> >    enqueue_task_fair
> >    enqueue_task
> >    ttwu_do_activate
> >    try_to_wake_up
> >    complete
> >    cpu_stopper_thread
> >    smpboot_thread_fn
> >    kthread
> >    ret_from_fork
> >    ret_from_fork_asm
> >    </TASK>
> > 
> > Instead of providing yet another band-aid to work around the
> > situation, fix it from the hrtimers infrastructure: always migrate
> > a timer away to an online target whenever it is enqueued from an
> > offline CPU.
> > 
> > This will also allow reverting all of the above disgraceful RCU
> > hacks.
> > 
> > Reported-by: Vlad Poenaru <vlad.w...@gmail.com>
> > Reported-by: Usama Arif <usamaarif...@gmail.com>
> > Fixes: 5c0930ccaad5 ("hrtimers: Push pending hrtimers away from outgoing CPU earlier")
> > Closes: 20241213203739.1519801-1-usamaarif...@gmail.com
> > Signed-off-by: Frederic Weisbecker <frede...@kernel.org>
> > Signed-off-by: Paul E. McKenney <paul...@kernel.org>
> 
> This passes over-holiday rcutorture testing, so, perhaps redundantly:
> 
> Tested-by: Paul E. McKenney <paul...@kernel.org>
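For readers following the thread: the fix described above boils down to
an "is this CPU offline?" check in the hrtimer enqueue/base-selection
path. Below is a minimal sketch of that idea, assuming it would sit in
kernel/time/hrtimer.c where the per-CPU hrtimer_bases is visible. It is
illustrative only, not the actual patch: base locking, the WRITE_ONCE()
publication of timer->base, and the real target-selection policy are
all simplified away, and switch_base_if_offline() is a hypothetical
helper name.

	/*
	 * Sketch: if the enqueueing CPU is already offline, re-home the
	 * timer onto an online CPU's clock base instead of the local one.
	 */
	static struct hrtimer_clock_base *
	switch_base_if_offline(struct hrtimer *timer,
			       struct hrtimer_clock_base *base)
	{
		struct hrtimer_cpu_base *new_cpu_base;
		int target;

		if (likely(!cpu_is_offline(smp_processor_id())))
			return base;

		/* Pick any online CPU as the timer's new home. */
		target = cpumask_any(cpu_online_mask);
		new_cpu_base = &per_cpu(hrtimer_bases, target);

		/* Keep the same clock id (MONOTONIC, BOOTTIME, ...). */
		base = &new_cpu_base->clock_base[base->index];
		timer->base = base;

		return base;
	}

The point of doing this inside the hrtimers infrastructure is that the
invariant then holds for every caller at once: a timer armed from an
offline CPU can never land on that CPU's clock base, so none of the
individual wake-up paths (RCU, stop machine, workqueues, ...) needs its
own deferral workaround.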
Hi,

I encountered the same issue even after applying this patch. Below are
the details of the warning and call trace.

migration/3: ------------[ cut here ]------------
migration/3: WARNING: CPU: 3 PID: 42 at kernel/time/hrtimer.c:1125 enqueue_hrtimer+0x7c/0xec
migration/3: CPU: 3 UID: 0 PID: 42 Comm: migration/3 Tainted: G OE 6.12.18-android16-0-g59cb5a849beb-4k #1 0b440e43fa7b24aaa3b7e6e5d2b938948e0cacdb
migration/3: Stopper: multi_cpu_stop+0x0/0x184 <- stop_machine_cpuslocked+0xc0/0x15c
migration/3: Call trace:
migration/3:  enqueue_hrtimer+0x7c/0xec
migration/3:  hrtimer_start_range_ns+0x54c/0x7b4
migration/3:  start_dl_timer+0x110/0x188
migration/3:  enqueue_dl_entity+0x478/0x80c
migration/3:  dl_server_start+0xf8/0x194
migration/3:  enqueue_task_fair+0x6c4/0x924
migration/3:  enqueue_task+0x70/0x3a4
migration/3:  do_activate_task+0xcc/0x178
migration/3:  ttwu_do_activate+0xac/0x2e0
migration/3:  try_to_wake_up+0x73c/0xaf4
migration/3:  wake_up_process+0x18/0x28
migration/3:  kick_pool+0xc4/0x170
migration/3:  __queue_work+0x44c/0x698
migration/3:  delayed_work_timer_fn+0x20/0x30
migration/3:  call_timer_fn+0x4c/0x1d0
migration/3:  __run_timer_base+0x278/0x350
migration/3:  run_timer_softirq+0x78/0x9c
migration/3:  handle_softirqs+0x154/0x42c
migration/3:  __do_softirq+0x14/0x20
migration/3:  ____do_softirq+0x10/0x20
migration/3:  call_on_irq_stack+0x3c/0x74
migration/3:  do_softirq_own_stack+0x1c/0x2c
migration/3:  __irq_exit_rcu+0x5c/0xd4
migration/3:  irq_exit_rcu+0x10/0x1c
migration/3:  el1_interrupt+0x38/0x58
migration/3:  el1h_64_irq_handler+0x18/0x24
migration/3:  el1h_64_irq+0x68/0x6c
migration/3:  multi_cpu_stop+0x15c/0x184
migration/3:  cpu_stopper_thread+0xf8/0x1b0
migration/3:  smpboot_thread_fn+0x204/0x304
migration/3:  kthread+0x110/0x1a4
migration/3:  ret_from_fork+0x10/0x20
migration/3: ---[ end trace 0000000000000000 ]---
kworker/1:1: psci: CPU0 killed (polled 0 ms)