On Fri, May 31, 2013 at 05:27:49PM -0400, Steven Rostedt wrote: > Paul, > > I've been debugging the last couple of days why my tests have been > locking up. One of my tracing tests, runs all available tracers. The > lockup always happened with the mmiotrace, which is used to trace > interactions between priority drivers and the kernel. But to do this > easily, when the tracer gets registered, it disables all but the boot > CPUs. The lockup always happened after it got done disabling the CPUs. > > Then I decided to try this: > > while :; do > for i in 1 2 3; do > echo 0 > /sys/devices/system/cpu/cpu$i/online > done > for i in 1 2 3; do > echo 1 > /sys/devices/system/cpu/cpu$i/online > done > done > > Well, sure enough, that locked up too, with the same users. Doing a > sysrq-w (showing all blocked tasks):
Impressive debugging!!! And that is what I call one gnarly deadlock! Your patch looks like it should fix the problem, but my immediate reaction was that it would be simpler to have rcu_gp_init() do either cpu_maps_update_begin(), get_online_cpus(), or cpu_hotplug_begin() if CONFIG_PROVE_RCU_DELAY instead of the current mutex_lock(&rsp->onoff_mutex). (My first choice would be get_online_cpus(), but I am not sure that I fully understand the deadlock.) Or am I missing something about the nature of this deadlock? One concern is that if I made that change, and if any hotplug notifier waited for a grace period, there would be another deadlock. Which might well be why this acquires ->onoff_lock. Hmmm... OK, another possible simplification would be to use udelay() or something similar to do the waiting, and maybe dial down the delay from the current two jiffies to (say) 200 microseconds. I could adjust the "if" condition to make the delay more probable to get roughly the same testing intensity as the current code has. Other thoughts? Thanx, Paul > [ 2991.344562] task PC stack pid father > [ 2991.344562] rcu_preempt D ffff88007986fdf8 0 10 2 > 0x00000000 > [ 2991.344562] ffff88007986fc98 0000000000000002 ffff88007986fc48 > 0000000000000908 > [ 2991.344562] ffff88007986c280 ffff88007986ffd8 ffff88007986ffd8 > 00000000001d3c80 > [ 2991.344562] ffff880079248a40 ffff88007986c280 0000000000000000 > 00000000fffd4295 > [ 2991.344562] Call Trace: > [ 2991.344562] [<ffffffff815437ba>] schedule+0x64/0x66 > [ 2991.344562] [<ffffffff81541750>] schedule_timeout+0xbc/0xf9 > [ 2991.344562] [<ffffffff8154bec0>] ? ftrace_call+0x5/0x2f > [ 2991.344562] [<ffffffff81049513>] ? cascade+0xa8/0xa8 > [ 2991.344562] [<ffffffff815417ab>] > schedule_timeout_uninterruptible+0x1e/0x20 > [ 2991.344562] [<ffffffff810c980c>] rcu_gp_kthread+0x502/0x94b > [ 2991.344562] [<ffffffff81062791>] ? __init_waitqueue_head+0x50/0x50 > [ 2991.344562] [<ffffffff810c930a>] ? rcu_gp_fqs+0x64/0x64 > [ 2991.344562] [<ffffffff81061cdb>] kthread+0xb1/0xb9 > [ 2991.344562] [<ffffffff81091e31>] ? lock_release_holdtime.part.23+0x4e/0x55 > [ 2991.344562] [<ffffffff81061c2a>] ? __init_kthread_worker+0x58/0x58 > [ 2991.344562] [<ffffffff8154c1dc>] ret_from_fork+0x7c/0xb0 > [ 2991.344562] [<ffffffff81061c2a>] ? __init_kthread_worker+0x58/0x58 > [ 2991.344562] kworker/0:1 D ffffffff81a30680 0 47 2 > 0x00000000 > [ 2991.344562] Workqueue: events cpuset_hotplug_workfn > [ 2991.344562] ffff880078dbbb58 0000000000000002 0000000000000006 > 00000000000000d8 > [ 2991.344562] ffff880078db8100 ffff880078dbbfd8 ffff880078dbbfd8 > 00000000001d3c80 > [ 2991.344562] ffff8800779ca5c0 ffff880078db8100 ffffffff81541fcf > 0000000000000000 > [ 2991.344562] Call Trace: > [ 2991.344562] [<ffffffff81541fcf>] ? __mutex_lock_common+0x3d4/0x609 > [ 2991.344562] [<ffffffff815437ba>] schedule+0x64/0x66 > [ 2991.344562] [<ffffffff81543a39>] schedule_preempt_disabled+0x18/0x24 > [ 2991.344562] [<ffffffff81541fcf>] __mutex_lock_common+0x3d4/0x609 > [ 2991.344562] [<ffffffff8103d11b>] ? get_online_cpus+0x3c/0x50 > [ 2991.344562] [<ffffffff8103d11b>] ? get_online_cpus+0x3c/0x50 > [ 2991.344562] [<ffffffff815422ff>] mutex_lock_nested+0x3b/0x40 > [ 2991.344562] [<ffffffff8103d11b>] get_online_cpus+0x3c/0x50 > [ 2991.344562] [<ffffffff810af7e6>] rebuild_sched_domains_locked+0x6e/0x3a8 > [ 2991.344562] [<ffffffff810b0ec6>] rebuild_sched_domains+0x1c/0x2a > [ 2991.344562] [<ffffffff810b109b>] cpuset_hotplug_workfn+0x1c7/0x1d3 > [ 2991.344562] [<ffffffff810b0ed9>] ? cpuset_hotplug_workfn+0x5/0x1d3 > [ 2991.344562] [<ffffffff81058e07>] process_one_work+0x2d4/0x4d1 > [ 2991.344562] [<ffffffff81058d3a>] ? process_one_work+0x207/0x4d1 > [ 2991.344562] [<ffffffff8105964c>] worker_thread+0x2e7/0x3b5 > [ 2991.344562] [<ffffffff81059365>] ? rescuer_thread+0x332/0x332 > [ 2991.344562] [<ffffffff81061cdb>] kthread+0xb1/0xb9 > [ 2991.344562] [<ffffffff81061c2a>] ? __init_kthread_worker+0x58/0x58 > [ 2991.344562] [<ffffffff8154c1dc>] ret_from_fork+0x7c/0xb0 > [ 2991.344562] [<ffffffff81061c2a>] ? __init_kthread_worker+0x58/0x58 > [ 2991.344562] bash D ffffffff81a4aa80 0 2618 2612 > 0x10000000 > [ 2991.344562] ffff8800379abb58 0000000000000002 0000000000000006 > 0000000000000c2c > [ 2991.344562] ffff880077fea140 ffff8800379abfd8 ffff8800379abfd8 > 00000000001d3c80 > [ 2991.344562] ffff8800779ca5c0 ffff880077fea140 ffffffff81541fcf > 0000000000000000 > [ 2991.344562] Call Trace: > [ 2991.344562] [<ffffffff81541fcf>] ? __mutex_lock_common+0x3d4/0x609 > [ 2991.344562] [<ffffffff815437ba>] schedule+0x64/0x66 > [ 2991.344562] [<ffffffff81543a39>] schedule_preempt_disabled+0x18/0x24 > [ 2991.344562] [<ffffffff81541fcf>] __mutex_lock_common+0x3d4/0x609 > [ 2991.344562] [<ffffffff81530078>] ? rcu_cpu_notify+0x2f5/0x86e > [ 2991.344562] [<ffffffff81530078>] ? rcu_cpu_notify+0x2f5/0x86e > [ 2991.344562] [<ffffffff815422ff>] mutex_lock_nested+0x3b/0x40 > [ 2991.344562] [<ffffffff81530078>] rcu_cpu_notify+0x2f5/0x86e > [ 2991.344562] [<ffffffff81091c99>] ? __lock_is_held+0x32/0x53 > [ 2991.344562] [<ffffffff81548912>] notifier_call_chain+0x6b/0x98 > [ 2991.344562] [<ffffffff810671fd>] __raw_notifier_call_chain+0xe/0x10 > [ 2991.344562] [<ffffffff8103cf64>] __cpu_notify+0x20/0x32 > [ 2991.344562] [<ffffffff8103cf8d>] cpu_notify_nofail+0x17/0x36 > [ 2991.344562] [<ffffffff815225de>] _cpu_down+0x154/0x259 > [ 2991.344562] [<ffffffff81522710>] cpu_down+0x2d/0x3a > [ 2991.344562] [<ffffffff81526351>] store_online+0x4e/0xe7 > [ 2991.344562] [<ffffffff8134d764>] dev_attr_store+0x20/0x22 > [ 2991.344562] [<ffffffff811b3c5f>] sysfs_write_file+0x108/0x144 > [ 2991.344562] [<ffffffff8114c5ef>] vfs_write+0xfd/0x158 > [ 2991.344562] [<ffffffff8114c928>] SyS_write+0x5c/0x83 > [ 2991.344562] [<ffffffff8154c494>] tracesys+0xdd/0xe2 > > As well as held locks: > > [ 3034.728033] Showing all locks held in the system: > [ 3034.728033] 1 lock held by rcu_preempt/10: > [ 3034.728033] #0: (rcu_preempt_state.onoff_mutex){+.+...}, at: > [<ffffffff810c9471>] rcu_gp_kthread+0x167/0x94b > [ 3034.728033] 4 locks held by kworker/0:1/47: > [ 3034.728033] #0: (events){.+.+.+}, at: [<ffffffff81058d3a>] > process_one_work+0x207/0x4d1 > [ 3034.728033] #1: (cpuset_hotplug_work){+.+.+.}, at: [<ffffffff81058d3a>] > process_one_work+0x207/0x4d1 > [ 3034.728033] #2: (cpuset_mutex){+.+.+.}, at: [<ffffffff810b0ec1>] > rebuild_sched_domains+0x17/0x2a > [ 3034.728033] #3: (cpu_hotplug.lock){+.+.+.}, at: [<ffffffff8103d11b>] > get_online_cpus+0x3c/0x50 > [ 3034.728033] 1 lock held by mingetty/2563: > [ 3034.728033] #0: (&ldata->atomic_read_lock){+.+...}, at: > [<ffffffff8131e28a>] n_tty_read+0x252/0x7e8 > [ 3034.728033] 1 lock held by mingetty/2565: > [ 3034.728033] #0: (&ldata->atomic_read_lock){+.+...}, at: > [<ffffffff8131e28a>] n_tty_read+0x252/0x7e8 > [ 3034.728033] 1 lock held by mingetty/2569: > [ 3034.728033] #0: (&ldata->atomic_read_lock){+.+...}, at: > [<ffffffff8131e28a>] n_tty_read+0x252/0x7e8 > [ 3034.728033] 1 lock held by mingetty/2572: > [ 3034.728033] #0: (&ldata->atomic_read_lock){+.+...}, at: > [<ffffffff8131e28a>] n_tty_read+0x252/0x7e8 > [ 3034.728033] 1 lock held by mingetty/2575: > [ 3034.728033] #0: (&ldata->atomic_read_lock){+.+...}, at: > [<ffffffff8131e28a>] n_tty_read+0x252/0x7e8 > [ 3034.728033] 7 locks held by bash/2618: > [ 3034.728033] #0: (sb_writers#5){.+.+.+}, at: [<ffffffff8114bc3f>] > file_start_write+0x2a/0x2c > [ 3034.728033] #1: (&buffer->mutex#2){+.+.+.}, at: [<ffffffff811b3b93>] > sysfs_write_file+0x3c/0x144 > [ 3034.728033] #2: (s_active#54){.+.+.+}, at: [<ffffffff811b3c3e>] > sysfs_write_file+0xe7/0x144 > [ 3034.728033] #3: (x86_cpu_hotplug_driver_mutex){+.+.+.}, at: > [<ffffffff810217c2>] cpu_hotplug_driver_lock+0x17/0x19 > [ 3034.728033] #4: (cpu_add_remove_lock){+.+.+.}, at: [<ffffffff8103d196>] > cpu_maps_update_begin+0x17/0x19 > [ 3034.728033] #5: (cpu_hotplug.lock){+.+.+.}, at: [<ffffffff8103cfd8>] > cpu_hotplug_begin+0x2c/0x6d > [ 3034.728033] #6: (rcu_preempt_state.onoff_mutex){+.+...}, at: > [<ffffffff81530078>] rcu_cpu_notify+0x2f5/0x86e > [ 3034.728033] 1 lock held by bash/2980: > [ 3034.728033] #0: (&ldata->atomic_read_lock){+.+...}, at: > [<ffffffff8131e28a>] n_tty_read+0x252/0x7e8 > > Things looked a little weird. Also, this is a deadlock that lockdep did > not catch. But what we have here does not look like a circular lock > issue: > > Bash is blocked in rcu_cpu_notify(): > > 1961 /* Exclude any attempts to start a new grace period. */ > 1962 mutex_lock(&rsp->onoff_mutex); > > > kworker is blocked in get_online_cpus(), which makes sense as we are > currently taking down a CPU. > > But rcu_preempt is not blocked on anything. It is simply sleeping in > rcu_gp_kthread (really rcu_gp_init) here: > > 1453 #ifdef CONFIG_PROVE_RCU_DELAY > 1454 if ((prandom_u32() % (rcu_num_nodes * 8)) == 0 && > 1455 system_state == SYSTEM_RUNNING) > 1456 schedule_timeout_uninterruptible(2); > 1457 #endif /* #ifdef CONFIG_PROVE_RCU_DELAY */ > > And it does this while holding the onoff_mutex that bash is waiting for. > > Doing a function trace, it showed me where it happened: > > [ 125.940066] rcu_pree-10 3.... 28384115273: > schedule_timeout_uninterruptible <-rcu_gp_kthread > [...] > [ 125.940066] rcu_pree-10 3d..3 28384202439: sched_switch: > prev_comm=rcu_preempt prev_pid=10 prev_prio=120 prev_state=D ==> > next_comm=watchdog/3 next_pid=38 next_prio=120 > > The watchdog ran, and then: > > [ 125.940066] watchdog-38 3d..3 28384692863: sched_switch: > prev_comm=watchdog/3 prev_pid=38 prev_prio=120 prev_state=P ==> > next_comm=modprobe next_pid=2848 next_prio=118 > > Not sure what modprobe was doing, but shortly after that: > > [ 125.940066] modprobe-2848 3d..3 28385041749: sched_switch: > prev_comm=modprobe prev_pid=2848 prev_prio=118 prev_state=R+ ==> > next_comm=migration/3 next_pid=40 next_prio=0 > > Where the migration thread took down the CPU: > > [ 125.940066] migratio-40 3d..3 28389148276: sched_switch: > prev_comm=migration/3 prev_pid=40 prev_prio=0 prev_state=P ==> > next_comm=swapper/3 next_pid=0 next_prio=120 > > which finally did: > > [ 125.940066] <idle>-0 3...1 28389282142: arch_cpu_idle_dead > <-cpu_startup_entry > [ 125.940066] <idle>-0 3...1 28389282548: native_play_dead > <-arch_cpu_idle_dead > [ 125.940066] <idle>-0 3...1 28389282924: play_dead_common > <-native_play_dead > [ 125.940066] <idle>-0 3...1 28389283468: idle_task_exit > <-play_dead_common > [ 125.940066] <idle>-0 3...1 28389284644: amd_e400_remove_cpu > <-play_dead_common > > > CPU 3 is now offline, the rcu_preempt thread that ran on CPU 3 is still > doing a schedule_timeout_uninterruptible() and it registered it's > timeout to the timer base for CPU 3. You would think that it would get > migrated right? The issue here is that the timer migration happens at > the CPU notifier for CPU_DEAD. The problem is that the rcu notifier for > CPU_DOWN is blocked waiting for the onoff_mutex to be released, which is > held by the thread that just put itself into a uninterruptible sleep, > that wont wake up until the CPU_DEAD notifier of the timer > infrastructure is called, which wont happen until the rcu notifier > finishes. Here's our deadlock! > > Not sure how to solve this. I added this hack. Although, even though > it's a hack, it's a fix that's for a hack that is used for debugging > purposes only. > > I added a waitqueue and a flag. The flag gets set when a cpu is going > down and cleared after it is down or dead. The rcu_preempt thread will > add itself to the waitqueue and then check if any CPU is going down. If > one is, it wont do the schedule, if one is not, then it does the > schedule. When a cpu goes down, the rcu cpu notifier will wake up the > thread before it does more work. > > This patch does prevent the issue from occurring. > > Signed-off-by: Steven Rostedt <rost...@goodmis.org> > > Index: linux-trace.git/kernel/rcutree.c > =================================================================== > --- linux-trace.git.orig/kernel/rcutree.c > +++ linux-trace.git/kernel/rcutree.c > @@ -1396,6 +1396,34 @@ rcu_start_gp_per_cpu(struct rcu_state *r > __note_new_gpnum(rsp, rnp, rdp); > } > > +#ifdef CONFIG_PROVE_RCU_DELAY > +/* Need to kick rcu_preempt kthread if cpu is going down */ > +static DECLARE_WAIT_QUEUE_HEAD(rcu_delay_wait); > +static bool rcu_delay_cpu_going_down; > +static void rcu_delay_down_done(void) > +{ > + rcu_delay_cpu_going_down = false; > +} > +static void rcu_delay_down(void) > +{ > + rcu_delay_cpu_going_down = true; > + /* Make sure the rcu_preempt thread see this set */ > + smp_wmb(); > + /* > + * We may wake up rcu_preempt threads for other CPUs, but > + * that's OK. It's sleeping for debug purposes only. > + */ > + wake_up_interruptible(&rcu_delay_wait); > +} > +#else > +static inline void rcu_delay_down_done(void) > +{ > +} > +static inline void rcu_delay_down(void) > +{ > +} > +#endif > + > /* > * Initialize a new grace period. > */ > @@ -1452,8 +1480,26 @@ static int rcu_gp_init(struct rcu_state > raw_spin_unlock_irq(&rnp->lock); > #ifdef CONFIG_PROVE_RCU_DELAY > if ((prandom_u32() % (rcu_num_nodes * 8)) == 0 && > - system_state == SYSTEM_RUNNING) > - schedule_timeout_uninterruptible(2); > + system_state == SYSTEM_RUNNING) { > + DECLARE_WAITQUEUE(wait, current); > + /* > + * If the current CPU goes offline, the rcu_preempt may > + * never wake up from the schedule_timeout(). This is > + * because it holds the onoff_mutex which gets taken > + * by the RCU CPU down notifier. This will prevent the > timer > + * notifier from migrating the rcu_preempt and letting > + * it wake up, causing a deadlock. > + * Thus we have this hacky waitqueue and flag to make > sure > + * that if the CPU goes down, we either skip the > + * schedule_timeout or we wake up the rcu_preempt > thread. > + */ > + set_current_state(TASK_UNINTERRUPTIBLE); > + add_wait_queue(&rcu_delay_wait, &wait); > + if (!rcu_delay_cpu_going_down) > + schedule_timeout(2); > + __set_current_state(TASK_RUNNING); > + remove_wait_queue(&rcu_delay_wait, &wait); > + } > #endif /* #ifdef CONFIG_PROVE_RCU_DELAY */ > cond_resched(); > } > @@ -3074,8 +3120,10 @@ static int __cpuinit rcu_cpu_notify(stru > case CPU_ONLINE: > case CPU_DOWN_FAILED: > rcu_boost_kthread_setaffinity(rnp, -1); > + rcu_delay_down_done(); > break; > case CPU_DOWN_PREPARE: > + rcu_delay_down(); > rcu_boost_kthread_setaffinity(rnp, cpu); > break; > case CPU_DYING: > @@ -3087,6 +3135,7 @@ static int __cpuinit rcu_cpu_notify(stru > case CPU_DEAD_FROZEN: > case CPU_UP_CANCELED: > case CPU_UP_CANCELED_FROZEN: > + rcu_delay_down_done(); > for_each_rcu_flavor(rsp) > rcu_cleanup_dead_cpu(cpu, rsp); > break; > > -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/