On Tue, Aug 12, 2014 at 09:31:37AM +1000, Anton Blanchard wrote:
> The hard lockup detector uses a PMU event as a periodic NMI to
> detect if we are stuck (where stuck means no timer interrupts have
> occurred).
>
> Ben's rework of the ppc64 soft disable code has made ppc64 PMU
> exceptions a partial NMI. They can get disabled if an external interrupt
> comes in, but otherwise PMU interrupts will fire in interrupt disabled
> regions.
>
> I wrote a kernel module to test this patch and noticed we sometimes
> missed hard lockup warnings. The RCU code detected the stall first and
> issued an IPI to backtrace all CPUs. Unfortunately an IPI is an external
> interrupt and that will hard disable interrupts, preventing the hard
> lockup detector from going off.
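
For readers unfamiliar with the check described in the quoted text, here is a minimal user-space sketch of the idea: the timer interrupt bumps a counter, and the periodic NMI declares a hard lockup if that counter has not moved since the previous NMI. The struct and function names are hypothetical and chosen for illustration only; this is a model of the logic, not the kernel's watchdog code.

```c
#include <stdbool.h>

/* Hypothetical per-CPU watchdog state (illustrative names, not kernel API). */
struct watchdog_state {
	unsigned long timer_ticks;       /* bumped by the simulated timer interrupt */
	unsigned long timer_ticks_saved; /* value observed at the previous NMI check */
};

/* Called from the (simulated) timer interrupt: shows interrupts still arrive. */
static void timer_tick(struct watchdog_state *w)
{
	w->timer_ticks++;
}

/*
 * Called from the (simulated) periodic PMU NMI.
 * Returns true if no timer interrupt has occurred since the last check,
 * which is the "stuck" condition described above.
 */
static bool nmi_check_hard_lockup(struct watchdog_state *w)
{
	if (w->timer_ticks == w->timer_ticks_saved)
		return true;

	w->timer_ticks_saved = w->timer_ticks;
	return false;
}
```

In this model, the missed warnings correspond to nmi_check_hard_lockup() never being called: once the IPI hard disables interrupts, the PMU exception that would run the check no longer fires.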

If it helps, commit bc1dce514e9b (rcu: Don't use NMIs to dump other
CPUs' stacks) makes RCU avoid this behavior.  It instead reads the
stacks out remotely when this commit is applied.  It is in -tip, and
should make mainline this merge window.  Corresponding patch below.

Thanx, Paul

------------------------------------------------------------------------

rcu: Don't use NMIs to dump other CPUs' stacks

Although NMI-based stack dumps are in principle more accurate, they are
also more likely to trigger deadlocks.  This commit therefore replaces
all uses of trigger_all_cpu_backtrace() with rcu_dump_cpu_stacks(), so
that the CPU detecting an RCU CPU stall does the stack dumping.

Signed-off-by: Paul E. McKenney <paul...@linux.vnet.ibm.com>
Reviewed-by: Lai Jiangshan <la...@cn.fujitsu.com>

diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index 3f93033d3c61..8f3e4d43d736 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -1013,10 +1013,7 @@ static void record_gp_stall_check_time(struct rcu_state *rsp)
 }
 
 /*
- * Dump stacks of all tasks running on stalled CPUs.  This is a fallback
- * for architectures that do not implement trigger_all_cpu_backtrace().
- * The NMI-triggered stack traces are more accurate because they are
- * printed by the target CPU.
+ * Dump stacks of all tasks running on stalled CPUs.
  */
 static void rcu_dump_cpu_stacks(struct rcu_state *rsp)
 {
@@ -1094,7 +1091,7 @@ static void print_other_cpu_stall(struct rcu_state *rsp)
 	       (long)rsp->gpnum, (long)rsp->completed, totqlen);
 	if (ndetected == 0)
 		pr_err("INFO: Stall ended before state dump start\n");
-	else if (!trigger_all_cpu_backtrace())
+	else
 		rcu_dump_cpu_stacks(rsp);
 
 	/* Complain about tasks blocking the grace period. */
@@ -1125,8 +1122,7 @@ static void print_cpu_stall(struct rcu_state *rsp)
 	pr_cont(" (t=%lu jiffies g=%ld c=%ld q=%lu)\n",
 		jiffies - rsp->gp_start,
 		(long)rsp->gpnum, (long)rsp->completed, totqlen);
-	if (!trigger_all_cpu_backtrace())
-		dump_stack();
+	rcu_dump_cpu_stacks(rsp);
 
 	raw_spin_lock_irqsave(&rnp->lock, flags);
 	if (ULONG_CMP_GE(jiffies, ACCESS_ONCE(rsp->jiffies_stall)))
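
To illustrate what "the CPU detecting an RCU CPU stall does the stack dumping" means in practice, here is a rough user-space sketch. It assumes the detecting CPU walks the leaf nodes and picks every CPU whose bit is still set in a per-node mask, without sending anything to the target CPUs. The struct is a hypothetical, simplified stand-in loosely modeled on the rcu_node fields qsmask, grplo and grphi, and collect_stalled_cpus() is an invented helper, not the body of rcu_dump_cpu_stacks().

```c
#include <stddef.h>

/* Hypothetical, simplified stand-in for a leaf rcu_node. */
struct leaf_node {
	unsigned long qsmask; /* bit i set: CPU (grplo + i) has not yet checked in */
	int grplo;            /* lowest CPU number covered by this node */
	int grphi;            /* highest CPU number covered by this node */
};

/*
 * Collect the CPUs whose stacks the detecting CPU would dump.
 * Writes at most max entries into out and returns the number written.
 * Assumes each node covers fewer CPUs than there are bits in unsigned long.
 */
static size_t collect_stalled_cpus(const struct leaf_node *nodes, size_t nnodes,
				   int *out, size_t max)
{
	size_t n = 0;

	for (size_t i = 0; i < nnodes; i++) {
		const struct leaf_node *rnp = &nodes[i];

		if (rnp->qsmask == 0)
			continue; /* no stalled CPUs under this node */

		for (int cpu = 0; cpu <= rnp->grphi - rnp->grplo; cpu++) {
			if (((rnp->qsmask >> cpu) & 0x1UL) && n < max)
				out[n++] = rnp->grplo + cpu;
		}
	}
	return n;
}
```

The point of the sketch is that the whole walk runs on the detecting CPU. No IPI reaches the stalled CPUs, so nothing in this path hard disables interrupts on them.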