On Sun, Mar 15, 2015 at 07:32:32PM -0400, Sasha Levin wrote:
> On 03/15/2015 04:43 PM, Paul E. McKenney wrote:
> > But I did find a bug that would result in the other warnings, and could
> > also result in too-short grace periods, which could in turn result in
> > arbitrarily arbitrary misbehavior.  The patch below, which is also on
> > its way into -next, should fix this.  Please let me know how it does
> > for you.
>
> I've stopped seeing the warnings I've previously reported, but started
> seeing a new one:
>
> [ 788.564596] WARNING: CPU: 12 PID: 9711 at kernel/rcu/tree.c:2201 rcu_report_qs_rnp+0x42e/0x5a0()
> [ 788.568123] Modules linked in:
> [ 788.568123] CPU: 12 PID: 9711 Comm: trinity-main Not tainted 4.0.0-rc3-next-20150313-sasha-00041-g83a3dc8-dirty #2078
> [ 788.568123] ffff8803a1ba0000 00000000400df16a ffff880442807cc8 ffffffffb1ab01ca
> [ 788.568123] 0000000000000000 0000000000000000 ffff880442807d18 ffffffffa71e261a
> [ 788.568123] dffffc0000000000 ffffffffa733d2ee ffff880442807d28 ffffffffb4724000
> [ 788.568123] Call Trace:
> [ 788.568123] <IRQ> dump_stack (lib/dump_stack.c:52)
> [ 788.568123] warn_slowpath_common (kernel/panic.c:447)
> [ 788.568123] ? rcu_report_qs_rnp (kernel/rcu/tree.c:2201 (discriminator 3))
> [ 788.568123] warn_slowpath_null (kernel/panic.c:481)
> [ 788.568123] rcu_report_qs_rnp (kernel/rcu/tree.c:2201 (discriminator 3))
> [ 788.568123] rcu_process_callbacks (kernel/rcu/tree.c:2302 kernel/rcu/tree.c:2338 kernel/rcu/tree.c:2824 kernel/rcu/tree.c:2857)
> [ 788.568123] __do_softirq (kernel/softirq.c:273 include/linux/jump_label.h:114 include/trace/events/irq.h:126 kernel/softirq.c:274)
> [ 788.568123] irq_exit (kernel/softirq.c:350 kernel/softirq.c:391)
> [ 788.568123] smp_apic_timer_interrupt (arch/x86/kernel/apic/apic.c:918)
> [ 788.568123] apic_timer_interrupt (arch/x86/kernel/entry_64.S:920)
> [ 788.568123] <EOI> ? mark_held_locks (kernel/locking/lockdep.c:2525)
> [ 788.568123] finish_task_switch (kernel/sched/core.c:2231)
> [ 788.568123] __schedule (kernel/sched/core.c:2337 kernel/sched/core.c:2795)
> [ 788.568123] schedule (./arch/x86/include/asm/bitops.h:311 (discriminator 1) kernel/sched/core.c:2824 (discriminator 1))
> [ 788.568123] schedule_preempt_disabled (kernel/sched/core.c:2856)
> [ 788.568123] mutex_lock_nested (kernel/locking/mutex.c:585 kernel/locking/mutex.c:623)
> [ 788.568123] kernfs_iop_permission (fs/kernfs/inode.c:366)
> [ 788.568123] __inode_permission (fs/namei.c:374 fs/namei.c:408)
> [ 788.568123] inode_permission (fs/namei.c:460)
> [ 788.568123] link_path_walk (fs/namei.c:1520 fs/namei.c:1782)
> [ 788.568123] path_init (fs/namei.c:1947)
> [ 788.568123] path_lookupat (fs/namei.c:1989)
> [ 788.568123] filename_lookup (fs/namei.c:2025)
> [ 788.568123] user_path_at_empty (fs/namei.c:2189)
> [ 788.568123] user_path_at (fs/namei.c:2200)
> [ 788.568123] vfs_fstatat (fs/stat.c:106)
> [ 788.568123] SYSC_newfstatat (fs/stat.c:298)
> [ 788.568123] SyS_newfstatat (fs/stat.c:291)
> [ 788.568123] tracesys_phase2 (arch/x86/kernel/entry_64.S:347)
OK, I guess it would help to update the WARN_ON()s while I am at it.  :-/

Here is an updated patch that replaces the one resulting in the above
splat.

							Thanx, Paul

------------------------------------------------------------------------

rcu: Associate quiescent-state reports with grace period

As noted in earlier commit logs, CPU hotplug operations running
concurrently with grace-period initialization can result in a given
leaf rcu_node structure having all CPUs offline and no blocked readers,
but with this rcu_node structure nevertheless blocking the current
grace period.  Therefore, the quiescent-state forcing code now checks
for this situation and repairs it.

Unfortunately, this checking can result in false positives, for example,
when the last task has just removed itself from this leaf rcu_node
structure, but has not yet started clearing the ->qsmask bits further
up the structure.  This means that the grace-period kthread (which
forces quiescent states) and some other task might be attempting to
concurrently clear these ->qsmask bits.  This is usually not a problem:
One of these tasks will be the first to acquire the upper-level rcu_node
structure's lock and will therefore clear the bit, and the other task,
seeing the bit already cleared, will stop trying to clear bits.

Sadly, this means that the following unusual sequence of events -can-
result in a problem:

1.	The grace-period kthread wins, and clears the ->qsmask bits.

2.	This is the last thing blocking the current grace period, so
	that the grace-period kthread clears ->qsmask bits all the way
	to the root and finds that the root ->qsmask field is now zero.

3.	Another grace period is required, so that the grace-period
	kthread initializes it, including setting all the needed
	qsmask bits.

4.	The leaf rcu_node structure (the one that started this whole
	mess) is blocking this new grace period, either because it has
	at least one online CPU or because there is at least one task
	that had blocked within an RCU read-side critical section while
	running on one of this leaf rcu_node structure's CPUs.  (And
	yes, that CPU might well have gone offline before the grace
	period in step (3) above started, which can mean that there is
	a task on the leaf rcu_node structure's ->blkd_tasks list, but
	->qsmask equal to zero.)

5.	The other kthread didn't get around to trying to clear the
	upper-level ->qsmask bits until all the above had happened.
	This means that it now sees bits set in the upper-level ->qsmask
	field, so it proceeds to clear them.  Too bad that it is doing
	so on behalf of a quiescent state that does not apply to the
	current grace period!

This sequence of events can result in the new grace period being too
short.  It can also result in the new grace period ending before the
leaf rcu_node structure's ->qsmask bits have been cleared, which will
result in splats during initialization of the next grace period.  In
addition, it can result in tasks blocking the new grace period still
being queued at the start of the next grace period, which will result
in other splats.  Sasha's testing turned up another of these splats,
as did rcutorture testing.  (And yes, rcutorture is being adjusted to
make these splats show up more quickly.  Which probably is having the
undesirable side effect of making other problems show up less quickly.
Can't have everything!)

Reported-by: Sasha Levin <sasha.le...@oracle.com>
Signed-off-by: Paul E. McKenney <paul...@linux.vnet.ibm.com>

diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index 664b052f82a9..e8699fb94ed2 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -2156,25 +2156,32 @@ static void rcu_report_qs_rsp(struct rcu_state *rsp, unsigned long flags)
  * Similar to rcu_report_qs_rdp(), for which it is a helper function.
  * Allows quiescent states for a group of CPUs to be reported at one go
  * to the specified rcu_node structure, though all the CPUs in the group
- * must be represented by the same rcu_node structure (which need not be
- * a leaf rcu_node structure, though it often will be).  That structure's
- * lock must be held upon entry, and it is released before return.
+ * must be represented by the same rcu_node structure (which need not be a
+ * leaf rcu_node structure, though it often will be).  The gps parameter
+ * is the grace-period snapshot, which means that the quiescent states
+ * are valid only if rnp->gpnum is equal to gps.  That structure's lock
+ * must be held upon entry, and it is released before return.
  */
 static void
 rcu_report_qs_rnp(unsigned long mask, struct rcu_state *rsp,
-		  struct rcu_node *rnp, unsigned long flags)
+		  struct rcu_node *rnp, unsigned long gps, unsigned long flags)
 	__releases(rnp->lock)
 {
+	unsigned long oldmask = 0;
 	struct rcu_node *rnp_c;

 	/* Walk up the rcu_node hierarchy. */
 	for (;;) {
-		if (!(rnp->qsmask & mask)) {
+		if (!(rnp->qsmask & mask) || rnp->gpnum != gps) {

-			/* Our bit has already been cleared, so done. */
+			/*
+			 * Our bit has already been cleared, or the
+			 * relevant grace period is already over, so done.
+			 */
 			raw_spin_unlock_irqrestore(&rnp->lock, flags);
 			return;
 		}
+		WARN_ON_ONCE(oldmask); /* Any child must be all zeroed! */
 		rnp->qsmask &= ~mask;
 		trace_rcu_quiescent_state_report(rsp->name, rnp->gpnum,
 						 mask, rnp->qsmask, rnp->level,
@@ -2198,7 +2205,7 @@ rcu_report_qs_rnp(unsigned long mask, struct rcu_state *rsp,
 		rnp = rnp->parent;
 		raw_spin_lock_irqsave(&rnp->lock, flags);
 		smp_mb__after_unlock_lock();
-		WARN_ON_ONCE(rnp_c->qsmask);
+		oldmask = rnp_c->qsmask;
 	}

 	/*
@@ -2220,6 +2227,7 @@ static void rcu_report_unblock_qs_rnp(struct rcu_state *rsp,
 				      struct rcu_node *rnp, unsigned long flags)
 	__releases(rnp->lock)
 {
+	unsigned long gps;
 	unsigned long mask;
 	struct rcu_node *rnp_p;

@@ -2239,12 +2247,13 @@ static void rcu_report_unblock_qs_rnp(struct rcu_state *rsp,
 		return;
 	}

-	/* Report up the rest of the hierarchy. */
+	/* Report up the rest of the hierarchy, tracking current ->gpnum. */
+	gps = rnp->gpnum;
 	mask = rnp->grpmask;
 	raw_spin_unlock(&rnp->lock);	/* irqs remain disabled. */
 	raw_spin_lock(&rnp_p->lock);	/* irqs already disabled. */
 	smp_mb__after_unlock_lock();
-	rcu_report_qs_rnp(mask, rsp, rnp_p, flags);
+	rcu_report_qs_rnp(mask, rsp, rnp_p, gps, flags);
 }

 /*
@@ -2295,7 +2304,8 @@ rcu_report_qs_rdp(int cpu, struct rcu_state *rsp, struct rcu_data *rdp)
 		 */
 		needwake = rcu_accelerate_cbs(rsp, rnp, rdp);

-		rcu_report_qs_rnp(mask, rsp, rnp, flags); /* rlses rnp->lock */
+		rcu_report_qs_rnp(mask, rsp, rnp, rnp->gpnum, flags);
+		/* ^^^ Released rnp->lock */
 		if (needwake)
 			rcu_gp_kthread_wake(rsp);
 	}
@@ -2754,8 +2764,8 @@ static void force_qs_rnp(struct rcu_state *rsp,
 			}
 		}
 		if (mask != 0) {
-			/* Idle/offline CPUs, report. */
-			rcu_report_qs_rnp(mask, rsp, rnp, flags);
+			/* Idle/offline CPUs, report (releases rnp->lock. */
+			rcu_report_qs_rnp(mask, rsp, rnp, rnp->gpnum, flags);
 		} else {
 			/* Nothing to do here, so just drop the lock. */
 			raw_spin_unlock_irqrestore(&rnp->lock, flags);
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/