On Tue, 30 Aug 2016 18:49:37 +1000 Balbir Singh <bsinghar...@gmail.com> wrote:
> The origin of the issue I've seen seems to be related to
> rwsem spin lock stealing. Basically I see the system deadlock'd in the
> following state

BTW, this is not really to do with rwsems, but is purely a scheduler bug.
It was seen in other callers too that follow a similar pattern, and even
when those don't follow the exact pattern that leads to deadlock, it can
cause a long busywait in the waker until the wakee next sleeps or gets
preempted.

> I have a system with multiple threads and
>
> Most of the threads are stuck doing
>
> [67272.593915] --- interrupt: e81 at _raw_spin_lock_irqsave+0xa4/0x130
> [67272.593915] LR = _raw_spin_lock_irqsave+0x9c/0x130
> [67272.700996] [c000000012857ae0] [c00000000012453c] rwsem_wake+0xcc/0x110
> [67272.749283] [c000000012857b20] [c0000000001215d8] up_write+0x78/0x90
> [67272.788965] [c000000012857b50] [c00000000028153c] unlink_anon_vmas+0x15c/0x2c0
> [67272.798782] [c000000012857bc0] [c00000000026f5c0] free_pgtables+0xf0/0x1c0
> [67272.842528] [c000000012857c10] [c00000000027c9a0] exit_mmap+0x100/0x1a0
> [67272.872947] [c000000012857cd0] [c0000000000b4a98] mmput+0xa8/0x1b0
> [67272.898432] [c000000012857d00] [c0000000000bc50c] do_exit+0x33c/0xc30
> [67272.944721] [c000000012857dc0] [c0000000000bcee4] do_group_exit+0x64/0x100
> [67272.969014] [c000000012857e00] [c0000000000bcfac] SyS_exit_group+0x2c/0x30
> [67272.978971] [c000000012857e30] [c000000000009204] system_call+0x38/0xb4
> [67272.999016] Instruction dump:
>
> They are spinning on the sem->wait_lock, while the holder of sem->wait_lock
> has IRQs disabled and is doing
>
> [c00000037930fb30] c0000000000f724c try_to_wake_up+0x6c/0x570
> [c00000037930fbb0] c000000000124328 __rwsem_do_wake+0x1f8/0x260
> [c00000037930fc00] c0000000001244b4 rwsem_wake+0x84/0x110
> [c00000037930fc40] c000000000121598 up_write+0x78/0x90
> [c00000037930fc70] c000000000281a54 anon_vma_fork+0x184/0x1d0
> [c00000037930fcc0] c0000000000b68e0 copy_process.isra.5+0x14c0/0x1870
> [c00000037930fda0] c0000000000b6e68 _do_fork+0xa8/0x4b0
> [c00000037930fe30] c000000000009460 ppc_clone+0x8/0xc
>
> The offset of try_to_wake_up is actually misleading; it is actually stuck
> doing the following in try_to_wake_up:
>
>   while (p->on_cpu)
>     cpu_relax();
>
> Analysis
>
> The issue is triggered due to the following race
>
> CPU1                                     CPU2
>
> while () {
>   if (cond)
>     break;
>   do {
>     schedule();
>     set_current_state(TASK_UN..)
>   } while (!cond);
>                                          rwsem_wake()
>                                            spin_lock_irqsave(wait_lock)
>   raw_spin_lock_irqsave(wait_lock)         wake_up_process()
> }                                          try_to_wake_up()
> set_current_state(TASK_RUNNING);             ..
> list_del(&waiter.list);
>
> CPU2 wakes up CPU1, but before CPU1 can get the wait_lock and set its
> current state to TASK_RUNNING, the following occurs on CPU3:
>
> CPU3
> (stole the rwsem before the waiter could be woken up from the queue)
> up_write()
>   rwsem_wake()
>     raw_spin_lock_irqsave(wait_lock)
>     if (!list_empty)
>       wake_up_process()
>         try_to_wake_up()
>           raw_spin_lock_irqsave(p->pi_lock)
>           ..
>           if (p->on_rq && ttwu_wakeup())
>           ..
>           while (p->on_cpu)
>             cpu_relax()
>           ..
>
> CPU3 tries to wake up the task on CPU1 again, since it still finds it on
> the wait_queue; CPU1 is spinning on the wait_lock, but CPU3 got it
> immediately after CPU2 released it.
>
> CPU3 checks the state of p on CPU1; it is TASK_UNINTERRUPTIBLE and the
> task is spinning on the wait_lock. Interestingly, since p->on_rq is
> checked under pi_lock, I've noticed that try_to_wake_up() finds
> p->on_rq to be 0.
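To make the ordering concrete, this is the classic message-passing shape,
shown below as a small self-contained userspace analogue using C11 atomics.
To be clear, this is only an illustration: the variable names, the fence
choices and the userspace framing are mine, not kernel code; only the
store/load ordering mirrors the [S]/MB/RMB/[L] diagram in the patch below.
The point is that once the waker has observed the later store (->state ==
TASK_UNINTERRUPTIBLE), only a read barrier before the ->on_rq load
guarantees that it also observes the earlier store; without it, reading a
stale 0 is a perfectly legal outcome on a weakly ordered machine like Power.

/* mp_sketch.c - userspace analogue of the ordering discussed above.
 * "on_rq" and "task_state" are plain stand-ins for the task_struct
 * fields, and the C11 fences stand in for the kernel's MB/RMB.
 * Build with: cc -O2 -pthread mp_sketch.c
 */
#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

#define TASK_RUNNING		0
#define TASK_UNINTERRUPTIBLE	2

static atomic_int on_rq;
static atomic_int task_state = TASK_RUNNING;

/* Wakee side: earlier store (on_rq = 1), full barrier, later store
 * (state = TASK_UNINTERRUPTIBLE) -- the left column of the diagram. */
static void *wakee(void *arg)
{
	(void)arg;
	atomic_store_explicit(&on_rq, 1, memory_order_relaxed);
	atomic_thread_fence(memory_order_seq_cst);		/* "MB" */
	atomic_store_explicit(&task_state, TASK_UNINTERRUPTIBLE,
			      memory_order_relaxed);
	return NULL;
}

/* Waker side: load state, read barrier, load on_rq -- the right column.
 * Drop the fence and "state == TASK_UNINTERRUPTIBLE && on_rq == 0"
 * becomes a permitted outcome, which is exactly the stale value
 * try_to_wake_up() acted on. */
static void *waker(void *arg)
{
	int r_state, r_on_rq;

	(void)arg;
	r_state = atomic_load_explicit(&task_state, memory_order_relaxed);
	atomic_thread_fence(memory_order_acquire);		/* "RMB" */
	r_on_rq = atomic_load_explicit(&on_rq, memory_order_relaxed);

	printf("state=%d on_rq=%d\n", r_state, r_on_rq);
	return NULL;
}

int main(void)
{
	pthread_t t1, t2;

	pthread_create(&t1, NULL, wakee, NULL);
	pthread_create(&t2, NULL, waker, NULL);
	pthread_join(t1, NULL);
	pthread_join(t2, NULL);
	return 0;
}

(A single run obviously proves nothing -- on x86 the reordering will never
be visible -- but the C memory model, like Power, allows it without the
read fence.)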
> This was the most confusing bit of the analysis, but p->on_rq is changed
> under the runqueue lock (rq_lock); the p->on_rq check is not reliable
> without this fix IMHO. The race is visible (based on the analysis) only
> when ttwu_queue() does a remote wakeup via ttwu_queue_remote(), in which
> case the p->on_rq change is not done under the pi_lock.
>
> The result is that after a while the entire system locks up on the
> raw_spin_lock_irqsave(wait_lock) and the holder spins infinitely.
>
> Reproduction of the issue
>
> The issue can be reproduced after a long run on my system with 80
> threads, with available memory tweaked to be very low, running the
> memory stress-ng mmapfork test. It usually takes a long time to
> reproduce. I am trying to work on a test case that can reproduce the
> issue faster, but that's work in progress. I am still testing the
> changes in a loop on my system and the tests seem OK thus far.
>
> Big thanks to Benjamin and Nick for helping debug this as well.
> Ben helped catch the missing barrier, Nick caught every missing
> bit in my theory.
>
> Cc: Peter Zijlstra <pet...@infradead.org>
> Cc: Nicholas Piggin <npig...@gmail.com>
> Cc: Benjamin Herrenschmidt <b...@kernel.crashing.org>
>
> Signed-off-by: Balbir Singh <bsinghar...@gmail.com>
> ---
>  kernel/sched/core.c | 11 +++++++++++
>  1 file changed, 11 insertions(+)
>
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 2a906f2..582c684 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -2016,6 +2016,17 @@ try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags)
>  	success = 1; /* we're going to change ->state */
>  	cpu = task_cpu(p);
>  
> +	/*
> +	 * Ensure we see on_rq and p->state consistently
> +	 *
> +	 * For example in __rwsem_down_write_failed(), we have
> +	 *	[S] ->on_rq = 1                         [L] ->state
> +	 *	    MB                                      RMB
> +	 *	[S] ->state = TASK_UNINTERRUPTIBLE      [L] ->on_rq
> +	 * In the absence of the RMB p->on_rq can be observed to be 0
> +	 * and we end up spinning indefinitely in while (p->on_cpu)
> +	 */
> +	smp_rmb();
>  	if (p->on_rq && ttwu_remote(p, wake_flags))
>  		goto stat;
>
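On the reproduction side, and only as an illustration of the shape of the
workload (this is not the work-in-progress test case mentioned above): the
two stack traces are the fork path (anon_vma_fork()) and the exit path
(unlink_anon_vmas()), so what the stress-ng mmapfork run effectively does
is have many tasks fork children that touch anonymous memory and exit,
under memory pressure. A rough self-contained userspace sketch of that
shape follows; the thread count and mapping size are arbitrary placeholders.

/* forkchurn.c - rough sketch of the fork/mmap/exit churn described in the
 * changelog above. Children map and dirty anonymous memory and then exit,
 * so the parent side exercises anon_vma_fork() and the child side
 * exercises exit_mmap()/unlink_anon_vmas(). Run it under memory pressure;
 * it loops until interrupted.
 * Build with: cc -O2 -pthread forkchurn.c
 */
#include <pthread.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#define NTHREADS	80		/* mirrors the 80-thread system above */
#define MAP_SZ		(4UL << 20)	/* 4 MB of anon memory per child */

static void *churn(void *arg)
{
	(void)arg;
	for (;;) {
		pid_t pid = fork();

		if (pid == 0) {
			/* Child: map and dirty some anon memory, then exit;
			 * exit_mmap()/unlink_anon_vmas() run on the way out. */
			char *p = mmap(NULL, MAP_SZ, PROT_READ | PROT_WRITE,
				       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
			if (p != MAP_FAILED)
				memset(p, 1, MAP_SZ);
			_exit(0);
		}
		if (pid > 0)
			waitpid(pid, NULL, 0);
	}
	return NULL;
}

int main(void)
{
	pthread_t tid[NTHREADS];
	int i;

	for (i = 0; i < NTHREADS; i++)
		pthread_create(&tid[i], NULL, churn, NULL);
	for (i = 0; i < NTHREADS; i++)
		pthread_join(tid[i], NULL);
	return 0;
}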