On Tue, 30 Aug 2016 18:49:37 +1000 Balbir Singh <bsinghar...@gmail.com> wrote:
> The origin of the issue I've seen seems to be related to
> rwsem spin lock stealing. Basically I see the system deadlock'd in the
> following state

BTW, this is not really to do with rwsems, but is purely a scheduler bug.
It was seen in other callers too that follow a similar pattern, and even
when those don't follow the exact pattern that leads to deadlock, it can
cause a long busywait in the waker until the wakee next sleeps or gets
preempted.

> I have a system with multiple threads and
>
> Most of the threads are stuck doing
>
> [67272.593915] --- interrupt: e81 at _raw_spin_lock_irqsave+0xa4/0x130
> [67272.593915] LR = _raw_spin_lock_irqsave+0x9c/0x130
> [67272.700996] [c000000012857ae0] [c00000000012453c] rwsem_wake+0xcc/0x110
> [67272.749283] [c000000012857b20] [c0000000001215d8] up_write+0x78/0x90
> [67272.788965] [c000000012857b50] [c00000000028153c] unlink_anon_vmas+0x15c/0x2c0
> [67272.798782] [c000000012857bc0] [c00000000026f5c0] free_pgtables+0xf0/0x1c0
> [67272.842528] [c000000012857c10] [c00000000027c9a0] exit_mmap+0x100/0x1a0
> [67272.872947] [c000000012857cd0] [c0000000000b4a98] mmput+0xa8/0x1b0
> [67272.898432] [c000000012857d00] [c0000000000bc50c] do_exit+0x33c/0xc30
> [67272.944721] [c000000012857dc0] [c0000000000bcee4] do_group_exit+0x64/0x100
> [67272.969014] [c000000012857e00] [c0000000000bcfac] SyS_exit_group+0x2c/0x30
> [67272.978971] [c000000012857e30] [c000000000009204] system_call+0x38/0xb4
> [67272.999016] Instruction dump:
>
> They are spinning on the sem->wait_lock, while the holder of sem->wait_lock
> has IRQs disabled and is doing
>
> [c00000037930fb30] c0000000000f724c try_to_wake_up+0x6c/0x570
> [c00000037930fbb0] c000000000124328 __rwsem_do_wake+0x1f8/0x260
> [c00000037930fc00] c0000000001244b4 rwsem_wake+0x84/0x110
> [c00000037930fc40] c000000000121598 up_write+0x78/0x90
> [c00000037930fc70] c000000000281a54 anon_vma_fork+0x184/0x1d0
> [c00000037930fcc0] c0000000000b68e0 copy_process.isra.5+0x14c0/0x1870
> [c00000037930fda0] c0000000000b6e68 _do_fork+0xa8/0x4b0
> [c00000037930fe30] c000000000009460 ppc_clone+0x8/0xc
>
> The offset of try_to_wake_up is actually misleading; it is actually stuck
> doing the following in try_to_wake_up:
>
>   while (p->on_cpu)
>     cpu_relax();
>
> Analysis
>
> The issue is triggered due to the following race
>
> CPU1                                     CPU2
>
> while () {
>   if (cond)
>     break;
>   do {
>     schedule();
>     set_current_state(TASK_UN..)
>   } while (!cond);
>                                          rwsem_wake()
>                                            spin_lock_irqsave(wait_lock)
>   raw_spin_lock_irqsave(wait_lock)         wake_up_process()
> }                                          try_to_wake_up()
> set_current_state(TASK_RUNNING);             ..
> list_del(&waiter.list);
>
> CPU2 wakes up CPU1, but before CPU1 can get the wait_lock and set its
> current state to TASK_RUNNING, the following occurs on CPU3:
>
> CPU3
> (stole the rwsem before the waiter could be woken up from the queue)
> up_write()
>   rwsem_wake()
>     raw_spin_lock_irqsave(wait_lock)
>     if (!list_empty)
>       wake_up_process()
>         try_to_wake_up()
>           raw_spin_lock_irqsave(p->pi_lock)
>           ..
>           if (p->on_rq && ttwu_wakeup())
>           ..
>           while (p->on_cpu)
>             cpu_relax()
>           ..
>
> CPU3 tries to wake up the task on CPU1 again, since it still finds it on
> the wait_queue; CPU1 is spinning on the wait_lock, but CPU3 got it
> immediately after CPU2 released it.
>
> CPU3 checks the state of p on CPU1; it is TASK_UNINTERRUPTIBLE and the
> task is spinning on the wait_lock. Interestingly, since p->on_rq is
> checked under pi_lock, I've noticed that try_to_wake_up() finds
> p->on_rq to be 0.
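To make the ordering concrete, this is the classic message-passing shape,
shown below as a small self-contained userspace analogue using C11 atomics.
To be clear, this is only an illustration: the variable names, the fence
choices and the userspace framing are mine, not kernel code; only the
store/load ordering mirrors the [S]/MB/RMB/[L] diagram in the patch below.
The point is that once the waker has observed the later store (->state ==
TASK_UNINTERRUPTIBLE), only a read barrier before the ->on_rq load
guarantees that it also observes the earlier store; without it, reading a
stale 0 is a perfectly legal outcome on a weakly ordered machine like Power.

/* mp_sketch.c - userspace analogue of the ordering discussed above.
 * "on_rq" and "task_state" are plain stand-ins for the task_struct
 * fields, and the C11 fences stand in for the kernel's MB/RMB.
 * Build with: cc -O2 -pthread mp_sketch.c
 */
#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

#define TASK_RUNNING		0
#define TASK_UNINTERRUPTIBLE	2

static atomic_int on_rq;
static atomic_int task_state = TASK_RUNNING;

/* Wakee side: earlier store (on_rq = 1), full barrier, later store
 * (state = TASK_UNINTERRUPTIBLE) -- the left column of the diagram. */
static void *wakee(void *arg)
{
	(void)arg;
	atomic_store_explicit(&on_rq, 1, memory_order_relaxed);
	atomic_thread_fence(memory_order_seq_cst);		/* "MB" */
	atomic_store_explicit(&task_state, TASK_UNINTERRUPTIBLE,
			      memory_order_relaxed);
	return NULL;
}

/* Waker side: load state, read barrier, load on_rq -- the right column.
 * Drop the fence and "state == TASK_UNINTERRUPTIBLE && on_rq == 0"
 * becomes a permitted outcome, which is exactly the stale value
 * try_to_wake_up() acted on. */
static void *waker(void *arg)
{
	int r_state, r_on_rq;

	(void)arg;
	r_state = atomic_load_explicit(&task_state, memory_order_relaxed);
	atomic_thread_fence(memory_order_acquire);		/* "RMB" */
	r_on_rq = atomic_load_explicit(&on_rq, memory_order_relaxed);

	printf("state=%d on_rq=%d\n", r_state, r_on_rq);
	return NULL;
}

int main(void)
{
	pthread_t t1, t2;

	pthread_create(&t1, NULL, wakee, NULL);
	pthread_create(&t2, NULL, waker, NULL);
	pthread_join(t1, NULL);
	pthread_join(t2, NULL);
	return 0;
}

(A single run obviously proves nothing -- on x86 the reordering will never
be visible -- but the C memory model, like Power, allows it without the
read fence.)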
> This was the most confusing bit of the analysis, but p->on_rq is changed
> under the runqueue lock (rq_lock); the p->on_rq check is not reliable
> without this fix IMHO. The race is visible (based on the analysis) only
> when ttwu_queue() does a remote wakeup via ttwu_queue_remote(), in which
> case the p->on_rq change is not done under the pi_lock.
>
> The result is that after a while the entire system locks up on the
> raw_spin_lock_irqsave(wait_lock) and the holder spins infinitely.
>
> Reproduction of the issue
>
> The issue can be reproduced after a long run on my system with 80
> threads, with available memory tweaked to be very low, running the
> memory stress-ng mmapfork test. It usually takes a long time to
> reproduce. I am trying to work on a test case that can reproduce the
> issue faster, but that's work in progress. I am still testing the
> changes in a loop on my system and the tests seem OK thus far.
>
> Big thanks to Benjamin and Nick for helping debug this as well.
> Ben helped catch the missing barrier, Nick caught every missing
> bit in my theory.
>
> Cc: Peter Zijlstra <pet...@infradead.org>
> Cc: Nicholas Piggin <npig...@gmail.com>
> Cc: Benjamin Herrenschmidt <b...@kernel.crashing.org>
>
> Signed-off-by: Balbir Singh <bsinghar...@gmail.com>
> ---
>  kernel/sched/core.c | 11 +++++++++++
>  1 file changed, 11 insertions(+)
>
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 2a906f2..582c684 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -2016,6 +2016,17 @@ try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags)
>  	success = 1; /* we're going to change ->state */
>  	cpu = task_cpu(p);
>  
> +	/*
> +	 * Ensure we see on_rq and p->state consistently
> +	 *
> +	 * For example in __rwsem_down_write_failed(), we have
> +	 *	[S] ->on_rq = 1                         [L] ->state
> +	 *	    MB                                      RMB
> +	 *	[S] ->state = TASK_UNINTERRUPTIBLE      [L] ->on_rq
> +	 * In the absence of the RMB p->on_rq can be observed to be 0
> +	 * and we end up spinning indefinitely in while (p->on_cpu)
> +	 */
> +	smp_rmb();
>  	if (p->on_rq && ttwu_remote(p, wake_flags))
>  		goto stat;
>
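On the reproduction side, and only as an illustration of the shape of the
workload (this is not the work-in-progress test case mentioned above): the
two stack traces are the fork path (anon_vma_fork()) and the exit path
(unlink_anon_vmas()), so what the stress-ng mmapfork run effectively does
is have many tasks fork children that touch anonymous memory and exit,
under memory pressure. A rough self-contained userspace sketch of that
shape follows; the thread count and mapping size are arbitrary placeholders.

/* forkchurn.c - rough sketch of the fork/mmap/exit churn described in the
 * changelog above. Children map and dirty anonymous memory and then exit,
 * so the parent side exercises anon_vma_fork() and the child side
 * exercises exit_mmap()/unlink_anon_vmas(). Run it under memory pressure;
 * it loops until interrupted.
 * Build with: cc -O2 -pthread forkchurn.c
 */
#include <pthread.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#define NTHREADS	80		/* mirrors the 80-thread system above */
#define MAP_SZ		(4UL << 20)	/* 4 MB of anon memory per child */

static void *churn(void *arg)
{
	(void)arg;
	for (;;) {
		pid_t pid = fork();

		if (pid == 0) {
			/* Child: map and dirty some anon memory, then exit;
			 * exit_mmap()/unlink_anon_vmas() run on the way out. */
			char *p = mmap(NULL, MAP_SZ, PROT_READ | PROT_WRITE,
				       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
			if (p != MAP_FAILED)
				memset(p, 1, MAP_SZ);
			_exit(0);
		}
		if (pid > 0)
			waitpid(pid, NULL, 0);
	}
	return NULL;
}

int main(void)
{
	pthread_t tid[NTHREADS];
	int i;

	for (i = 0; i < NTHREADS; i++)
		pthread_create(&tid[i], NULL, churn, NULL);
	for (i = 0; i < NTHREADS; i++)
		pthread_join(tid[i], NULL);
	return 0;
}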