+Cc davidlohr and waiman

On Thu, Nov 29, 2018 at 08:50:30PM +0800, Yongji Xie wrote:
> From: Xie Yongji <xieyon...@baidu.com>
> 
> Our system encountered a problem recently, the khungtaskd detected
> some process hang on mmap_sem. But the odd thing was that one task which
> is not on mmap_sem.wait_list still sleeps in rwsem_down_read_failed().
> Through code inspection, we found a potential bug can lead to this.
> 
> Imagine this:
> 
> Thread 1                                  Thread 2
>                                           down_write();
> rwsem_down_read_failed()
>  raw_spin_lock_irq(&sem->wait_lock);
>  list_add_tail(&waiter.list, &wait_list);
>  raw_spin_unlock_irq(&sem->wait_lock);
>                                           __up_write();
>                                            rwsem_wake();
>                                             __rwsem_mark_wake();
>                                              wake_q_add();
>                                              list_del(&waiter->list);
>                                              waiter->task = NULL;
>  while (true) {
>   set_current_state(TASK_UNINTERRUPTIBLE);
>   if (!waiter.task) // true
>       break;
>  }
>  __set_current_state(TASK_RUNNING);
> 
> Now Thread 1 is queued in Thread 2's wake_q without sleeping. Then
> Thread 1 calls rwsem_down_read_failed() again because Thread 3
> holds the lock. If Thread 3 tries to queue Thread 1 before Thread 2
> does the wakeup, wake_q_add() will fail and the wakeup is missed:
> 
> Thread 1                                  Thread 2      Thread 3
>                                                         down_write();
> rwsem_down_read_failed()
>  raw_spin_lock_irq(&sem->wait_lock);
>  list_add_tail(&waiter.list, &wait_list);
>  raw_spin_unlock_irq(&sem->wait_lock);
>                                                         __rwsem_mark_wake();
>                                                          wake_q_add();
>                                           wake_up_q();
>                                                          waiter->task = NULL;
>  while (true) {
>   set_current_state(TASK_UNINTERRUPTIBLE);
>   if (!waiter.task) // false
>       break;
>   schedule();
>  }
>                                                         wake_up_q(&wake_q);
> 
> In other words, we might issue the wakeup before setting the reader
> waiter to nil. If so, the wakeup may do nothing because it was called
> before the reader set its task state to TASK_UNINTERRUPTIBLE. Then we
> would have no chance to wake up the reader any more, and other writers
> such as the "ps" command would get stuck on it.
> 
> This patch has not been verified because we still have no way to
> reproduce the problem. But I'd like to ask the community for comments
> first.

Urgh; so the case where the cmpxchg() fails because it already has a
wakeup in progress, which then 'violates' our expectation of when the
wakeup happens.
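
To make that failure mode concrete, here is a minimal user-space sketch,
not kernel code. It assumes only that wake_q_add() claims the task with a
cmpxchg that succeeds for the first queuer, and that wake_up_q() releases
the claim when it issues the wakeup. The names model_task,
model_wake_q_add() and model_wake_up_q(), and the integer queue ids, are
hypothetical and exist only for illustration.

```c
#include <stdatomic.h>
#include <stdbool.h>

/*
 * Hypothetical model of a task's wake_q membership.
 * queued_on == 0 means the task is not on any wake_q; any other
 * value is the (made-up) id of the wake_q that currently owns it.
 */
struct model_task {
	atomic_int queued_on;
};

/*
 * Models the cmpxchg in wake_q_add(): only the first queuer wins.
 * Returns false when another wake_q already holds the task, in which
 * case the caller's own wake_up_q() will NOT wake this task.
 */
static bool model_wake_q_add(struct model_task *t, int queue_id)
{
	int expected = 0;

	return atomic_compare_exchange_strong(&t->queued_on, &expected,
					      queue_id);
}

/*
 * Models wake_up_q() for a single task: the wakeup is issued only by
 * the wake_q that owns the task, at a time of that owner's choosing.
 * Returns true if this queue issued the wakeup and released the task.
 */
static bool model_wake_up_q(struct model_task *t, int queue_id)
{
	int expected = queue_id;

	return atomic_compare_exchange_strong(&t->queued_on, &expected, 0);
}
```

With Thread 2 as queue 2 and Thread 3 as queue 3, model_wake_q_add(&t, 3)
returns false while queue 2 still holds the task, so the only wakeup is
the one Thread 2 issues, whenever that happens relative to Thread 3
clearing waiter->task.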

Yes, I think this is real, and worse, I think we need to go audit all
wake_q_add() users and document this behaviour.

In the ideal case we'd delay the actual wakeup to the last wake_up_q(),
but I don't think we can easily fix that.

> Signed-off-by: Xie Yongji <xieyon...@baidu.com>
> Signed-off-by: Zhang Yu <zhangy...@baidu.com>
> ---
>  kernel/locking/rwsem-xadd.c | 11 +++++++++--
>  1 file changed, 9 insertions(+), 2 deletions(-)
> 
> diff --git a/kernel/locking/rwsem-xadd.c b/kernel/locking/rwsem-xadd.c
> index 09b1800..50d9af6 100644
> --- a/kernel/locking/rwsem-xadd.c
> +++ b/kernel/locking/rwsem-xadd.c
> @@ -198,15 +198,22 @@ static void __rwsem_mark_wake(struct rw_semaphore *sem,
>               woken++;
>               tsk = waiter->task;
>  
> -             wake_q_add(wake_q, tsk);
> +             get_task_struct(tsk);
>               list_del(&waiter->list);
>               /*
> -              * Ensure that the last operation is setting the reader
> +              * Ensure calling get_task_struct() before setting the reader
>                * waiter to nil such that rwsem_down_read_failed() cannot
>                * race with do_exit() by always holding a reference count
>                * to the task to wakeup.
>                */
>               smp_store_release(&waiter->task, NULL);
> +             /*
> +              * Ensure issuing the wakeup (either by us or someone else)
> +              * after setting the reader waiter to nil.
> +              */
> +             wake_q_add(wake_q, tsk);
> +             /* wake_q_add() already takes the task ref */
> +             put_task_struct(tsk);
>       }
>  
>       adjustment = woken * RWSEM_ACTIVE_READ_BIAS - adjustment;
> -- 
> 2.2.3
> 
