On Mon 13-06-16 13:19:43, Michal Hocko wrote:
[...]
> I am trying to remember why we are disabling oom killer before kernel
> threads are frozen but not really sure about that right away.

OK, I guess I remember now. Say that a task would depend on a freezable
kernel thread to get to do_exit (stuck in wait_event etc...). We would
simply get stuck in oom_killer_disable for ever. So we need to address
it a different way.

One way would be what you are proposing but I guess it would be more
systematic to never call exit_oom_victim on a remote task.  After [1] we
have a solid foundation to rely only on MMF_REAPED even when TIF_MEMDIE
is set. It is more code than your patch so I can see a reason to go with
yours if the following one seems too large or ugly.

[1] http://lkml.kernel.org/r/1466426628-15074-1-git-send-email-mho...@kernel.org

What do you think about the following?
---
>From f7ffecf944648f16d98fc70275041433ed68f7e4 Mon Sep 17 00:00:00 2001
From: Michal Hocko <mho...@suse.com>
Date: Mon, 20 Jun 2016 17:28:18 +0200
Subject: [PATCH] oom, suspend: fix oom_reaper vs. oom_killer_disable race

Tetsuo has reported the following potential oom_killer_disable vs.
oom_reaper race:

(1) freeze_processes() starts freezing user space threads.
(2) Somebody (maybe a kenrel thread) calls out_of_memory().
(3) The OOM killer calls mark_oom_victim() on a user space thread
    P1 which is already in __refrigerator().
(4) oom_killer_disable() sets oom_killer_disabled = true.
(5) P1 leaves __refrigerator() and enters do_exit().
(6) The OOM reaper calls exit_oom_victim(P1) before P1 can call
    exit_oom_victim(P1).
(7) oom_killer_disable() returns while P1 not yet finished
(8) P1 perform IO/interfere with the freezer.

This patch effectivelly reverts 449d777d7ad6 ("mm, oom_reaper: clear
TIF_MEMDIE for all tasks queued for oom_reaper"). We have a more
efficient way to achieve the same. We can rely on MMF_REAPED because
"mm, oom: hide mm which is shared with kthread or global init" made sure
that we skip tasks with MMF_REAPED even when they have TIF_MEMDIE.

The only missing part is that we have to handle __oom_reap_task failure
gracefully. Currently we mark the mm by MMF_OOM_NOT_REAPABLE to later
set MMF_REAPED should we fail __oom_reap_task again. This relies on
clearing TIF_MEMDIE to select that task again. We can achieve a similar
thing by requeuing the task to the oom_reaper_list. Introduce
oom_reaper_add_tail to add the task to the tail of the oom_reaper_list.
While we are at it reuse it for wake_oom_reaper as well because the
queueing should be FIFO.

In the result only the oom victim will be responsible to clear its
TIF_MEMDIE flag so the above race is not possible anymore. oom_reaper
will be allowed to play only with MMF_* flags so the responsibility
model will be more clear.

Fixes: 449d777d7ad6 ("mm, oom_reaper: clear TIF_MEMDIE for all tasks queued for 
oom_reaper")
Reported-by: Tetsuo Handa <penguin-ker...@i-love.sakura.ne.jp>
Signed-off-by: Michal Hocko <mho...@suse.com>
---
 mm/oom_kill.c | 86 ++++++++++++++++++++++++++++++++++++++---------------------
 1 file changed, 55 insertions(+), 31 deletions(-)

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 4c21f744daa6..dbc62a69fee8 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -558,6 +558,51 @@ static bool __oom_reap_task(struct task_struct *tsk)
        return ret;
 }
 
+static void oom_reaper_add_tail(struct task_struct *tsk)
+{
+       struct task_struct *q;
+
+       spin_lock(&oom_reaper_lock);
+       if (!oom_reaper_list) {
+               oom_reaper_list = tsk;
+               goto unlock;
+       }
+
+       for (q = oom_reaper_list; q->oom_reaper_list; q = q->oom_reaper_list)
+               ;
+       q->oom_reaper_list = tsk;
+
+unlock:
+       spin_unlock(&oom_reaper_lock);
+}
+
+static inline bool requeue_oom_victim(struct task_struct *tsk)
+{
+       struct task_struct *p;
+       bool ret;
+
+       /*
+        * If we've already tried to reap this task in the past and
+        * failed it probably doesn't make much sense to try yet again
+        * so hide the mm from the oom killer so that it can move on
+        * to another task with a different mm struct.
+        */
+       p = find_lock_task_mm(tsk);
+       if (!p)
+               return false;
+
+       if (test_and_set_bit(MMF_OOM_NOT_REAPABLE, &p->mm->flags))
+               set_bit(MMF_OOM_REAPED, &p->mm->flags);
+       else
+               ret = true;
+
+       task_unlock(p);
+
+       if (ret)
+               oom_reaper_add_tail(tsk);
+       return ret;
+}
+
 #define MAX_OOM_REAP_RETRIES 10
 static void oom_reap_task(struct task_struct *tsk)
 {
@@ -567,40 +612,23 @@ static void oom_reap_task(struct task_struct *tsk)
        while (attempts++ < MAX_OOM_REAP_RETRIES && !__oom_reap_task(tsk))
                schedule_timeout_idle(HZ/10);
 
-       if (attempts > MAX_OOM_REAP_RETRIES) {
-               struct task_struct *p;
+       tsk->oom_reaper_list = NULL;
 
+       if (attempts > MAX_OOM_REAP_RETRIES) {
                pr_info("oom_reaper: unable to reap pid:%d (%s)\n",
                                task_pid_nr(tsk), tsk->comm);
 
-               /*
-                * If we've already tried to reap this task in the past and
-                * failed it probably doesn't make much sense to try yet again
-                * so hide the mm from the oom killer so that it can move on
-                * to another task with a different mm struct.
-                */
-               p = find_lock_task_mm(tsk);
-               if (p) {
-                       if (test_and_set_bit(MMF_OOM_NOT_REAPABLE, 
&p->mm->flags)) {
-                               pr_info("oom_reaper: giving up pid:%d (%s)\n",
-                                               task_pid_nr(tsk), tsk->comm);
-                               set_bit(MMF_OOM_REAPED, &p->mm->flags);
-                       }
-                       task_unlock(p);
+               debug_show_all_locks();
+
+               if (requeue_oom_victim(tsk)) {
+                       cond_resched();
+                       return;
                }
 
-               debug_show_all_locks();
+               pr_info("oom_reaper: giving up pid:%d (%s)\n",
+                               task_pid_nr(tsk), tsk->comm);
        }
 
-       /*
-        * Clear TIF_MEMDIE because the task shouldn't be sitting on a
-        * reasonably reclaimable memory anymore or it is not a good candidate
-        * for the oom victim right now because it cannot release its memory
-        * itself nor by the oom reaper.
-        */
-       tsk->oom_reaper_list = NULL;
-       exit_oom_victim(tsk);
-
        /* Drop a reference taken by wake_oom_reaper */
        put_task_struct(tsk);
 }
@@ -637,11 +665,7 @@ void wake_oom_reaper(struct task_struct *tsk)
                return;
 
        get_task_struct(tsk);
-
-       spin_lock(&oom_reaper_lock);
-       tsk->oom_reaper_list = oom_reaper_list;
-       oom_reaper_list = tsk;
-       spin_unlock(&oom_reaper_lock);
+       oom_reaper_add_tail(tsk);
        wake_up(&oom_reaper_wait);
 }
 
-- 
2.8.1

-- 
Michal Hocko
SUSE Labs

Reply via email to