On Sat 13-06-15 00:23:00, Tetsuo Handa wrote:
[...]
> >From e59b64683827151a35257384352c70bce61babdd Mon Sep 17 00:00:00 2001
> From: Tetsuo Handa <penguin-ker...@i-love.sakura.ne.jp>
> Date: Fri, 12 Jun 2015 23:56:18 +0900
> Subject: [RFC] oom: implement memdie_task_panic_secs
> 
> OOM killer is a desperate last resort reclaim attempt to free some
> memory. It is based on heuristics which will never be 100% and may
> result in an unusable or a locked up system.
> 
> panic_on_oom sysctl knob allows to set the OOM policy to panic the
> system instead of trying to resolve the OOM condition. This might be
> useful for several reasons - e.g. reduce the downtime to a predictable
> amount of time, allow to get a crash dump of the system and debug the
> issue post-mortem.
> 
> panic_on_oom is, however, a big hammer in many situations when the
> OOM condition could be resolved in a reasonable time. So it would be
> good to have some middle ground and allow the OOM killer to do its job
> but have a failover when things go wrong and it is not able to make any
> further progress for a considerable amount of time.
> 
> This patch implements system_memdie_panic_secs sysctl which configures
> a maximum timeout for the OOM killer to resolve the OOM situation.
> If the system is still under OOM (i.e. the OOM victim cannot release
> memory) after the timeout expires, it will panic the system. A
> reasonably chosen timeout can protect from both temporal OOM conditions
> and allows to have a predictable time frame for the OOM condition.
> 
> Since there are memcg OOM, cpuset OOM, mempolicy OOM as with system OOM,
> this patch also implements {memcg,cpuset,mempolicy}_memdie_panic_secs .

I really hate having so many knobs. What would they be good for? Why
cannot you simply use a single timeout and decide whether to panic or
not based on panic_on_oom value? Or do you have any strong reason to
put this aside from panic_on_oom?

> These will allow administrator to use different timeout settings for
> each type of OOM, for administrator still has chance to perform steps
> to resolve the potential lockup or trashing from the global context
> (e.g. by relaxing restrictions or even rebooting cleanly).
> 
> Signed-off-by: Tetsuo Handa <penguin-ker...@i-love.sakura.ne.jp>
> ---
>  include/linux/oom.h   |  8 +++++
>  include/linux/sched.h |  1 +
>  kernel/sysctl.c       | 39 ++++++++++++++++++++++++
>  mm/oom_kill.c         | 83 
> +++++++++++++++++++++++++++++++++++++++++++++++++++
>  4 files changed, 131 insertions(+)
> 
[...]
> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> index dff991e..40d7b6d0 100644
> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c
[...]
> +static void check_memdie_task(struct task_struct *p, struct mem_cgroup 
> *memcg,
> +                           const nodemask_t *nodemask)
> +{
> +     unsigned long start = p->memdie_start;
> +     unsigned long spent;
> +     unsigned long timeout = 0;
> +
> +     /* If task does not have TIF_MEMDIE flag, there is nothing to do. */
> +     if (!start)
> +             return;
> +     spent = jiffies - start;
> +#ifdef CONFIG_MEMCG
> +     /* task_in_mem_cgroup(p, memcg) is true. */
> +     if (memcg)
> +             timeout = sysctl_cgroup_memdie_panic_secs;
> +#endif
> +#ifdef CONFIG_NUMA
> +     /* has_intersects_mems_allowed(p, nodemask) is true. */
> +     else if (nodemask)
> +             timeout = sysctl_mempolicy_memdie_panic_secs;
> +     else
> +             timeout = sysctl_cpuset_memdie_panic_secs;
> +#endif
> +     if (!timeout)
> +             timeout = sysctl_system_memdie_panic_secs;
> +     /* If timeout is disabled, there is nothing to do. */
> +     if (!timeout)
> +             return;
> +#ifdef CONFIG_NUMA
> +     {
> +             struct task_struct *t;
> +
> +             rcu_read_lock();
> +             for_each_thread(p, t) {
> +                     start = t->memdie_start;
> +                     if (start && time_after(spent, timeout * HZ))
> +                             break;
> +             }
> +             rcu_read_unlock();

This doesn't make any sense to me. What are you trying to achieve here?
Why would you want to check all threads and do that only for CONFIG_NUMA
and even then do a noop if the timeout expired?

> +     }
> +#endif
> +     if (time_before(spent, timeout * HZ))
> +             return;

I think the whole function is way too much complicated without a good
reason. Why don't you simply store the expire time when marking the task
oom victim and compare it with the current jiffies with time_after and
be done with it. This is few lines of code...

> +     panic("Out of memory: %s (%u) did not die within %lu seconds.\n",
> +           p->comm, p->pid, timeout);

It would be better to sync this message with what check_panic_on_oom
does.

> +}
> +
>  /* return true if the task is not adequate as candidate victim task. */
>  static bool oom_unkillable_task(struct task_struct *p,
>               struct mem_cgroup *memcg, const nodemask_t *nodemask)
> @@ -135,6 +209,7 @@ static bool oom_unkillable_task(struct task_struct *p,
>       if (!has_intersects_mems_allowed(p, nodemask))
>               return true;
>  
> +     check_memdie_task(p, memcg, nodemask);

This is not sufficient. oom_scan_process_thread would break out from the
loop when encountering the first TIF_MEMDIE task and could have missed
an older one later in the task_list.
Besides that oom_unkillable_task doesn't sound like a good match to
evaluate this logic. I would expect it to be in oom_scan_process_thread.

>       return false;
>  }
>  
> @@ -416,10 +491,17 @@ bool oom_killer_disabled __read_mostly;
>   */
>  void mark_oom_victim(struct task_struct *tsk)
>  {
> +     unsigned long start;
> +
>       WARN_ON(oom_killer_disabled);
>       /* OOM killer might race with memcg OOM */
>       if (test_and_set_tsk_thread_flag(tsk, TIF_MEMDIE))
>               return;
> +     /* Set current time for is_killable_memdie_task() check. */
> +     start = jiffies;
> +     if (!start)
> +             start = 1;
> +     tsk->memdie_start = start;

I would rather go with tsk->oom_expire = jiffies + timeout and set the
timeout depending on panic_on_oom value (which would require nodemask
and memcg parameters here).

>       /*
>        * Make sure that the task is woken up from uninterruptible sleep
>        * if it is frozen because OOM killer wouldn't be able to free
> @@ -435,6 +517,7 @@ void mark_oom_victim(struct task_struct *tsk)
>   */
>  void exit_oom_victim(void)
>  {
> +     current->memdie_start = 0;

Is this really needed? OOM killer shouldn't see the task because it has
already released its mm. oom_scan_process_thread checks mm after it
TIF_MEMDIE so we can race theoretically but this shouldn't matter much.
If a task is still visible after the timeout then there obviously was a
problem in making progress.

>       clear_thread_flag(TIF_MEMDIE);
>  
>       if (!atomic_dec_return(&oom_victims))
> -- 
> 1.8.3.1
> ------------------------------------------------------------
> 
> By the way, with introduction of per "struct task_struct" variable, I think
> that we can replace TIF_MEMDIE checks with memdie_start checks via
> 
>   test_tsk_thread_flag(p, TIF_MEMDIE) => p->memdie_start
> 
>   test_and_clear_thread_flag(TIF_MEMDIE) => xchg(&current->memdie_start, 0)
> 
>   test_and_set_tsk_thread_flag(p, TIF_MEMDIE)
>   => xchg(&p->memdie_start, jiffies (or 1 if jiffies == 0))
> 
> though above patch did not replace TIF_MEMDIE in order to focus on one thing.

I fail to see a direct advantage other than to safe one bit in flags. Is
something asking for it?
-- 
Michal Hocko
SUSE Labs
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Reply via email to