David Rientjes wrote:
> There may not be any eligible processes left and then the machine panics.

Some enterprise users might prefer "kernel panic followed by kdump and
automatic reboot" to "a system that is not responding for an unpredictable
period", because the panic helps collect information for analyzing which
process caused the freeze. Well, could they use the "Panic (Reboot) On Soft
Lockups" option instead?

> These time-based delays also have caused a complete depletion of memory
> reserves if more than one process is chosen and each consumes an
> non-neglible amount of memory which would then cause livelock. We used to
> have a jiffies-based rekill in 2.6.18 internally and we finally could
> remove it when mm->mmap_sem issues were fixed (mostly by checking for
> fatal_signal_pending() and aborting when necessary).

So, you've already tried that.

Currently the OOM killer kills a process only after

  blocking_notifier_call_chain(&oom_notify_list, 0, &freed);

in out_of_memory() has released all reclaimable memory. This call reduces
the chance of killing a process if the bad process no longer asks for more
memory. But if the bad process keeps asking for more memory and the chosen
task is in TASK_UNINTERRUPTIBLE state, this call contributes to the OOM
killer staying disabled for an unpredictable period, because nothing
reclaimable is left by the time the chosen task needs memory in order to
die. Therefore, releasing all reclaimable memory before the OOM killer
kills a process might be considered bad.

Then, what about the approach described below?

(1) Introduce a kernel thread which reserves (e.g.) 1 percent of kernel
    memory (this amount should be configurable via sysctl) upon startup.

(2) The kernel thread sleeps using wait_event(memory_reservoir_wait) and
    releases PAGE_SIZE bytes from the reserved memory upon each wakeup.

(3) The OOM killer calls wake_up() like

     if (test_tsk_thread_flag(task, TIF_MEMDIE)) {
             if (unlikely(frozen(task)))
                     __thaw_task(task);
    +        /* Let the memory reservoir release memory if the chosen process cannot die. */
    +        if (time_after(jiffies, task->memdie_stamp) &&
    +            task->state == TASK_UNINTERRUPTIBLE)
    +                wake_up(&memory_reservoir_wait);
             if (!force_kill)
                     return OOM_SCAN_ABORT;
     }

    in oom_scan_process_thread(). (memdie_stamp would be a new field in
    struct task_struct.)

(4) When a task for which test_tsk_thread_flag(task, TIF_MEMDIE) is true
    has terminated and the memory used by the task has been reclaimed, the
    kernel thread reserves the reclaimed memory again, up to 1 percent of
    kernel memory.

In this way, we could shorten the period during which the OOM killer is
disabled, as long as the reserved memory is enough to let the chosen
process terminate.
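
For what it's worth, the accounting in steps (1) to (4) can be sketched in
user space. The following is only a rough model in plain C, not kernel code:
struct reservoir, reservoir_init(), reservoir_wakeup(), reservoir_refill(),
should_release(), jiffies_after() and MODEL_PAGE_SIZE are all hypothetical
names made up for illustration. jiffies_after() merely mirrors the
wraparound-safe comparison that time_after() performs, and the 4096-byte
page size is an assumption.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed page size for this model; the kernel would use PAGE_SIZE. */
#define MODEL_PAGE_SIZE 4096UL

/* Hypothetical user-space model of the proposed memory reservoir. */
struct reservoir {
	unsigned long target_pages;   /* pages to hold back, e.g. 1 percent */
	unsigned long reserved_pages; /* pages currently held back */
};

/* Wraparound-safe "a is after b", in the style of time_after(). */
static int jiffies_after(unsigned long a, unsigned long b)
{
	return (long)(b - a) < 0;
}

/* Step (1): reserve `percent` percent of total_pages at startup. */
static void reservoir_init(struct reservoir *r, unsigned long total_pages,
			   unsigned int percent)
{
	assert(percent <= 100);
	r->target_pages = total_pages * percent / 100;
	r->reserved_pages = r->target_pages;
}

/* Step (2): each wakeup releases one page. Returns bytes released. */
static unsigned long reservoir_wakeup(struct reservoir *r)
{
	if (r->reserved_pages == 0)
		return 0; /* reservoir exhausted, nothing left to give */
	r->reserved_pages--;
	return MODEL_PAGE_SIZE;
}

/*
 * Step (3): wake the reservoir only if the victim's deadline has passed
 * and the victim is stuck in an uninterruptible sleep.
 */
static int should_release(unsigned long now, unsigned long memdie_stamp,
			  int uninterruptible)
{
	return jiffies_after(now, memdie_stamp) && uninterruptible;
}

/*
 * Step (4): after the victim has exited, take back freed pages until the
 * reservoir is at its target again. Returns the number of pages taken.
 */
static unsigned long reservoir_refill(struct reservoir *r,
				      unsigned long freed_pages)
{
	unsigned long want = r->target_pages - r->reserved_pages;
	unsigned long take = freed_pages < want ? freed_pages : want;

	r->reserved_pages += take;
	return take;
}
```

For example, with 1000 pages in total and a 1 percent setting,
reservoir_init() holds back 10 pages; one wakeup leaves 9, and a later
reservoir_refill() with 5 freed pages takes back only the 1 page that is
missing. What the model cannot show is the hard part: whether the released
pages actually reach the chosen task rather than the process that keeps
allocating.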