>
> Going back to Michal's example, say the user configured the following:
>
>        root
>       /    \
>      A      D
>     / \
>    B   C
>
> A global OOM event happens and we find this:
> - A > D
> - B, C, D are oomgroups
>
> What the user is telling us is that B, C, and D are compound memory
> consumers. They cannot be divided into their task parts from a memory
> point of view.
>
> However, the user doesn't say the same for A: the A subtree summarizes
> and controls aggregate consumption of B and C, but without groupoom
> set on A, the user says that A is in fact divisible into independent
> memory consumers B and C.
>
> If we don't have to kill all of A, but we'd have to kill all of D,
> does it make sense to compare the two?
>

I think Tim has given very clear explanation why comparing A & D makes
perfect sense. However I think the above example, a single user system
where a user has designed and created the whole hierarchy and then
attaches different jobs/applications to different nodes in this
hierarchy, is also a valid scenario. One solution I can think of, to
cater both scenarios, is to introduce a notion of 'bypass oom' or not
include a memcg for oom comparision and instead include its children
in the comparison.

So, in the same above example:
        root
       /       \
      A(b)    D
     /  \
    B   C

A is marked as bypass and thus B and C are to be compared to D. So,
for the single user scenario, all the internal nodes are marked
'bypass oom comparison' and oom_priority of the leaves has to be set
to the same value.

Below is the pseudo code of select_victim_memcg() based on this idea
and David's previous pseudo code. The calculation of size of a memcg
is still not very well baked here yet. I am working on it and I plan
to have a patch based on Roman's v9 "mm, oom: cgroup-aware OOM killer"
patch.


        struct mem_cgroup *memcg = root_mem_cgroup;
        struct mem_cgroup *selected_memcg = root_mem_cgroup;
        struct mem_cgroup *low_memcg;
        unsigned long low_priority;
        unsigned long prev_badness = memcg_oom_badness(memcg); // Roman's code
        LIST_HEAD(queue);

next_level:
        low_memcg = NULL;
        low_priority = ULONG_MAX;

next:
        for_each_child_of_memcg(it, memcg) {
                unsigned long prio = it->oom_priority;
                unsigned long badness = 0;

                if (it->bypass_oom && !it->oom_group &&
memcg_has_children(it)) {
                        list_add(&it->oom_queue, &queue);
                        continue;
                }

                if (prio > low_priority)
                        continue;

                if (prio == low_priority) {
                        badness = mem_cgroup_usage(it); // for
simplicity, need more thinking
                        if (badness < prev_badness)
                                continue;
                }

                low_memcg = it;
                low_priority = prio;
                prev_badness = badness ?: mem_cgroup_usage(it);  //
for simplicity
        }
        if (!list_empty(&queue)) {
                memcg = list_last_entry(&queue, struct mem_cgroup, oom_queue);
                list_del(&memcg->oom_queue);
                goto next;
        }
        if (low_memcg) {
                selected_memcg = memcg = low_memcg;
                prev_badness = 0;
                if (!low_memcg->oom_group)
                        goto next_level;
        }
        if (selected_memcg->oom_group)
                oom_kill_memcg(selected_memcg);
        else
                oom_kill_process_from_memcg(selected_memcg);

Reply via email to