> > Going back to Michal's example, say the user configured the following: > > root > / \ > A D > / \ > B C > > A global OOM event happens and we find this: > - A > D > - B, C, D are oomgroups > > What the user is telling us is that B, C, and D are compound memory > consumers. They cannot be divided into their task parts from a memory > point of view. > > However, the user doesn't say the same for A: the A subtree summarizes > and controls aggregate consumption of B and C, but without groupoom > set on A, the user says that A is in fact divisible into independent > memory consumers B and C. > > If we don't have to kill all of A, but we'd have to kill all of D, > does it make sense to compare the two? >
I think Tim has given very clear explanation why comparing A & D makes perfect sense. However I think the above example, a single user system where a user has designed and created the whole hierarchy and then attaches different jobs/applications to different nodes in this hierarchy, is also a valid scenario. One solution I can think of, to cater both scenarios, is to introduce a notion of 'bypass oom' or not include a memcg for oom comparision and instead include its children in the comparison. So, in the same above example: root / \ A(b) D / \ B C A is marked as bypass and thus B and C are to be compared to D. So, for the single user scenario, all the internal nodes are marked 'bypass oom comparison' and oom_priority of the leaves has to be set to the same value. Below is the pseudo code of select_victim_memcg() based on this idea and David's previous pseudo code. The calculation of size of a memcg is still not very well baked here yet. I am working on it and I plan to have a patch based on Roman's v9 "mm, oom: cgroup-aware OOM killer" patch. struct mem_cgroup *memcg = root_mem_cgroup; struct mem_cgroup *selected_memcg = root_mem_cgroup; struct mem_cgroup *low_memcg; unsigned long low_priority; unsigned long prev_badness = memcg_oom_badness(memcg); // Roman's code LIST_HEAD(queue); next_level: low_memcg = NULL; low_priority = ULONG_MAX; next: for_each_child_of_memcg(it, memcg) { unsigned long prio = it->oom_priority; unsigned long badness = 0; if (it->bypass_oom && !it->oom_group && memcg_has_children(it)) { list_add(&it->oom_queue, &queue); continue; } if (prio > low_priority) continue; if (prio == low_priority) { badness = mem_cgroup_usage(it); // for simplicity, need more thinking if (badness < prev_badness) continue; } low_memcg = it; low_priority = prio; prev_badness = badness ?: mem_cgroup_usage(it); // for simplicity } if (!list_empty(&queue)) { memcg = list_last_entry(&queue, struct mem_cgroup, oom_queue); list_del(&memcg->oom_queue); goto next; } if (low_memcg) { selected_memcg = memcg = low_memcg; prev_badness = 0; if (!low_memcg->oom_group) goto next_level; } if (selected_memcg->oom_group) oom_kill_memcg(selected_memcg); else oom_kill_process_from_memcg(selected_memcg);