There are two places where the numa balancing code sets a task's
numa_preferred_nid.

The primary location is task_numa_placement(), where the kernel
examines the NUMA fault statistics to determine the node that holds
most of the memory the task (or numa_group) accesses.
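As a rough sketch of that decision (plain userspace C, not the kernel
code; faults[] is a made-up stand-in for the real per-node fault
statistics):

#include <stdio.h>

#define NR_NODES	4

/* Prefer the node where most of the recorded NUMA faults happened. */
static int pick_preferred_nid(const unsigned long faults[], int nr_nodes)
{
	unsigned long max_faults = 0;
	int nid, max_nid = -1;

	for (nid = 0; nid < nr_nodes; nid++) {
		if (faults[nid] > max_faults) {
			max_faults = faults[nid];
			max_nid = nid;
		}
	}
	return max_nid;
}

int main(void)
{
	unsigned long faults[NR_NODES] = { 10, 250, 30, 40 };

	/* Node 1 has the most faults, so it becomes the preferred node. */
	printf("preferred nid: %d\n", pick_preferred_nid(faults, NR_NODES));
	return 0;
}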
The second location is only used for large workloads, where a
numa_group has enough tasks that they are spread out over several NUMA
nodes, and multiple nodes are set in the numa_group's active_nodes
mask.

In order to allow those workloads to settle down, we pretend that any
node inside the numa_group's active_nodes mask is the task's new
preferred node. This dissuades task_numa_fault() from continuously
retrying to migrate the task to the group's preferred node, and lets a
multi-node workload settle down, which in turn improves the locality
of private faults inside the numa_group.

Reported-by: Shrikar Dronamraju <sri...@linux.vnet.ibm.com>
Signed-off-by: Rik van Riel <r...@redhat.com>
---
 kernel/sched/fair.c | 23 +++++++++++++++--------
 1 file changed, 15 insertions(+), 8 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index c2980e8733bc..54bb57f09e75 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1485,7 +1485,12 @@ static int task_numa_migrate(struct task_struct *p)
 		groupweight = group_weight(p, env.src_nid, dist);
 	}
 
-	/* Only consider nodes where both task and groups benefit */
+	/*
+	 * Only consider nodes where placement is better for
+	 * either the group (help large workloads converge),
+	 * or the task (placement of tasks within a numa group,
+	 * and single threaded processes).
+	 */
 	taskimp = task_weight(p, nid, dist) - taskweight;
 	groupimp = group_weight(p, nid, dist) - groupweight;
 	if (taskimp < 0 && groupimp < 0)
@@ -1499,12 +1504,14 @@ static int task_numa_migrate(struct task_struct *p)
 	}
 
 	/*
-	 * If the task is part of a workload that spans multiple NUMA nodes,
-	 * and is migrating into one of the workload's active nodes, remember
-	 * this node as the task's preferred numa node, so the workload can
-	 * settle down.
-	 * A task that migrated to a second choice node will be better off
-	 * trying for a better one later. Do not set the preferred node here.
+	 * The primary place for setting a task's numa_preferred_nid is in
+	 * task_numa_placement(). If a task is moved to a sub-optimal node,
+	 * leave numa_preferred_nid alone, so task_numa_fault() will retry
+	 * migrating the task to where it really belongs.
+	 * The exception is a task that belongs to a large numa_group, which
+	 * spans multiple NUMA nodes. If that task migrates into one of the
+	 * workload's active nodes, remember that node as the task's
+	 * numa_preferred_nid, so the workload can settle down.
 	 */
 	if (p->numa_group) {
 		if (env.best_cpu == -1)
@@ -1513,7 +1520,7 @@ static int task_numa_migrate(struct task_struct *p)
 			nid = env.dst_nid;
 
 		if (node_isset(nid, p->numa_group->active_nodes))
-			sched_setnuma(p, env.dst_nid);
+			sched_setnuma(p, nid);
 	}
 
 	/* No better CPU than the current one was found. */
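As a self-contained illustration of the settle-down rule in the last
hunk (plain userspace C, not kernel code; the bitmask and NODE_ISSET()
stand in for the kernel's nodemask and node_isset()):

#include <stdio.h>

/* Stand-in for the kernel's node_isset() on a nodemask. */
#define NODE_ISSET(mask, nid)	(((mask) >> (nid)) & 1UL)

int main(void)
{
	unsigned long active_nodes = (1UL << 0) | (1UL << 2); /* nodes 0, 2 */
	int nid = 2;	/* the node the task just migrated to */

	if (NODE_ISSET(active_nodes, nid))
		printf("node %d is active: set numa_preferred_nid\n", nid);
	else
		printf("node %d not active: leave numa_preferred_nid\n", nid);
	return 0;
}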