Andrea's point about ksm_migrate_page() is an important one, and I've
answered that against his mail; but here are some other, easier points.

On Mon, 24 Sep 2012, Petr Holasek wrote:

> Introduces a new sysfs boolean knob /sys/kernel/mm/ksm/merge_across_nodes
> which controls the merging of pages across different numa nodes.
> When it is set to zero, only pages from the same node are merged;
> otherwise pages from all nodes can be merged together (default behavior).
> 
> A typical use-case would be a lot of KVM guests on a NUMA machine, where
> cpus from more distant nodes would see a significant increase in access
> latency to the merged ksm page.  The sysfs knob was chosen for greater
> flexibility, since some users still prefer a higher amount of saved
> physical memory regardless of access latency.
> 
> Every numa node has its own stable & unstable tree, for faster searching
> and inserting.  Changing the merge_nodes value is possible only

merge_across_nodes

> when there are no ksm shared pages in the system.
> 
> I've tested this patch on numa machines with 2, 4 and 8 nodes, and
> measured the speed of memory access inside KVM guests whose memory was
> pinned to one of the nodes, using this benchmark:
> 
> http://pholasek.fedorapeople.org/alloc_pg.c
> 
> Population standard deviations of access times, as a percentage of the
> average, were as follows:
> 
> merge_nodes=1

merge_across_nodes

> 2 nodes       1.4%
> 4 nodes       1.6%
> 8 nodes       1.7%
> 
> merge_nodes=0

merge_across_nodes

> 2 nodes       1%
> 4 nodes       0.32%
> 8 nodes       0.018%
> 
> RFC: https://lkml.org/lkml/2011/11/30/91
> v1: https://lkml.org/lkml/2012/1/23/46
> v2: https://lkml.org/lkml/2012/6/29/105
> v3: https://lkml.org/lkml/2012/9/14/550
> 
> Changelog:
> 
> v2: Andrew's objections were reflected:
>       - value of merge_nodes can't be changed while there are some ksm
>       pages in system
>       - merge_nodes sysfs entry appearance depends on CONFIG_NUMA
>       - more verbose documentation
>       - added some performance testing results
> 
> v3:   - more verbose documentation
>       - fixed race in merge_nodes store function
>       - introduced share_all debugging knob proposed by Andrew
>       - minor cleanups
> 
> v4:   - merge_nodes was renamed to merge_across_nodes
>       - share_all debug knob was dropped

Yes, you were right to drop the share_all knob for now.  I do like the
idea, but it was quite inappropriate to mix it in with this NUMA-node
patch.  And though I like it, I think it wants a bit more: I
already have a hacky "allksm" boot option patch to mm/mmap.c for my
own testing, which adds VM_MERGEABLE everywhere.  If I gave that up
(I'd like to!) in favour of yours, I think I would still be missing
all the mms that are created before there's a chance to switch the
share_all mode on.  Or maybe I misread your v3.  Anyway, that's a
different topic, happily taken off today's agenda.
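
For reference, my hack is roughly this shape (reconstructed from memory,
so treat it as a sketch rather than the actual patch):

	/* mm/mmap.c: crude "allksm" boot option, for KSM testing only */
	static int allksm __read_mostly;

	static int __init allksm_setup(char *str)
	{
		allksm = 1;
		return 1;
	}
	__setup("allksm", allksm_setup);

	/* ... and then in mmap_region(), where vm_flags are decided: */
		if (allksm)
			vm_flags |= VM_MERGEABLE;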

>       - get_kpfn_nid helper
>       - fixed page migration behaviour
> 
> Signed-off-by: Petr Holasek <phola...@redhat.com>
> ---
>  Documentation/vm/ksm.txt |   7 +++
>  mm/ksm.c                 | 135 ++++++++++++++++++++++++++++++++++++++++-------
>  2 files changed, 122 insertions(+), 20 deletions(-)
> 
> diff --git a/Documentation/vm/ksm.txt b/Documentation/vm/ksm.txt
> index b392e49..100d58d 100644
> --- a/Documentation/vm/ksm.txt
> +++ b/Documentation/vm/ksm.txt
> @@ -58,6 +58,13 @@ sleep_millisecs  - how many milliseconds ksmd should sleep before next scan
>                     e.g. "echo 20 > /sys/kernel/mm/ksm/sleep_millisecs"
>                     Default: 20 (chosen for demonstration purposes)
>  
> +merge_nodes      - specifies if pages from different numa nodes can be merged.

merge_across_nodes

> +                   When set to 0, ksm merges only pages which physically
> +                   reside in the memory area of same NUMA node. It brings
> +                   lower latency to access to shared page. Value can be
> +                   changed only when there is no ksm shared pages in system.
> +                   Default: 1
> +
>  run              - set 0 to stop ksmd from running but keep merged pages,
>                     set 1 to run ksmd e.g. "echo 1 > /sys/kernel/mm/ksm/run",
>                     set 2 to stop ksmd and unmerge all pages currently merged,
> diff --git a/mm/ksm.c b/mm/ksm.c
> index 47c8853..7c82032 100644
> --- a/mm/ksm.c
> +++ b/mm/ksm.c
> @@ -36,6 +36,7 @@
>  #include <linux/hash.h>
>  #include <linux/freezer.h>
>  #include <linux/oom.h>
> +#include <linux/numa.h>
>  
>  #include <asm/tlbflush.h>
>  #include "internal.h"
> @@ -140,7 +141,10 @@ struct rmap_item {
>       unsigned long address;          /* + low bits used for flags below */
>       unsigned int oldchecksum;       /* when unstable */
>       union {
> -             struct rb_node node;    /* when node of unstable tree */
> +             struct {
> +                     struct rb_node node;    /* when node of unstable tree */
> +                     struct rb_root *root;
> +             };

This worries me a little, enlarging struct rmap_item: there may
be very many of them in the system, best to minimize their size.

(This struct rb_root *root is used for one thing only, isn't it?  For the
unstable rb_erase() in remove_rmap_item_from_tree().  It annoys me that
we need to record root just for that, but I don't see an alternative.)

The 64-bit case can be easily improved by locating unsigned int nid
after oldchecksum instead.  The 32-bit case (which supports smaller
NODES_SHIFT: 6 was the largest I found) could be "improved" by keeping
nid in the lower bits of address along with (moved) UNSTABLE_FLAG and
STABLE_FLAG and reduced SEQNR_MASK - we really need only 1 bit for that,
but 2 bits would be nice for keeping the BUG_ON(age > 1).
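
If it helps, the packing I have in mind is along these lines (untested,
and the names are invented here):

	#define SEQNR_MASK	0x003	/* 2 bits of seqnr keep BUG_ON(age > 1) */
	#define UNSTABLE_FLAG	0x004	/* moved down from 0x100 */
	#define STABLE_FLAG	0x008	/* moved down from 0x200 */
	#define NID_SHIFT	4
	#define NID_MASK	(((1UL << NODES_SHIFT) - 1) << NID_SHIFT)

	static inline int rmap_item_nid(struct rmap_item *rmap_item)
	{
		return (rmap_item->address & NID_MASK) >> NID_SHIFT;
	}

That's 2+1+1+NODES_SHIFT low bits of address: still below PAGE_SHIFT
even with NODES_SHIFT 6.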

But you may think I'm going too far there, and prefer just to put
#ifdef CONFIG_NUMA around the unsigned int nid, so at least it does
not enlarge the more basic 32-bit configuration.
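
That cheaper variant would look something like this (surrounding fields
from memory, so just a sketch):

	struct rmap_item {
		struct rmap_item *rmap_list;
		struct anon_vma *anon_vma;	/* when stable */
		struct mm_struct *mm;
		unsigned long address;		/* + low bits used for flags below */
		unsigned int oldchecksum;	/* when unstable */
	#ifdef CONFIG_NUMA
		unsigned int nid;		/* fills 64-bit alignment hole */
	#endif
		union {
			struct rb_node node;	/* when node of unstable tree */
			struct {		/* when listed from stable tree */
				struct stable_node *head;
				struct hlist_node hlist;
			};
		};
	};

remove_rmap_item_from_tree() could then rb_erase() from
&root_unstable_tree[rmap_item->nid], and drop the root pointer.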

>               struct {                /* when listed from stable tree */
>                       struct stable_node *head;
>                       struct hlist_node hlist;
> @@ -153,8 +157,8 @@ struct rmap_item {
>  #define STABLE_FLAG  0x200   /* is listed from the stable tree */
>  
>  /* The stable and unstable tree heads */
> -static struct rb_root root_stable_tree = RB_ROOT;
> -static struct rb_root root_unstable_tree = RB_ROOT;
> +static struct rb_root root_unstable_tree[MAX_NUMNODES] = { RB_ROOT, };
> +static struct rb_root root_stable_tree[MAX_NUMNODES] = { RB_ROOT, };
>  
>  #define MM_SLOTS_HASH_SHIFT 10
>  #define MM_SLOTS_HASH_HEADS (1 << MM_SLOTS_HASH_SHIFT)
> @@ -189,6 +193,9 @@ static unsigned int ksm_thread_pages_to_scan = 100;
>  /* Milliseconds ksmd should sleep between batches */
>  static unsigned int ksm_thread_sleep_millisecs = 20;
>  
> +/* Zeroed when merging across nodes is not allowed */
> +static unsigned int ksm_merge_across_nodes = 1;
> +
>  #define KSM_RUN_STOP 0
>  #define KSM_RUN_MERGE        1
>  #define KSM_RUN_UNMERGE      2
> @@ -447,10 +454,25 @@ out:            page = NULL;
>       return page;
>  }
>  
> +/*
> + * This helper is used for getting right index into array of tree roots.
> + * When merge_across_nodes knob is set to 1, there are only two rb-trees for
> + * stable and unstable pages from all nodes with roots in index 0. Otherwise,
> + * every node has its own stable and unstable tree.
> + */
> +static inline int get_kpfn_nid(unsigned long kpfn)
> +{
> +     if (ksm_merge_across_nodes)
> +             return 0;
> +     else
> +             return pfn_to_nid(kpfn);
> +}
> +
>  static void remove_node_from_stable_tree(struct stable_node *stable_node)
>  {
>       struct rmap_item *rmap_item;
>       struct hlist_node *hlist;
> +     int nid;
>  
>       hlist_for_each_entry(rmap_item, hlist, &stable_node->hlist, hlist) {
>               if (rmap_item->hlist.next)
> @@ -462,7 +484,10 @@ static void remove_node_from_stable_tree(struct stable_node *stable_node)
>               cond_resched();
>       }
>  
> -     rb_erase(&stable_node->node, &root_stable_tree);
> +     nid = get_kpfn_nid(stable_node->kpfn);
> +
> +     rb_erase(&stable_node->node,
> +                     &root_stable_tree[nid]);

Oh, if I ever get my fingers on that, it's going to join the line above :)

>       free_stable_node(stable_node);
>  }
>  
> @@ -560,7 +585,7 @@ static void remove_rmap_item_from_tree(struct rmap_item *rmap_item)
>               age = (unsigned char)(ksm_scan.seqnr - rmap_item->address);
>               BUG_ON(age > 1);
>               if (!age)
> -                     rb_erase(&rmap_item->node, &root_unstable_tree);
> +                     rb_erase(&rmap_item->node, rmap_item->root);

Right, this is where we have to use rmap_item->root
or &root_unstable_tree[rmap_item->nid].

>  
>               ksm_pages_unshared--;
>               rmap_item->address &= PAGE_MASK;
> @@ -989,8 +1014,9 @@ static struct page *try_to_merge_two_pages(struct rmap_item *rmap_item,
>   */
>  static struct page *stable_tree_search(struct page *page)
>  {
> -     struct rb_node *node = root_stable_tree.rb_node;
> +     struct rb_node *node;
>       struct stable_node *stable_node;
> +     int nid;
>  
>       stable_node = page_stable_node(page);
>       if (stable_node) {                      /* ksm page forked */
> @@ -998,6 +1024,9 @@ static struct page *stable_tree_search(struct page *page)
>               return page;
>       }
>  
> +     nid = get_kpfn_nid(page_to_pfn(page));
> +     node = root_stable_tree[nid].rb_node;
> +
>       while (node) {
>               struct page *tree_page;
>               int ret;
> @@ -1032,10 +1061,14 @@ static struct page *stable_tree_search(struct page *page)
>   */
>  static struct stable_node *stable_tree_insert(struct page *kpage)
>  {
> -     struct rb_node **new = &root_stable_tree.rb_node;
> +     int nid;
> +     struct rb_node **new = NULL;

No need to initialize new.

>       struct rb_node *parent = NULL;
>       struct stable_node *stable_node;
>  
> +     nid = get_kpfn_nid(page_to_nid(kpage));
> +     new = &root_stable_tree[nid].rb_node;
> +
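
By the way, get_kpfn_nid() takes a pfn, doesn't it?  So shouldn't this
be get_kpfn_nid(page_to_pfn(kpage)), as in stable_tree_search() above?
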
>       while (*new) {
>               struct page *tree_page;
>               int ret;
> @@ -1069,7 +1102,7 @@ static struct stable_node *stable_tree_insert(struct page *kpage)
>               return NULL;
>  
>       rb_link_node(&stable_node->node, parent, new);
> -     rb_insert_color(&stable_node->node, &root_stable_tree);
> +     rb_insert_color(&stable_node->node, &root_stable_tree[nid]);
>  
>       INIT_HLIST_HEAD(&stable_node->hlist);
>  
> @@ -1097,10 +1130,16 @@ static
>  struct rmap_item *unstable_tree_search_insert(struct rmap_item *rmap_item,
>                                             struct page *page,
>                                             struct page **tree_pagep)
> -

Thank you!

>  {
> -     struct rb_node **new = &root_unstable_tree.rb_node;
> +     struct rb_node **new = NULL;

No need to initialize new.

> +     struct rb_root *root;
>       struct rb_node *parent = NULL;
> +     int nid;
> +
> +     nid = get_kpfn_nid(page_to_pfn(page));
> +     root = &root_unstable_tree[nid];
> +
> +     new = &root->rb_node;
>  
>       while (*new) {
>               struct rmap_item *tree_rmap_item;
> @@ -1138,8 +1177,9 @@ struct rmap_item *unstable_tree_search_insert(struct rmap_item *rmap_item,
>  
>       rmap_item->address |= UNSTABLE_FLAG;
>       rmap_item->address |= (ksm_scan.seqnr & SEQNR_MASK);
> +     rmap_item->root = root;
>       rb_link_node(&rmap_item->node, parent, new);
> -     rb_insert_color(&rmap_item->node, &root_unstable_tree);
> +     rb_insert_color(&rmap_item->node, root);
>  
>       ksm_pages_unshared++;
>       return NULL;
> @@ -1282,6 +1322,7 @@ static struct rmap_item *scan_get_next_rmap_item(struct page **page)
>       struct mm_slot *slot;
>       struct vm_area_struct *vma;
>       struct rmap_item *rmap_item;
> +     int i;

I'd call it nid myself.

>  
>       if (list_empty(&ksm_mm_head.mm_list))
>               return NULL;
> @@ -1300,7 +1341,8 @@ static struct rmap_item *scan_get_next_rmap_item(struct page **page)
>                */
>               lru_add_drain_all();
>  
> -             root_unstable_tree = RB_ROOT;
> +             for (i = 0; i < MAX_NUMNODES; i++)
> +                     root_unstable_tree[i] = RB_ROOT;
>  
>               spin_lock(&ksm_mmlist_lock);
>               slot = list_entry(slot->mm_list.next, struct mm_slot, mm_list);
> @@ -1758,7 +1800,12 @@ void ksm_migrate_page(struct page *newpage, struct page *oldpage)
>       stable_node = page_stable_node(newpage);
>       if (stable_node) {
>               VM_BUG_ON(stable_node->kpfn != page_to_pfn(oldpage));
> -             stable_node->kpfn = page_to_pfn(newpage);
> +
> +             if (ksm_merge_across_nodes ||
> +                             page_to_nid(oldpage) == page_to_nid(newpage))
> +                     stable_node->kpfn = page_to_pfn(newpage);
> +             else
> +                     remove_node_from_stable_tree(stable_node);

Important point from Andrea discussed elsewhere.

>       }
>  }
>  #endif /* CONFIG_MIGRATION */
> @@ -1768,15 +1815,19 @@ static struct stable_node *ksm_check_stable_tree(unsigned long start_pfn,
>                                                unsigned long end_pfn)
>  {
>       struct rb_node *node;
> +     int i;

I'd call it nid myself.

>  
> -     for (node = rb_first(&root_stable_tree); node; node = rb_next(node)) {
> -             struct stable_node *stable_node;
> +     for (i = 0; i < MAX_NUMNODES; i++)

It's irritating to have to do this outer nid loop, but I think you're
right, that the memory hotremove notification does not quite tell us
which node to look at.  Or can we derive that from start_pfn?  Would
it be safe to assume that end_pfn-1 must be in the same node?
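
If that assumption is safe, the outer loop collapses to something like
(just to illustrate the idea):

	int nid = get_kpfn_nid(start_pfn);
	struct rb_node *node;

	for (node = rb_first(&root_stable_tree[nid]); node;
	     node = rb_next(node)) {
		struct stable_node *stable_node;

		stable_node = rb_entry(node, struct stable_node, node);
		if (stable_node->kpfn >= start_pfn &&
		    stable_node->kpfn < end_pfn)
			return stable_node;
	}
	return NULL;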

> +             for (node = rb_first(&root_stable_tree[i]); node;
> +                             node = rb_next(node)) {
> +                     struct stable_node *stable_node;
> +
> +                     stable_node = rb_entry(node, struct stable_node, node);
> +                     if (stable_node->kpfn >= start_pfn &&
> +                         stable_node->kpfn < end_pfn)
> +                             return stable_node;
> +             }
>  
> -             stable_node = rb_entry(node, struct stable_node, node);
> -             if (stable_node->kpfn >= start_pfn &&
> -                 stable_node->kpfn < end_pfn)
> -                     return stable_node;
> -     }
>       return NULL;
>  }
>  
> @@ -1926,6 +1977,47 @@ static ssize_t run_store(struct kobject *kobj, struct kobj_attribute *attr,
>  }
>  KSM_ATTR(run);
>  
> +#ifdef CONFIG_NUMA
> +static ssize_t merge_across_nodes_show(struct kobject *kobj,
> +                             struct kobj_attribute *attr, char *buf)
> +{
> +     return sprintf(buf, "%u\n", ksm_merge_across_nodes);
> +}
> +
> +static ssize_t merge_across_nodes_store(struct kobject *kobj,
> +                                struct kobj_attribute *attr,
> +                                const char *buf, size_t count)
> +{
> +     int err;
> +     unsigned long knob;
> +
> +     err = kstrtoul(buf, 10, &knob);
> +     if (err)
> +             return err;
> +     if (knob > 1)
> +             return -EINVAL;
> +
> +     mutex_lock(&ksm_thread_mutex);
> +     if (ksm_run & KSM_RUN_MERGE) {
> +             err = -EBUSY;
> +     } else {
> +             if (ksm_merge_across_nodes != knob) {
> +                     if (ksm_pages_shared > 0)
> +                             err = -EBUSY;
> +                     else
> +                             ksm_merge_across_nodes = knob;
> +             }
> +     }
> +
> +     if (err)
> +             count = err;
> +     mutex_unlock(&ksm_thread_mutex);
> +
> +     return count;
> +}
> +KSM_ATTR(merge_across_nodes);
> +#endif
> +
>  static ssize_t pages_shared_show(struct kobject *kobj,
>                                struct kobj_attribute *attr, char *buf)
>  {
> @@ -1980,6 +2072,9 @@ static struct attribute *ksm_attrs[] = {
>       &pages_unshared_attr.attr,
>       &pages_volatile_attr.attr,
>       &full_scans_attr.attr,
> +#ifdef CONFIG_NUMA
> +     &merge_across_nodes_attr.attr,
> +#endif
>       NULL,
>  };
>  
> -- 
> 1.7.11.4

Looks nice - thank you.

Hugh