Running will-it-scale/page_fault1 process mode workload on a 2 sockets
Intel Skylake server showed severe lock contention of zone->lock, as
high as about 80%(43% on allocation path and 38% on free path) CPU
cycles are burnt spinning. With perf, the most time consuming part inside
that lock on free path is cache missing on page structures, mostly on
the to-be-freed page's buddy due to merging.

One way to avoid this overhead is not do any merging at all for order-0
pages and leave the need for high order pages to compaction. With this
approach, the lock contention for zone->lock on free path dropped to 4%
but allocation side still has as high as 43% lock contention. In the
meantime, the dropped lock contention on free side doesn't translate to
performance increase, instead, it's consumed by increased lock contention
of the per node lru_lock(rose from 2% to 33%).

One concern of this approach is, how much impact does it have on high order
page allocation, like for order-9 pages? I have run the stress-highalloc
workload on a Haswell Desktop(8 CPU/4G memory) sometime ago and it showed
similar success rate for vanilla kernel and the patched kernel, both at 74%.
Note that it indeed took longer to finish the test: 244s vs 218s.

1 vanilla
  Attempted allocations:          1917
  Failed allocs:                   494
  Success allocs:                 1423
  % Success:                        74
  Duration alloctest pass:        218s

2 no_merge_in_buddy
  Attempted allocations:         1917
  Failed allocs:                  497
  Success allocs:                1420
  % Success:                       74
  Duration alloctest pass:       244s

The above test was done with the default --ms-delay=100, which means there
is a delay of 100ms between page allocations. If I set the delay to 1ms,
the success rate of this patch will drop to 36% while vanilla could maintain
at about 70%. Though 1ms delay may not be practical, it indeed shows the
possible impact of this patch on high order page allocation.

The next patch deals with allocation path zone->lock contention.

Suggested-by: Dave Hansen <dave.han...@intel.com>
Signed-off-by: Aaron Lu <aaron...@intel.com>
---
 mm/compaction.c |  28 ++++++++++++++
 mm/internal.h   |  15 +++++++-
 mm/page_alloc.c | 116 ++++++++++++++++++++++++++++++++++++--------------------
 3 files changed, 117 insertions(+), 42 deletions(-)

diff --git a/mm/compaction.c b/mm/compaction.c
index 10cd757f1006..b53c4d420533 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -669,6 +669,28 @@ static bool too_many_isolated(struct zone *zone)
        return isolated > (inactive + active) / 2;
 }
 
+static int merge_page(struct zone *zone, struct page *page, unsigned long pfn)
+{
+       int order = 0;
+       unsigned long buddy_pfn = __find_buddy_pfn(pfn, order);
+       struct page *buddy = page + (buddy_pfn - pfn);
+
+       /* Only do merging if the merge skipped page's buddy is also free */
+       if (PageBuddy(buddy)) {
+               int mt = get_pageblock_migratetype(page);
+               unsigned long flags;
+
+               spin_lock_irqsave(&zone->lock, flags);
+               if (likely(page_merge_skipped(page))) {
+                       do_merge(zone, page, mt);
+                       order = page_order(page);
+               }
+               spin_unlock_irqrestore(&zone->lock, flags);
+       }
+
+       return order;
+}
+
 /**
  * isolate_migratepages_block() - isolate all migrate-able pages within
  *                               a single pageblock
@@ -777,6 +799,12 @@ isolate_migratepages_block(struct compact_control *cc, 
unsigned long low_pfn,
                 */
                if (PageBuddy(page)) {
                        unsigned long freepage_order = page_order_unsafe(page);
+                       /*
+                        * If the page didn't do merging on free time, now do
+                        * it since we are doing compaction.
+                        */
+                       if (page_merge_skipped(page))
+                               freepage_order = merge_page(zone, page, 
low_pfn);
 
                        /*
                         * Without lock, we cannot be sure that what we got is
diff --git a/mm/internal.h b/mm/internal.h
index e6bd35182dae..d2b0ac02d459 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -228,9 +228,22 @@ int find_suitable_fallback(struct free_area *area, 
unsigned int order,
 static inline unsigned int page_order(struct page *page)
 {
        /* PageBuddy() must be checked by the caller */
-       return page_private(page);
+       return page_private(page) & ~(1 << 16);
 }
 
+/*
+ * This function returns if the page is in buddy but didn't do any merging
+ * for performance reason. This function only makes sense if PageBuddy(page)
+ * is also true. The caller should hold zone->lock for this function to return
+ * correct value, or it can handle invalid values gracefully.
+ */
+static inline bool page_merge_skipped(struct page *page)
+{
+       return PageBuddy(page) && (page->private & (1 << 16));
+}
+
+void do_merge(struct zone *zone, struct page *page, int migratetype);
+
 /*
  * Like page_order(), but for callers who cannot afford to hold the zone lock.
  * PageBuddy() should be checked first by the caller to minimize race window,
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 2ac7fa97dd55..9497c8c5f808 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -781,49 +781,14 @@ static inline int page_is_buddy(struct page *page, struct 
page *buddy,
        return 0;
 }
 
-/*
- * Freeing function for a buddy system allocator.
- *
- * The concept of a buddy system is to maintain direct-mapped table
- * (containing bit values) for memory blocks of various "orders".
- * The bottom level table contains the map for the smallest allocatable
- * units of memory (here, pages), and each level above it describes
- * pairs of units from the levels below, hence, "buddies".
- * At a high level, all that happens here is marking the table entry
- * at the bottom level available, and propagating the changes upward
- * as necessary, plus some accounting needed to play nicely with other
- * parts of the VM system.
- * At each level, we keep a list of pages, which are heads of continuous
- * free pages of length of (1 << order) and marked with _mapcount
- * PAGE_BUDDY_MAPCOUNT_VALUE. Page's order is recorded in page_private(page)
- * field.
- * So when we are allocating or freeing one, we can derive the state of the
- * other.  That is, if we allocate a small block, and both were
- * free, the remainder of the region must be split into blocks.
- * If a block is freed, and its buddy is also free, then this
- * triggers coalescing into a block of larger size.
- *
- * -- nyc
- */
-
-static inline void __free_one_page(struct page *page,
-               unsigned long pfn,
-               struct zone *zone, unsigned int order,
-               int migratetype)
+static void inline __do_merge(struct page *page, unsigned int order,
+                               struct zone *zone, int migratetype)
 {
+       unsigned int max_order = min_t(unsigned int, MAX_ORDER, pageblock_order 
+ 1);
+       unsigned long pfn = page_to_pfn(page);
        unsigned long combined_pfn;
        unsigned long uninitialized_var(buddy_pfn);
        struct page *buddy;
-       unsigned int max_order;
-
-       max_order = min_t(unsigned int, MAX_ORDER, pageblock_order + 1);
-
-       VM_BUG_ON(!zone_is_initialized(zone));
-       VM_BUG_ON_PAGE(page->flags & PAGE_FLAGS_CHECK_AT_PREP, page);
-
-       VM_BUG_ON(migratetype == -1);
-       if (likely(!is_migrate_isolate(migratetype)))
-               __mod_zone_freepage_state(zone, 1 << order, migratetype);
 
        VM_BUG_ON_PAGE(pfn & ((1 << order) - 1), page);
        VM_BUG_ON_PAGE(bad_range(zone, page), page);
@@ -879,8 +844,6 @@ static inline void __free_one_page(struct page *page,
        }
 
 done_merging:
-       set_page_order(page, order);
-
        /*
         * If this is not the largest possible page, check if the buddy
         * of the next-highest order is free. If it is, it's possible
@@ -905,9 +868,80 @@ static inline void __free_one_page(struct page *page,
 
        list_add(&page->lru, &zone->free_area[order].free_list[migratetype]);
 out:
+       set_page_order(page, order);
        zone->free_area[order].nr_free++;
 }
 
+void do_merge(struct zone *zone, struct page *page, int migratetype)
+{
+       VM_BUG_ON(page_order(page) != 0);
+
+       list_del(&page->lru);
+       zone->free_area[0].nr_free--;
+       rmv_page_order(page);
+
+       __do_merge(page, 0, zone, migratetype);
+}
+
+static inline bool should_skip_merge(struct zone *zone, unsigned int order)
+{
+#ifdef CONFIG_COMPACTION
+       return !zone->compact_considered && !order;
+#else
+       return false;
+#endif
+}
+
+/*
+ * Freeing function for a buddy system allocator.
+ *
+ * The concept of a buddy system is to maintain direct-mapped table
+ * (containing bit values) for memory blocks of various "orders".
+ * The bottom level table contains the map for the smallest allocatable
+ * units of memory (here, pages), and each level above it describes
+ * pairs of units from the levels below, hence, "buddies".
+ * At a high level, all that happens here is marking the table entry
+ * at the bottom level available, and propagating the changes upward
+ * as necessary, plus some accounting needed to play nicely with other
+ * parts of the VM system.
+ * At each level, we keep a list of pages, which are heads of continuous
+ * free pages of length of (1 << order) and marked with _mapcount
+ * PAGE_BUDDY_MAPCOUNT_VALUE. Page's order is recorded in page_private(page)
+ * field.
+ * So when we are allocating or freeing one, we can derive the state of the
+ * other.  That is, if we allocate a small block, and both were
+ * free, the remainder of the region must be split into blocks.
+ * If a block is freed, and its buddy is also free, then this
+ * triggers coalescing into a block of larger size.
+ *
+ * -- nyc
+ */
+static inline void __free_one_page(struct page *page,
+               unsigned long pfn,
+               struct zone *zone, unsigned int order,
+               int migratetype)
+{
+       VM_BUG_ON(!zone_is_initialized(zone));
+       VM_BUG_ON_PAGE(page->flags & PAGE_FLAGS_CHECK_AT_PREP, page);
+
+       VM_BUG_ON(migratetype == -1);
+       if (likely(!is_migrate_isolate(migratetype)))
+               __mod_zone_freepage_state(zone, 1 << order, migratetype);
+
+       if (should_skip_merge(zone, order)) {
+               list_add(&page->lru, 
&zone->free_area[order].free_list[migratetype]);
+               /*
+                * 1 << 16 set on page->private to indicate this order0
+                * page skipped merging during free time
+                */
+               set_page_order(page, order | (1 << 16));
+               zone->free_area[order].nr_free++;
+               return;
+       }
+
+       __do_merge(page, order, zone, migratetype);
+}
+
 /*
  * A bad page could be due to a number of fields. Instead of multiple branches,
  * try and check multiple fields with one check. The caller must do a detailed
-- 
2.14.3

Reply via email to