If we exhaust the reserves in the page allocator when PF_MEMALLOC is set
then no longer give up but call into reclaim with PF_MEMALLOC set.

This is in essence a recursive call back into page reclaim with another
page flag (__GFP_NOMEMALLOC) set. The recursion is bounded since potential
allocations with __GFP_NOMEMALLOC set will not enter that branch again.

Allocation under PF_MEMALLOC will no longer run out outmemory if there 
memory that is reclaimable without additional memory
allocations.

In order to make allocation-less reclaim working we need to avoid writing
pages out or swapping. So on entry to try_to_free_pages() we check for
__GFP_NOMEMALLOC. If it is set then sc.may_writepage and sc.mayswap are
switched off and we short circuit the writeout throttling.

The types of pages that can be reclaimed by a call to try_to_free_pages()
with the __GFP_NOMEMALLOC parameter are:

- Unmapped clean page cache pages.
- Mapped clean pages
- slab shrinking

We print a warning if we get into the special reclaim mode because
this means that the reserves are too low.

Changes
RFC->v1
- Allow slab shrinking in recursive reclaim (is protected by a
  semaphore and already had to deal with allocs failing under
  PF_MEMALLOC)
- Add printk to show that recursive reclaim is being used.

Signed-off-by: Christoph Lameter <[EMAIL PROTECTED]>

---
 mm/vmscan.c |   25 ++++++++++++++++++++++---
 1 file changed, 22 insertions(+), 3 deletions(-)

Index: linux-2.6/mm/vmscan.c
===================================================================
--- linux-2.6.orig/mm/vmscan.c  2007-08-23 13:28:32.000000000 -0700
+++ linux-2.6/mm/vmscan.c       2007-08-23 13:32:42.000000000 -0700
@@ -1106,7 +1106,8 @@ static unsigned long shrink_zone(int pri
                }
        }
 
-       throttle_vm_writeout(sc->gfp_mask);
+       if (!(sc->gfp_mask & __GFP_NOMEMALLOC))
+               throttle_vm_writeout(sc->gfp_mask);
 
        atomic_dec(&zone->reclaim_in_progress);
        return nr_reclaimed;
@@ -1168,6 +1169,9 @@ static unsigned long shrink_zones(int pr
  * hope that some of these pages can be written.  But if the allocating task
  * holds filesystem locks which prevent writeout this might not work, and the
  * allocation attempt will fail.
+ *
+ * The __GFP_NOMEMALLOC flag has a special role. If it is set then no memory
+ * allocations or writeout will occur.
  */
 unsigned long try_to_free_pages(struct zone **zones, int order, gfp_t gfp_mask)
 {
@@ -1180,15 +1184,21 @@ unsigned long try_to_free_pages(struct z
        int i;
        struct scan_control sc = {
                .gfp_mask = gfp_mask,
-               .may_writepage = !laptop_mode,
                .swap_cluster_max = SWAP_CLUSTER_MAX,
-               .may_swap = 1,
                .swappiness = vm_swappiness,
                .order = order,
        };
 
        count_vm_event(ALLOCSTALL);
 
+       if (gfp_mask & __GFP_NOMEMALLOC) {
+               if (printk_ratelimited())
+                       printk(KERN_WARNING "Entering recursive reclaim due "
+                                       "to depleted memory reserves\n");
+       } else {
+               sc.may_writepage = !laptop_mode;
+               sc.may_swap = 1;
+       }
        for (i = 0; zones[i] != NULL; i++) {
                struct zone *zone = zones[i];
 
@@ -1215,6 +1225,9 @@ unsigned long try_to_free_pages(struct z
                        goto out;
                }
 
+               if (!(gfp_mask & __GFP_NOMEMALLOC))
+                       continue;
+
                /*
                 * Try to write back as many pages as we just scanned.  This
                 * tends to cause slow streaming writers to write data to the
Index: linux-2.6/mm/page_alloc.c
===================================================================
--- linux-2.6.orig/mm/page_alloc.c      2007-08-23 13:34:50.000000000 -0700
+++ linux-2.6/mm/page_alloc.c   2007-08-23 13:36:59.000000000 -0700
@@ -1319,6 +1319,20 @@ nofail_alloc:
                                zonelist, ALLOC_NO_WATERMARKS);
                        if (page)
                                goto got_pg;
+                       /*
+                        * No memory is available at all.
+                        *
+                        * However, if we are already in reclaim then the
+                        * reclaim_state etc is already setup. Simply call
+                        * try_to_get_free_pages() with PF_MEMALLOC which
+                        * will reclaim without the need to allocate more
+                        * memory.
+                        */
+                       if (p->flags & PF_MEMALLOC && wait &&
+                               try_to_free_pages(zonelist->zones, order,
+                                               gfp_mask | __GFP_NOMEMALLOC))
+                               goto restart;
+
                        if (gfp_mask & __GFP_NOFAIL) {
                                congestion_wait(WRITE, HZ/50);
                                goto nofail_alloc;
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Reply via email to