If we exhaust the reserves in the page allocator while PF_MEMALLOC is set, then do not give up; instead, call back into page reclaim with PF_MEMALLOC still set.
This is in essence a recursive call back into page reclaim with an additional
gfp flag (__GFP_NOMEMALLOC) set. The recursion is bounded since potential
allocations with __GFP_NOMEMALLOC set will not enter that branch again.

Allocation under PF_MEMALLOC will no longer run out of memory if there is
memory that is reclaimable without additional memory allocations.

In order to make allocation-less reclaim work we need to avoid writing pages
out or swapping. So on entry to try_to_free_pages() we check for
__GFP_NOMEMALLOC. If it is set then sc.may_writepage and sc.may_swap are
switched off and we short circuit the writeout throttling.

The types of pages that can be reclaimed by a call to try_to_free_pages()
with __GFP_NOMEMALLOC set are:

- Unmapped clean page cache pages.
- Mapped clean pages.
- Slab memory (via slab shrinking).

We print a warning if we get into the special reclaim mode because this means
that the reserves are too low.

Changes RFC->v1
- Allow slab shrinking in recursive reclaim (it is protected by a semaphore
  and already had to deal with allocations failing under PF_MEMALLOC).
- Add printk to show that recursive reclaim is being used.

Signed-off-by: Christoph Lameter <[EMAIL PROTECTED]>

---
 mm/vmscan.c |   25 ++++++++++++++++++++++---
 1 file changed, 22 insertions(+), 3 deletions(-)

Index: linux-2.6/mm/vmscan.c
===================================================================
--- linux-2.6.orig/mm/vmscan.c	2007-08-23 13:28:32.000000000 -0700
+++ linux-2.6/mm/vmscan.c	2007-08-23 13:32:42.000000000 -0700
@@ -1106,7 +1106,8 @@ static unsigned long shrink_zone(int pri
                 }
         }
 
-        throttle_vm_writeout(sc->gfp_mask);
+        if (!(sc->gfp_mask & __GFP_NOMEMALLOC))
+                throttle_vm_writeout(sc->gfp_mask);
 
         atomic_dec(&zone->reclaim_in_progress);
         return nr_reclaimed;
@@ -1168,6 +1169,9 @@ static unsigned long shrink_zones(int pr
  * hope that some of these pages can be written. But if the allocating task
  * holds filesystem locks which prevent writeout this might not work, and the
  * allocation attempt will fail.
+ *
+ * The __GFP_NOMEMALLOC flag has a special role. If it is set then no memory
+ * allocations or writeout will occur.
  */
 unsigned long try_to_free_pages(struct zone **zones, int order, gfp_t gfp_mask)
 {
@@ -1180,15 +1184,21 @@ unsigned long try_to_free_pages(struct z
         int i;
         struct scan_control sc = {
                 .gfp_mask = gfp_mask,
-                .may_writepage = !laptop_mode,
                 .swap_cluster_max = SWAP_CLUSTER_MAX,
-                .may_swap = 1,
                 .swappiness = vm_swappiness,
                 .order = order,
         };
 
         count_vm_event(ALLOCSTALL);
 
+        if (gfp_mask & __GFP_NOMEMALLOC) {
+                if (printk_ratelimit())
+                        printk(KERN_WARNING "Entering recursive reclaim due "
+                                "to depleted memory reserves\n");
+        } else {
+                sc.may_writepage = !laptop_mode;
+                sc.may_swap = 1;
+        }
 
         for (i = 0; zones[i] != NULL; i++) {
                 struct zone *zone = zones[i];
 
@@ -1215,6 +1225,9 @@ unsigned long try_to_free_pages(struct z
                         goto out;
                 }
 
+                if (gfp_mask & __GFP_NOMEMALLOC)
+                        continue;
+
                 /*
                  * Try to write back as many pages as we just scanned. This
                  * tends to cause slow streaming writers to write data to the
Index: linux-2.6/mm/page_alloc.c
===================================================================
--- linux-2.6.orig/mm/page_alloc.c	2007-08-23 13:34:50.000000000 -0700
+++ linux-2.6/mm/page_alloc.c	2007-08-23 13:36:59.000000000 -0700
@@ -1319,6 +1319,20 @@ nofail_alloc:
                                 zonelist, ALLOC_NO_WATERMARKS);
                         if (page)
                                 goto got_pg;
+                        /*
+                         * No memory is available at all.
+                         *
+                         * However, if we are already in reclaim then the
+                         * reclaim_state etc is already setup. Simply call
+                         * try_to_free_pages() while PF_MEMALLOC is set, which
+                         * will reclaim without the need to allocate more
+                         * memory.
+                         */
+                        if (p->flags & PF_MEMALLOC && wait &&
+                                try_to_free_pages(zonelist->zones, order,
+                                        gfp_mask | __GFP_NOMEMALLOC))
+                                goto restart;
+
                         if (gfp_mask & __GFP_NOFAIL) {
                                 congestion_wait(WRITE, HZ/50);
                                 goto nofail_alloc;
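
For illustration only, here is a small userspace sketch of the two reclaim
setups described above. It is not kernel code: the __GFP_NOMEMALLOC value,
the scan_control struct and the configure_reclaim() /
should_recurse_into_reclaim() helpers are simplified stand-ins invented for
this sketch; the real definitions live in include/linux/gfp.h, mm/vmscan.c
and mm/page_alloc.c.

/* recursive_reclaim_sketch.c - build with: gcc -Wall -o sketch recursive_reclaim_sketch.c */
#include <stdbool.h>
#include <stdio.h>

#define __GFP_NOMEMALLOC 0x10000u   /* placeholder bit, not the real gfp.h value */

/* Simplified stand-in for the kernel's struct scan_control. */
struct scan_control {
        unsigned int gfp_mask;
        bool may_writepage;
        bool may_swap;
        bool throttle_writeout;
};

/*
 * Mirrors the setup added at the top of try_to_free_pages(): with
 * __GFP_NOMEMALLOC, reclaim must not write pages out, must not swap and
 * must not block in the writeout throttling path.
 */
static void configure_reclaim(struct scan_control *sc, unsigned int gfp_mask,
                              bool laptop_mode)
{
        sc->gfp_mask = gfp_mask;
        if (gfp_mask & __GFP_NOMEMALLOC) {
                /* allocation-less reclaim: clean pages and slab only */
                sc->may_writepage = false;
                sc->may_swap = false;
                sc->throttle_writeout = false;
                fprintf(stderr, "Entering recursive reclaim due to "
                        "depleted memory reserves\n");
        } else {
                /* normal direct reclaim */
                sc->may_writepage = !laptop_mode;
                sc->may_swap = true;
                sc->throttle_writeout = true;
        }
}

/*
 * Mirrors the decision in the page_alloc.c hunk: only a PF_MEMALLOC task
 * whose allocation may wait recurses into reclaim, and an allocation that
 * already carries __GFP_NOMEMALLOC never does, which bounds the recursion.
 */
static bool should_recurse_into_reclaim(unsigned int gfp_mask,
                                        bool task_pf_memalloc, bool wait)
{
        if (gfp_mask & __GFP_NOMEMALLOC)
                return false;
        return task_pf_memalloc && wait;
}

int main(void)
{
        struct scan_control normal, recursive;

        configure_reclaim(&normal, 0, false);
        configure_reclaim(&recursive, __GFP_NOMEMALLOC, false);

        printf("normal:    writepage=%d swap=%d throttle=%d\n",
               normal.may_writepage, normal.may_swap, normal.throttle_writeout);
        printf("recursive: writepage=%d swap=%d throttle=%d\n",
               recursive.may_writepage, recursive.may_swap,
               recursive.throttle_writeout);
        printf("recurse again under __GFP_NOMEMALLOC? %d\n",
               should_recurse_into_reclaim(__GFP_NOMEMALLOC, true, true));
        return 0;
}

The kernel patch does this inside try_to_free_pages() and the page allocator
itself; the sketch only puts the two scan_control configurations and the
recursion bound side by side.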