On Mon, Jul 23, 2007 at 07:00:59PM +1000, Nick Piggin wrote:
> Rusty Russell wrote:
> >On Sun, 2007-07-22 at 16:10 +0800, Fengguang Wu wrote:
> >
> >>So I opt for it being made tunable, safe, and turned off by default.
> 
> I hate tunables :) Unless we have workload A that gets a reasonable
> benefit from something and workload B that gets a significant regression,
> and no clear way to reconcile them...
Me too ;)  But sometimes we really want to avoid flushing the cache.
Andrew's user-space LD_PRELOAD+fadvise based tool fits nicely here.

> >I'd like to see it turned on by default in -mm, and try to come up with
> >some server-like workload to measure the effect.  Should be easy to
> >simulate something (eg. apache server, where clients grab some files in
> >preference, and apache server where clients grab different files).
> 
> I don't like this kind of conditional information going from something
> like readahead into page reclaim. Unless it is for readahead _specific_
> data such as "I got these all wrong, so you can reclaim them" (which
> this isn't).
> 
> Possibly it makes sense to realise that the given pages are cheaper
> to read back in as they are apparently being read-ahead very nicely.

In fact, I talked to Jens about it at last year's kernel summit.
The patch below explains itself.

---
Subject: cost based page reclaim

Cost based page reclaim - a minimalist implementation.

Suppose we cached 32 small files, each one page long, and one 32-page
chunk from a large file.  Should we first drop the 32 pages that were
read in a single I/O, or the 32 distinct pages that each cost one I/O?
(Given that the files are equally hot.)

Page replacement algorithms should be designed to minimize the number
of I/Os, not the number of page faults.

Dividing the cost of an I/O by the number of pages it brings in gives
the cost of each page.  The bigger the page cost, the more
'lives/blood' the page should have.  This patch adds life to costly
pages by pretending that they are referenced more times.

Possible downsides:
- adds to the pressure on vmscan
- active pages are no longer that 'active'

Signed-off-by: Fengguang Wu <[EMAIL PROTECTED]>
---
 include/linux/backing-dev.h |    1 +
 mm/readahead.c              |   27 +++++++++++++++++++++++++++
 mm/swap.c                   |    5 ++++-
 3 files changed, 32 insertions(+), 1 deletion(-)

--- linux-2.6.22.orig/mm/readahead.c
+++ linux-2.6.22/mm/readahead.c
@@ -125,6 +125,7 @@ static void ra_account(struct file_ra_st
 
 struct backing_dev_info default_backing_dev_info = {
         .ra_pages       = MAX_RA_PAGES,
+        .avg_ra_size    = MAX_RA_PAGES * PAGE_CACHE_SIZE,
         .ra_pages0      = INITIAL_RA_PAGES,
         .ra_thrash_bytes = MAX_RA_PAGES * PAGE_CACHE_SIZE,
         .state          = 0,
@@ -133,6 +134,26 @@ struct backing_dev_info default_backing_
 };
 EXPORT_SYMBOL_GPL(default_backing_dev_info);
 
+#define log2(n) fls(n)
+static int readahead_cost_per_page(struct address_space *mapping,
+                                   unsigned long pages)
+{
+        unsigned long avg;
+        int cost;
+
+        avg = mapping->backing_dev_info->avg_ra_size;
+        avg = (pages * PAGE_CACHE_SIZE + avg * 1023) / 1024;
+        mapping->backing_dev_info->avg_ra_size = avg;
+
+        avg = avg / PAGE_CACHE_SIZE;
+        if (!avg || !pages)
+                cost = 0;
+        else
+                cost = (log2(avg) - log2(pages)) * 5 / log2(avg);
+
+        return cost;
+}
+
 /*
  * Initialise a struct file's readahead state.  Assumes that the caller has
  * memset *ra to zero.
@@ -418,6 +439,7 @@ __do_page_cache_readahead(struct address
         struct page *page;
         unsigned long end_index;        /* The last page we want to read */
         LIST_HEAD(page_pool);
+        int cost;
         int page_idx;
         int ret = 0;
         loff_t isize = i_size_read(inode);
@@ -426,6 +448,7 @@ __do_page_cache_readahead(struct address
                 goto out;
 
         end_index = ((isize - 1) >> PAGE_CACHE_SHIFT);
+        cost = readahead_cost_per_page(mapping, nr_to_read);
 
         /*
          * Preallocate as many pages as we will need.
@@ -448,6 +471,10 @@ __do_page_cache_readahead(struct address
                 if (!page)
                         break;
                 page->index = page_offset;
+                if (cost >= 3)
+                        SetPageActive(page);
+                else if (cost == 2)
+                        SetPageReferenced(page);
                 list_add(&page->lru, &page_pool);
                 if (page_idx == nr_to_read - lookahead_size)
                         SetPageReadahead(page);
--- linux-2.6.22.orig/mm/swap.c
+++ linux-2.6.22/mm/swap.c
@@ -365,7 +365,10 @@ void __pagevec_lru_add(struct pagevec *p
                 }
                 VM_BUG_ON(PageLRU(page));
                 SetPageLRU(page);
-                add_page_to_inactive_list(zone, page);
+                if (!PageActive(page))
+                        add_page_to_inactive_list(zone, page);
+                else
+                        add_page_to_active_list(zone, page);
         }
         if (zone)
                 spin_unlock_irq(&zone->lru_lock);
--- linux-2.6.22.orig/include/linux/backing-dev.h
+++ linux-2.6.22/include/linux/backing-dev.h
@@ -26,6 +26,7 @@ typedef int (congested_fn)(void *, int);
 
 struct backing_dev_info {
         unsigned long ra_pages; /* max readahead in PAGE_CACHE_SIZE units */
+        unsigned long avg_ra_size;
         unsigned long ra_pages0;        /* min readahead on start of file */
         unsigned long ra_thrash_bytes;  /* estimated thrashing threshold */
         unsigned long state;    /* Always use atomic bitops on this */
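
P.S. To get a feel for the numbers the heuristic produces, here is a
quick user-space model of readahead_cost_per_page() -- my own toy
sketch, not part of the patch.  It assumes 4096-byte pages, emulates
the kernel's fls() with a GCC builtin, and copies the 1023/1024
moving average and 0..5 cost scale from the function above; the names
cost_per_page/fls_ul/avg_ra_size are just illustrative stand-ins.

#include <stdio.h>

#define PAGE_SIZE       4096UL

/* fls() lookalike: position of the highest set bit, 1-based; 0 for 0 */
static int fls_ul(unsigned long n)
{
        return n ? (int)(8 * sizeof(n)) - __builtin_clzl(n) : 0;
}
#define log2(n) fls_ul(n)

/* running average of readahead size, in bytes (like bdi->avg_ra_size) */
static unsigned long avg_ra_size = 32 * PAGE_SIZE;

static int cost_per_page(unsigned long pages)
{
        unsigned long avg;

        /* 1023/1024 exponential moving average, as in the patch */
        avg = (pages * PAGE_SIZE + avg_ra_size * 1023) / 1024;
        avg_ra_size = avg;

        avg /= PAGE_SIZE;
        if (!avg || !pages)
                return 0;

        /* requests small relative to the average get a bigger cost */
        return (log2(avg) - log2(pages)) * 5 / log2(avg);
}

int main(void)
{
        unsigned long sizes[] = { 1, 2, 4, 8, 16, 32 };
        unsigned int i;

        for (i = 0; i < sizeof(sizes) / sizeof(sizes[0]); i++)
                printf("%2lu-page readahead => cost %d\n",
                       sizes[i], cost_per_page(sizes[i]));
        return 0;
}

With the default 32-page average this prints costs 4, 3, 2, 1, 0, -1
for the 1..32-page requests: 1- and 2-page reads would enter the LRU
active (cost >= 3), 4-page reads referenced (cost == 2), and full-size
32-page chunks stay on the inactive list as before.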