On Fri, Oct 05, 2007 at 10:20:05AM -0700, Andrew Morton wrote:
> On Fri, 5 Oct 2007 20:30:28 +0800
> Fengguang Wu <[EMAIL PROTECTED]> wrote:
>
> > > commit c4e2d7ddde9693a4c05da7afd485db02c27a7a09
> > > Author: akpm <akpm>
> > > Date:   Sun Dec 22 01:07:33 2002 +0000
> > >
> > >     [PATCH] Give kswapd writeback higher priority than pdflush
> > >
> > >     The `low latency page reclaim' design works by preventing page
> > >     allocators from blocking on request queues (and by preventing them
> > >     from blocking against writeback of individual pages, but that is
> > >     immaterial here).
> > >
> > >     This has a problem under some situations.  pdflush (or a write(2)
> > >     caller) could be saturating the queue with highmem pages.  This
> > >     prevents anyone from writing back ZONE_NORMAL pages.  We end up
> > >     doing enormous amounts of scanning.
> >
> > Sorry, I cannot understand it.  We now have balanced aging between
> > zones.  So aren't the page allocations expected to distribute
> > proportionally between ZONE_HIGHMEM and ZONE_NORMAL?
>
> Sure, but we don't have one disk queue per disk per zone!  The queue is
> shared by all the zones.  So if writeback from one zone has filled the
> queue up, the kernel can't write back data from another zone.

Hmm, that's a problem.  But I guess when one zone is full, the other
zones will not be far behind...  It's a "sooner or later" problem.

> (Well, it can, by blocking in get_request_wait(), but that causes long and
> uncontrollable latencies.)

I guess PF_SWAPWRITE processes still have a good chance of getting stuck
in get_request_wait(), because balance_dirty_pages() is allowed to
disregard the congestion.  It will be exhausting the available request
slots all the time.

Signed-off-by: Fengguang Wu <[EMAIL PROTECTED]>
---
 mm/page-writeback.c |    1 +
 1 file changed, 1 insertion(+)

--- linux-2.6.23-rc8-mm2.orig/mm/page-writeback.c
+++ linux-2.6.23-rc8-mm2/mm/page-writeback.c
@@ -400,6 +400,7 @@ static void balance_dirty_pages(struct a
 			.sync_mode	= WB_SYNC_NONE,
 			.older_than_this = NULL,
 			.nr_to_write	= write_chunk,
+			.nonblocking	= 1,
 			.range_cyclic	= 1,
 		};

> > > A test case is to mmap(MAP_SHARED) almost all of a 4G machine's
> > > memory, then kill the mmapping applications.  The machine instantly
> > > goes from 0% of memory dirty to 95% or more.  pdflush kicks in and
> > > starts writing the least-recently-dirtied pages, which are all
> > > highmem.  The queue is congested so nobody will write back
> > > ZONE_NORMAL pages.  kswapd chews 50% of the CPU scanning past dirty
> > > ZONE_NORMAL pages and page reclaim efficiency
> > > (pages_reclaimed/pages_scanned) falls to 2%.
> > >
> > > So this patch changes the policy for kswapd.  kswapd may use all of a
> > > request queue, and is prepared to block on request queues.
> > >
> > > What will now happen in the above scenario is:
> > >
> > > 1: The page allocator scans some pages, fails to reclaim enough
> > >    memory and takes a nap in blk_congestion_wait().
> > >
> > > 2: kswapd() will scan the ZONE_NORMAL LRU and will start writing
> > >    back pages.  (These pages will be rotated to the tail of the
> > >    inactive list at IO-completion interrupt time.)
> > >
> > > This writeback will saturate the queue with ZONE_NORMAL pages.
> > > Conveniently, pdflush will avoid the congested queues.  So we end up
> > > writing the correct pages.
> > >
> > > In this test, kswapd CPU utilisation falls from 50% to 2%, page
> > > reclaim efficiency rises from 2% to 40% and things are generally a
> > > lot happier.
> >
> > We may see the same problem and improvement in the absence of the 'all
> > writeback goes to one zone' assumption.
> >
> > The problem could be:
> > - dirty_thresh is exceeded, so balance_dirty_pages() starts syncing
> >   data and quickly _congests_ the queue;
>
> Or someone ran fsync(), or pdflush is writing back data because it exceeded
> dirty_writeback_centisecs, etc.

Ah, yes.

> > - dirty pages are slowly but continuously turned into clean pages by
> >   balance_dirty_pages(), but they still stay in the same place in the
> >   LRU;
> > - the zones are mostly dirty/writeback pages, so kswapd has a hard time
> >   finding the randomly distributed clean pages;
> > - kswapd cannot do the writeout because the queue is congested!
> >
> > The improvement could be:
> > - kswapd is now explicitly preferred to do the writeout;
> > - the pages written by kswapd will be rotated and easy for kswapd to
> >   reclaim;
> > - it becomes possible for kswapd to wait on the congested queue,
> >   instead of doing the vmscan like mad.
>
> Yeah.  In 2.4 and early 2.5, page-reclaim (both direct reclaim and kswapd,
> iirc) would throttle by waiting on writeout of a particular page.  This was
> a poor design, because writeback against a *particular* page can take
> anywhere from one millisecond to thirty seconds to complete, depending upon
> where the disk head is and all that stuff.
>
> The critical change I made was to switch the throttling algorithm from
> "wait for one page to get written" to "wait for _any_ page to get written".
> Because reclaim really doesn't care _which_ page got written: we want to
> wake up and start scanning again when _any_ page got written.

That must be a big improvement!

> That's what congestion_wait() does.
>
> It is pretty crude.  It could be that writeback completed against pages
> which aren't in the correct zone, or it could be that some other task went
> and allocated the just-cleaned pages before this task can get running and
> reclaim them, or it could be that the just-written-back pages weren't
> reclaimable after all, etc.
>
> It would take a mind-boggling amount of logic and locking to make all this
> 100% accurate and the need has never been demonstrated.  So page reclaim
> presently should be viewed as a polling algorithm, where the rate of
> polling is paced by the rate at which the IO system can retire writes.

Yeah.  So the polling overheads are limited.

> > The congestion wait looks like a pretty natural way to throttle kswapd.
> > Instead of doing the vmscan at 1000MB/s and actually freeing pages at
> > 60MB/s (about the write throughput), kswapd will be relaxed to do the
> > vmscan at maybe 150MB/s.
>
> Something like that.
>
> The critical numbers to watch are /proc/vmstat's *scan* and *steal*.  Look:
>
> akpm:/usr/src/25> uptime
>  10:08:14 up 10 days, 16:46, 15 users,  load average: 0.02, 0.05, 0.04
> akpm:/usr/src/25> grep steal /proc/vmstat
> pgsteal_dma 0
> pgsteal_dma32 0
> pgsteal_normal 0
> pgsteal_high 0
> pginodesteal 0
> kswapd_steal 1218698
> kswapd_inodesteal 266847
> akpm:/usr/src/25> grep scan /proc/vmstat
> pgscan_kswapd_dma 0
> pgscan_kswapd_dma32 1246816
> pgscan_kswapd_normal 0
> pgscan_kswapd_high 0
> pgscan_direct_dma 0
> pgscan_direct_dma32 448
> pgscan_direct_normal 0
> pgscan_direct_high 0
> slabs_scanned 2881664
>
> Ignore kswapd_inodesteal and slabs_scanned.  We see that this machine has
> scanned 1246816+448 pages and has reclaimed (stolen) 1218698 pages.  That's
> a reclaim success rate of 97.7%, which is pretty damn good - this machine
> is just a lightly-loaded 3GB desktop.
>
> When testing reclaim, it is critical that this ratio be monitored (vmmon.c
> from ext3-tools is a vmstat-like interface to /proc/vmstat).  If the
> reclaim efficiency falls below, umm, 25% then things are getting into some
> trouble.

Nice tool!  I've been watching the raw numbers by writing scripts.
This one can be written as:

#!/bin/bash
while true; do
	while read a b
	do
		eval $a=$b
	done < /proc/vmstat

	uptime=$(</proc/uptime)
	scan=$(( pgscan_kswapd_dma + pgscan_kswapd_dma32 +
		 pgscan_kswapd_normal + pgscan_kswapd_high +
		 pgscan_direct_dma + pgscan_direct_dma32 +
		 pgscan_direct_normal + pgscan_direct_high ))
	steal=$(( pgsteal_dma + pgsteal_dma32 +
		  pgsteal_normal + pgsteal_high ))
	ratio=$(( 100 * steal / (scan + 1) ))
	echo -e "$uptime\t$ratio%\t$steal\t$scan"

	sleep 1
done

Not surprisingly, I see nice numbers on my desktop:

9517.99 9368.60	96%	1898452	1961536

> Actually, 25% is still pretty good.  We scan 4 pages for each reclaimed
> page, but the amount of wall time which that takes is vastly less than the
> time to write one page, bearing in mind that these things tend to be seeky
> as hell.  But still, keeping an eye on the reclaim efficiency is just your
> basic starting point for working on page reclaim.

Thank you for the nice tip :-)

Fengguang
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/
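The ~25% rule of thumb above can also be turned into an explicit check.
A minimal sketch in the same style as the monitoring script (the
check_efficiency helper name is illustrative, not from the thread):

```shell
# Illustrative helper: classify a steal/scan pair against the ~25%
# reclaim-efficiency threshold discussed above.  The +1 avoids a
# divide-by-zero on an idle machine, as in the monitoring script.
check_efficiency() {
	local steal=$1 scan=$2
	local ratio=$(( 100 * steal / (scan + 1) ))
	if [ "$ratio" -lt 25 ]; then
		echo "TROUBLE: reclaim efficiency ${ratio}%"
	else
		echo "OK: reclaim efficiency ${ratio}%"
	fi
}

check_efficiency 1218698 1247264    # the healthy desktop numbers above
check_efficiency 20000 1000000      # kswapd scanning like mad
```

The first call reports OK at 97%; the second trips the threshold at 1%.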