On Tue, 22 Oct 2002, Seigo Tanimura wrote:

> Introduction:
>
> The I/O buffers of the kernel are currently allocated in buffer_map,
> which is sized statically upon boot and never grows.  This limits the
> scale of I/O performance on a host with large physical memory.  We
> used to tune NBUF to cope with that problem.  This workaround,
> however, results in a lot of wired pages not available for user
> processes, which is not acceptable for memory-bound applications.
>
> In order to run both I/O-bound and memory-bound processes on the same
> host, it is essential to achieve:
>
> A) allocation of buffers from kernel_map, to break the limit of a
>    map size, and
>
> B) page reclaim from idle buffers, to regulate the number of wired
>    pages.
>
> The patch at:
>
> http://people.FreeBSD.org/~tanimura/patches/dynamicbuf.diff.gz
I should be the last to defend the current design and implementation of
the buffer cache, since I think it gets almost everything wrong (the
implementation is OK, but has vast complications to work around design
errors), but I think buffer_map is one of the things that it gets right
(if we're going to have buffers at all).

Some history of this problem:

FreeBSD-1:

Allocating from kernel_map instead of buffer_map would almost take us
back to FreeBSD-1, where buffers were allocated from kmem_map using
malloc().  This caused larger problems with fragmentation.  Some of
these were due to foot-shooting, but I think large-memory machines give
essentially the same problems, and complete fragmentation of kernel_map
would cause more problems than complete fragmentation of any other map.

Part of the foot-shooting was to allocate too little vm to the kernel
and correspondingly too little vm to kmem_map.  The (i386) kernel was
originally at 0xFE000000, so there was only 32MB of kernel vm.  32MB
was far too small even for the relatively small physical memories at
the time (1992 or 1993), so this was changed to 0xF0000000 in
FreeBSD-1.1.5.  Then there was 256MB of kernel vm.  I suspect that this
increase reduced the fragmentation problems to insignificance in most
but not all cases.

Some of the interesting cases at the time of FreeBSD-1 were:

- machines with a small amount of physical memory.  These should have
  few problems, since there is not enough physical memory to make the
  maps more than sparse (unless the maps are undersized).

- machines with a not so small amount of physical memory.  It's
  possible that the too-small-in-general value for nbuf limited the
  problems.

- machines which only use one type of filesystem with one (small?)
  block size.  If all allocations have the same size, then there need
  be no fragmentation.  I'm not sure how strong this effect was in
  FreeBSD-1.  malloc() used a power-of-2 algorithm, but only up to a
  certain size, which covered 4K blocks but possibly not 8K blocks.
  Note that machines with large amounts of memory were likely to be
  specialized machines, so they were likely to take advantage of this
  without really trying, just by not mounting or not significantly
  using unusual filesystems like msdosfs, ext2fs and cd9660.

I used the following allocation policies in my version of FreeBSD-1.1.5:

- enlarge nbuf and the limit on buffer space (freebufspace) by a factor
  of 2 or 4 to get a larger buffer cache.

- enlarge nbuf by another factor of 8, but don't enlarge freebufspace,
  so that buffers of size 512 can hold as much as buffers of size 4096.
  (I didn't care about buffers of size 8192 or larger at the time.)

- actually enforce the freebufspace limit by discarding buffers in
  allocbuf() using a simplistic algorithm.

This worked well enough, but I only tested it on 486's with 8-16MB of
RAM.  The buffer cache had a size of 2MB or so.

End of FreeBSD-1 history.

FreeBSD-[2-5]:

Use of buffer_map was somehow implemented at the beginning, in rev.1.2
of vfs_bio.c, although this wasn't in FreeBSD-1.1.5.  Either I'm
missing some history or it was only in dyson's tree for FreeBSD-1.
Rev.1.2 used buffer_map in its purest form: each of nbuf buffers has a
data buffer consisting of MAXBSIZE bytes of vm attached to it at
bufinit() time.  The allocation never changes, and we simply map
physical pages into the vm when we have actual data.  The problems with
this are that MAXBSIZE is rather large and nbuf should be rather large
(and/or dynamic).  Subsequent changes added vast complications to
reduce the amount of vm.
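For concreteness, the pure rev.1.2 scheme described above amounts to
roughly the following (a minimal sketch of my reading of it, not the
actual vfs_bio.c code; NBUF, MAXBSIZE, b_kvabase and b_bufsize are real
names, but the struct layout and the init helper here are simplified):

    #include <stddef.h>

    #define NBUF      1024          /* number of buffer headers (nbuf) */
    #define MAXBSIZE  65536         /* maximum filesystem block size */

    struct buf {
        char *b_kvabase;            /* fixed MAXBSIZE window of kernel vm */
        long  b_bufsize;            /* bytes of physical pages mapped in */
    };

    static struct buf buffers[NBUF];

    /*
     * At bufinit() time, give each buffer a permanent MAXBSIZE slice
     * of buffer_map.  The kva assignment never changes afterwards;
     * allocbuf() only maps and unmaps physical pages within a slice,
     * so the map cannot fragment.
     */
    static void
    bufinit_sketch(char *buffer_map_base)
    {
        int i;

        for (i = 0; i < NBUF; i++) {
            buffers[i].b_kvabase = buffer_map_base + (size_t)i * MAXBSIZE;
            buffers[i].b_bufsize = 0;
        }
    }

The cost is obvious from the arithmetic: the reservation is always
nbuf * MAXBSIZE of vm whether or not the buffers hold data, and the
complications below all exist to shrink that reservation.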
I think these complications should only exist on machines with limited
amounts of vm (mainly i386's).

One of the complications was to reintroduce fragmentation problems:
buffer_map only has enough space for nbuf buffers of size BKVASIZE, and
the mappings are not statically allocated.  Another of the
complications is to discard buffers to reduce the fragmentation
problems.  Perhaps similar defragmentation would have worked well
enough in FreeBSD-1.1.  I suspect that your change depends on this
defragmentation, but I don't think the defragmentation can work as
well, since it can only touch buffers and not collateral fragmentation
of kernel_map.

I use the following changes in -current to enlarge the buffer cache and
avoid fragmentation.  These only work because I don't have much
physical memory (512MB max).  Even i386's have enough vm for the pure
form of buffer_map to work:

- enlarge BKVASIZE to MAXBSIZE so that fragmentation cannot (should
  not?) occur.

- enlarge nbuf by a factor of (my_BKVASIZE / current_BKVASIZE) to work
  around bugs.  The point of BKVASIZE got lost somewhere.

- enlarge nbuf and associated variables by another factor of 2 or 4 to
  get a larger buffer cache.  This is marginal for 512MB physical, and
  probably wouldn't work if I had a lot of mbufs.

With these changes, nbuf is about 4000 and buffer_map takes about 256MB
(see the sketch below for the arithmetic).  256MB is a lot of vm, but
nbuf = 4000 isn't a lot of buffers.  I used buffer caches with 2000 *
1K buffers under Minix and Linux before FreeBSD, and ISTR having an
nbuf of 5000 or so in FreeBSD-1.1.  At least 2880 buffers are needed to
properly cache a tiny 1.44MB floppy with an msdosfs file system with a
block size of 512, and that was an important test case.

End of FreeBSD-[2-5] history.

> implements buffer allocation from kernel_map and reclaim of buffer
> pages.  With this patch, make kernel-depend && make kernel completes
> about 30-60 seconds faster on my PC.

I don't understand how you got such large improvements.  My changes
make very little difference in -current, although they once made a
larger difference.  At one point there were significant pessimizations
in the buffer cache, but I thought that they were fixed.  The
pessimizations involved doing lots of remappings and/or lots of
reconstitutions of buffers.  These are very expensive operations; the
remapping alone took longer than copying the data at 100MB/sec on a
Celeron/366-overclocked.  Perhaps your test is hitting a pessimized
case.

> Experimental Evaluation and Results:
>
> The times taken to complete make kernel-depend && make kernel just
> after booting into single-user mode have been measured on my ThinkPad
> 600E (CPU: Pentium II 366MHz, RAM: 160MB) by time(1).  The number
> passed to the -j option of make(1) has been varied from 1 to 30 in
> order to control the pressure of the memory demand for user
> processes.  The baseline is the kernel without my patch.
>
> The following table shows the results.  All of the times are in
> seconds.
>
>               baseline                 w/ my patch
> -j       real     user     sys     real     user     sys
>   1   1608.21  1387.94  125.96  1577.88  1391.02  100.90
>  10   1576.10  1360.17  132.76  1531.79  1347.30  103.60
>  20   1568.01  1280.89  133.22  1509.36  1276.75  104.69
>  30   1923.42  1215.00  155.50  1865.13  1219.07  113.43
>
> Most of the improvements in the real times are accomplished by the
> speedup of system calls.  The hit ratio of getblk() may be increased,
> but not examined yet.

I think the improvements can only be explained by reduced thrashing of
something (probably not just the buffer cache itself, due to nbuf being
small).
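To spell out the sizing arithmetic from the FreeBSD-[2-5] notes above
(a throwaway user-level sketch; the constants are just the values
quoted in this message, not authoritative kernel defaults):

    #include <stdio.h>

    #define MAXBSIZE  (64 * 1024)   /* 64K maximum block size */
    #define BKVASIZE  MAXBSIZE      /* my change: BKVASIZE = MAXBSIZE */
    #define NBUF      4000          /* nbuf after scaling up */

    int
    main(void)
    {
        /*
         * buffer_map vm consumption: nbuf slices of BKVASIZE each.
         * 4000 * 64K is ~250MB, i.e., the "about 256MB" above.
         */
        unsigned long map_bytes = (unsigned long)NBUF * BKVASIZE;
        printf("buffer_map: %d x %dK = %luMB of kernel vm\n",
            NBUF, BKVASIZE / 1024, map_bytes / (1024 * 1024));

        /*
         * A 1.44MB (1440K) floppy with 512-byte msdosfs blocks needs
         * one buffer per block to be cached entirely: 2880 buffers.
         */
        unsigned long floppy_bytes = 1440UL * 1024;
        printf("floppy: %lu bytes / 512 = %lu buffers\n",
            floppy_bytes, floppy_bytes / 512);
        return (0);
    }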
I thought that my 133 seconds for compiling a kernel (make depend;
make) on an Athlon 1400 was slow :-).  It took only 85 seconds a year
ago.

Bruce