On Tue, 22 Oct 2002, Seigo Tanimura wrote:

> Introduction:
>
> The I/O buffers of the kernel are currently allocated in buffer_map,
> which is sized statically upon boot and never grows.  This limits the
> scale of I/O performance on a host with large physical memory.  We
> used to tune NBUF to cope with that problem.  This workaround,
> however, results in a lot of wired pages not available for user
> processes, which is not acceptable for memory-bound applications.
>
> In order to run both I/O-bound and memory-bound processes on the same
> host, it is essential to achieve:
>
> A) allocation of buffers from kernel_map, to break the limit of a
>    map size, and
>
> B) page reclaim from idle buffers, to regulate the number of wired
>    pages.
>
> The patch at:
>
> http://people.FreeBSD.org/~tanimura/patches/dynamicbuf.diff.gz
I should be the last to defend the current design and implementation of
the buffer cache, since I think it gets almost everything wrong (the
implementation is OK, but has vast complications to work around design
errors), but I think buffer_map is one of the things that it gets right
(if we're going to have buffers at all).

Some history of this problem:

FreeBSD-1:

Allocating from kernel_map instead of buffer_map would almost take us
back to FreeBSD-1, where buffers were allocated from kmem_map using
malloc().  This caused larger problems with fragmentation.  Some of
these were due to foot-shooting, but I think large-memory machines give
essentially the same problems, and complete fragmentation of kernel_map
would cause more problems than complete fragmentation of any other map.

Part of the foot-shooting was to allocate too little vm to the kernel
and correspondingly too little vm to kmem_map.  The (i386) kernel was
originally at 0xFE000000, so there was only 32MB of kernel vm.  32MB
was far too small even for the relatively small physical memories at
the time (1992 or 1993), so this was changed to 0xF0000000 in
FreeBSD-1.1.5.  Then there was 256MB of kernel vm.  I suspect that this
increase reduced the fragmentation problems to insignificance in most
but not all cases.

Some of the interesting cases at the time of FreeBSD-1 were:

- machines with a small amount of physical memory.  These should have
  few problems, since there is not enough physical memory to make the
  maps more than sparse (unless the maps are undersized).

- machines with a not so small amount of physical memory.  It's
  possible that the too-small-in-general value for nbuf limited the
  problems.

- machines which only use one type of filesystem with one (small?)
  block size.  If all allocations have the same size, then there need
  be no fragmentation.  I'm not sure how strong this effect was in
  FreeBSD-1.  malloc() used a power-of-2 algorithm, but only up to a
  certain size, which covered 4K blocks but possibly not 8K blocks.
  Note that machines with large amounts of memory were likely to be
  specialized machines, so they were likely to take advantage of this
  without really trying, just by not mounting or not significantly
  using unusual filesystems like msdosfs, ext2fs and cd9660.

I used the following allocation policies in my version of FreeBSD-1.1.5:

- enlarge nbuf and the limit on buffer space (freebufspace) by a factor
  of 2 or 4 to get a larger buffer cache.

- enlarge nbuf by another factor of 8, but don't enlarge freebufspace,
  so that buffers of size 512 can hold as much as buffers of size 4096.
  (I didn't care about buffers of size 8192 or larger at the time.)

- actually enforce the freebufspace limit by discarding buffers in
  allocbuf() using a simplistic algorithm.

This worked well enough, but I only tested it on 486's with 8-16MB of
RAM.  The buffer cache had a size of 2MB or so.

End of FreeBSD-1 history.

FreeBSD-[2-5]:

Use of buffer_map was somehow implemented at the beginning, in rev.1.2
of vfs_bio.c, although this wasn't in FreeBSD-1.1.5.  Either I'm
missing some history or it was only in dyson's tree for FreeBSD-1.
Rev.1.2 used buffer_map in its purest form: each of nbuf buffers has a
data buffer consisting of MAXBSIZE bytes of vm attached to it at
bufinit() time.  The allocation never changes, and we simply map
physical pages into the vm when we have actual data.  The problems with
this are that MAXBSIZE is rather large and nbuf should be rather large
(and/or dynamic).  Subsequent changes added vast complications to
reduce the amount of vm.
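For concreteness, the pure rev.1.2 scheme described above amounts to
roughly the following (a minimal sketch of my reading of it, not the
actual vfs_bio.c code; NBUF, MAXBSIZE, b_kvabase and b_bufsize are real
names, but the struct layout and the init helper here are simplified):

    #include <stddef.h>

    #define NBUF      1024          /* number of buffer headers (nbuf) */
    #define MAXBSIZE  65536         /* maximum filesystem block size */

    struct buf {
        char *b_kvabase;            /* fixed MAXBSIZE window of kernel vm */
        long  b_bufsize;            /* bytes of physical pages mapped in */
    };

    static struct buf buffers[NBUF];

    /*
     * At bufinit() time, give each buffer a permanent MAXBSIZE slice
     * of buffer_map.  The kva assignment never changes afterwards;
     * allocbuf() only maps and unmaps physical pages within a slice,
     * so the map cannot fragment.
     */
    static void
    bufinit_sketch(char *buffer_map_base)
    {
        int i;

        for (i = 0; i < NBUF; i++) {
            buffers[i].b_kvabase = buffer_map_base + (size_t)i * MAXBSIZE;
            buffers[i].b_bufsize = 0;
        }
    }

The cost is obvious from the arithmetic: the reservation is always
nbuf * MAXBSIZE of vm whether or not the buffers hold data, and the
complications below all exist to shrink that reservation.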
I think these complications should only exist on machines with limited
amounts of vm (mainly i386's).

One of the complications was to reintroduce fragmentation problems:
buffer_map only has enough space for nbuf buffers of size BKVASIZE, and
the mappings are not statically allocated.  Another of the
complications is to discard buffers to reduce the fragmentation
problems.  Perhaps similar defragmentation would have worked well
enough in FreeBSD-1.1.  I suspect that your change depends on this
defragmentation, but I don't think the defragmentation can work as
well, since it can only touch buffers and not collateral fragmentation
of kernel_map.

I use the following changes in -current to enlarge the buffer cache and
avoid fragmentation.  These only work because I don't have much
physical memory (512MB max).  Even i386's have enough vm for the pure
form of buffer_map to work:

- enlarge BKVASIZE to MAXBSIZE so that fragmentation cannot (should
  not?) occur.

- enlarge nbuf by a factor of (my_BKVASIZE / current_BKVASIZE) to work
  around bugs.  The point of BKVASIZE got lost somewhere.

- enlarge nbuf and associated variables by another factor of 2 or 4 to
  get a larger buffer cache.  This is marginal for 512MB physical, and
  probably wouldn't work if I had a lot of mbufs.

With these changes, nbuf is about 4000 and buffer_map takes about 256MB
(see the sketch below for the arithmetic).  256MB is a lot of vm, but
nbuf = 4000 isn't a lot of buffers.  I used buffer caches with 2000 *
1K buffers under Minix and Linux before FreeBSD, and ISTR having an
nbuf of 5000 or so in FreeBSD-1.1.  At least 2880 buffers are needed to
properly cache a tiny 1.44MB floppy with an msdosfs file system with a
block size of 512, and that was an important test case.

End of FreeBSD-[2-5] history.

> implements buffer allocation from kernel_map and reclaim of buffer
> pages.  With this patch, make kernel-depend && make kernel completes
> about 30-60 seconds faster on my PC.

I don't understand how you got such large improvements.  My changes
make very little difference in -current, although they once made a
larger difference.  At one point there were significant pessimizations
in the buffer cache, but I thought that they were fixed.  The
pessimizations involved doing lots of remappings and/or lots of
reconstitutions of buffers.  These are very expensive operations; the
remapping alone took longer than copying the data at 100MB/sec on a
Celeron/366-overclocked.  Perhaps your test is hitting a pessimized
case.

> Experimental Evaluation and Results:
>
> The times taken to complete make kernel-depend && make kernel just
> after booting into single-user mode have been measured on my ThinkPad
> 600E (CPU: Pentium II 366MHz, RAM: 160MB) by time(1).  The number
> passed to the -j option of make(1) has been varied from 1 to 30 in
> order to control the pressure of the memory demand for user
> processes.  The baseline is the kernel without my patch.
>
> The following table shows the results.  All of the times are in
> seconds.
>
>               baseline                 w/ my patch
> -j       real     user     sys     real     user     sys
>   1   1608.21  1387.94  125.96  1577.88  1391.02  100.90
>  10   1576.10  1360.17  132.76  1531.79  1347.30  103.60
>  20   1568.01  1280.89  133.22  1509.36  1276.75  104.69
>  30   1923.42  1215.00  155.50  1865.13  1219.07  113.43
>
> Most of the improvements in the real times are accomplished by the
> speedup of system calls.  The hit ratio of getblk() may be increased,
> but not examined yet.

I think the improvements can only be explained by reduced thrashing of
something (probably not just the buffer cache itself, due to nbuf being
small).
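To spell out the sizing arithmetic from the FreeBSD-[2-5] notes above
(a throwaway user-level sketch; the constants are just the values
quoted in this message, not authoritative kernel defaults):

    #include <stdio.h>

    #define MAXBSIZE  (64 * 1024)   /* 64K maximum block size */
    #define BKVASIZE  MAXBSIZE      /* my change: BKVASIZE = MAXBSIZE */
    #define NBUF      4000          /* nbuf after scaling up */

    int
    main(void)
    {
        /*
         * buffer_map vm consumption: nbuf slices of BKVASIZE each.
         * 4000 * 64K is ~250MB, i.e., the "about 256MB" above.
         */
        unsigned long map_bytes = (unsigned long)NBUF * BKVASIZE;
        printf("buffer_map: %d x %dK = %luMB of kernel vm\n",
            NBUF, BKVASIZE / 1024, map_bytes / (1024 * 1024));

        /*
         * A 1.44MB (1440K) floppy with 512-byte msdosfs blocks needs
         * one buffer per block to be cached entirely: 2880 buffers.
         */
        unsigned long floppy_bytes = 1440UL * 1024;
        printf("floppy: %lu bytes / 512 = %lu buffers\n",
            floppy_bytes, floppy_bytes / 512);
        return (0);
    }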
I thought that my 133 seconds for compiling a kernel (make depend;
make) on an Athlon 1400 was slow :-).  It took only 85 seconds a year
ago.

Bruce