On Wed, 28 Mar 2007, Ulrich Spoerlein wrote:

hostA # scp 500MB hostB:/net/share/
...
If I run the scp again, I can see X MB/s going out from HostA, 2*X
MB/s coming in on HostB, and X MB/s out plus X MB/s in on HostC. What's
happening is that HostB issues one NFS READ call for every WRITE
call. The traffic flows like this:

----->   ----->
A        B        C
         <-----

If I rm(1) the file on the NFS share, then the first scp(1) will not
show this behaviour. It is only when overwriting files that this
happens.

At least under FreeBSD-~5.2 with an old version of scp, this is caused
by block size bugs in the kernel and/or scp, and an open-mode bug or
feature in scp.  The block size used by scp is 4K.  This is smaller
than the nfs block size of 8K, so nfs has to read ahead one 8K block for
each pair of 4K blocks written, so as to have non-garbage in the top
half of each 8K block after writing 4K to the bottom half.  It only
has to read ahead if there is something there, but repeated scp's
ensure this by not truncating the file on open (open mode (O_WRONLY |
O_CREAT) without O_TRUNC according to truss(1)).

The real weirdness comes into play, when I simply cp(1) from HostB
itself like this:

hostB # cp 500MB /net/share/

I can do this over and over again, and _never_ get any noteworthy
amount of NFS READ calls, only WRITE. The network traffic is also as
you would expect.

Then I tested using ssh(1) instead of scp(1), like this:

hostA # cat 500MB | ssh hostB "cat >/net/share/500MB"

This works, too. Probably because sh(1) is truncating the file?

cp truncates the file on open (open mode (O_WRONLY | O_TRUNC) without
O_CREAT according to truss(1)).  cp also uses a block size of 64K, so
it wouldn't cause read-ahead even if it didn't truncate.

There are many possible wrong block sizes:

- on my server, the block size according to st_blksize is 16K (ffs default).

- on my client, the block size according to st_blksize is 512 due to bugs
  in nfs.  There is an open PR or two about this.  In nfs2, the file
  system's block size on the server is passed to the client for each file
  and used for st_blksize but nothing else, but in nfs3, the block
  size that is put in st_blksize by the client is hard-coded to the
  arbitrary (usually bad) value NFS_FABLKSIZE = 512.  The correct block
  size to put in st_blksize in both cases seems to be the least common
  multiple of the nfs buffer size and the server block size, since if
  the application's i/o size is smaller than the nfs buffer size then
  there will be excessive block size conversions in the nfs client,
  and if the i/o size is smaller than the server's block size then
  there will be excessive block size conversions in the server's file
  system.  nfs's buffer size is the maximum of the read size, the write
  size and the page size.  This is usually 8K, so it is mismatched
  with the usual ffs server block size of 16K.  The inefficiencies
  from this are less noticeable than the inefficiencies from a mismatch
  with the nfs buffer size.

- scp for some reason doesn't use the advertised best block size of
  st_blksize = 512.  It uses 4K, which is almost as bad since it is
  smaller than the nfs buffer size.

- cp doesn't use the advertised best block size.  It uses mmap() for
  regular files smaller than 8M and a hard-coded block size of
  MAXBSIZE = 64K for large regular files and all non-regular files.

- the above is in FreeBSD-~5.2 (and FreeBSD-[1-4]).  st_blksize is
  much more bogus and broken in -current.  In -current, the value in
  va_blocksize that is carefully initialized for regular files by ffs
  and not so carefully initialized by nfs or for non-regular files,
  is not actually used even for regular files.  vn_stat() now uses the
  hard-coded (usually bad) value of PAGE_SIZE.  Thus st_blksize is
  useless, and ignoring it and using a larger hard-coded value in cp
  is a feature -- MAXBSIZE is too large in many cases, but a too-large
  value normally only wastes a little space while a too-small value
  normally wastes a lot of time.  MAXBSIZE is a good value for large
  files (e.g., large regular files and raw disks).  OTOH, even PAGE_SIZE
  is a waste of space for slow devices like keyboards.

- stdio is the main thing that is naive enough to believe that st_blksize
  is still useful.  The block size of BUFSIZ = 1024 in stdio.h is another
  way to get a pessimal block size, but stdio itself mainly uses it for
  strings and for what it thinks are ttys.  It misclassifies all cdevs as
  ttys, and thus uses a better block size than st_blksize = <kernel
  nonsense> for cdevs that are actually ttys, and a slightly worse block
  size than st_blksize = <kernel nonsense> and a much worse block size
  than cp's MAXBSIZE for cdevs that are actually disks.

Not truncating the file in scp might be a feature for avoiding clobbering
the whole file when the copying fails early, but it doesn't seem to be
documented.

Bruce
_______________________________________________
freebsd-net@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-net