On Mon, 27 May 2013, Klaus Weber wrote:

On Mon, May 27, 2013 at 03:57:56PM +1000, Bruce Evans wrote:
On Sun, 26 May 2013, Klaus Weber wrote:
Description:
Heavy disk I/O (two bonnie++ processes working on the same disk
simultaneously) causes an extreme degradation in disk throughput (combined
throughput as observed in iostat drops to ~1-3 MB/sec). The problem shows
up when both bonnie++ processes are in the "Rewriting..." phase.

Please use the unix newline character in mail.

My apologies. I submitted the report via the web-interface and did not
realize that it would come out this way.

Thanks.  The log output somehow came out right.

I found that
the problem could be fixed by killing cluster_write() by turning it into
bdwrite() (by editing the running kernel using ddb, since this is easier
than rebuilding the kernel).  I was trying many similar things since I
had a theory that cluster_write() is useless.  [...]

If that would provide a useful datapoint, I could try if that make a
difference on my system. What changes would be required to test this?

Surely it's not as easy as replacing the function body of
cluster_write() in vfs_cluster.c with just "return bdwrite(bp);"?

That should work for testing, but it is safer to edit ffs_write()
and remove the block where it calls cluster_write() (or bawrite()),
so that it falls through to call bdwrite() in most cases.
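
As a rough sketch, the interesting branch in ffs_write() (in
sys/ufs/ffs/ffs_vnops.c) looks something like the following; I am writing
the surrounding code from memory, so expect small differences in your
tree, and only the relevant branch is shown.  The test change is just
disabling the clustering logic so that full blocks take the plain
delayed-write path:

	} else if (xfersize + blkoffset == fs->fs_bsize) {
#if 1	/* for testing: never cluster, always do a delayed write */
		bdwrite(bp);
#else	/* the original clustering logic, roughly */
		if ((vp->v_mount->mnt_flag & MNT_NOCLUSTERW) == 0) {
			bp->b_flags |= B_CLUSTEROK;
			cluster_write(vp, bp, ip->i_size, seqcount);
		} else {
			bawrite(bp);
		}
#endif
	}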

My theory for what the bug is is that
cluster_write() and cluster_read() share the limited resource of pbufs.
pbufs are not managed as carefully as normal buffers.  In particular,
there is nothing to limit write pressure from pbufs like there is for
normal buffers.

Is there anything I can do to confirm or rebut this? Is the number of
pbufs in use visible via a sysctl, or could I add debug printfs that
are triggered when certain limits are reached?

Here I don't really know what to look for.  First add a sysctl to read
the number of free pbufs.  The variable for this is cluster_pbuf_freecnt
in vm.
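
Something like this should do (a minimal sketch; cluster_pbuf_freecnt is
declared in <sys/buf.h> if I remember right, and the node name and the
placement next to the other vfs sysctls in kern/vfs_bio.c are just my
choices):

	/* Export the free pbuf count, read-only, as vfs.cluster_pbuf_freecnt. */
	SYSCTL_INT(_vfs, OID_AUTO, cluster_pbuf_freecnt, CTLFLAG_RD,
	    &cluster_pbuf_freecnt, 0,
	    "Free pbufs available for clustered i/o");

Then it can be watched with "sysctl vfs.cluster_pbuf_freecnt" alongside
vfs.numdirtybuffers while the bonnies run.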

newfs -b 64k -f 8k /dev/da0p1

The default for newfs is -b 32k.  This asks for buffer cache fragmentation.
Someone increased the default from 16k to 32k without changing the buffer
cache's preferred size (BKVASIZE = 16K).  BKVASIZE has always been too
small, but on 32-bit arches kernel virtual memory is too limited to have
a larger BKVASIZE by default.  BKVASIZE is still 16K on all arches
although this problem doesn't affect 64-bit arches.

-b 64k is worse.

Thank you for this explanation. I was not aware that -b 64k (or even
the default values to newfs) would have this effect. I will repeat the
tests with 32/4k and 16/2k, although I seem to remember that 64/8k
provided a significant performance boost over the defaults. This, and
the reduced fsck times, were my original motivation to go with the
larger values.

The reduced fsck time and perhaps the reduced number of cylinder groups
are the main advantages of large clusters.  vfs-level clustering turns
most physical i/o's into 128K blocks (especially for large files), so
there is little difference in i/o speed across fs block sizes
unless the fs block size is very small.

Given the potentially drastic effects of block sizes other than 16/2k,
maybe a warning should be added to the newfs manpage? I only found the
strong advice to maintain an 8:1 block:fragment ratio.

Once the kernel misconfiguration is understood enough for such a warning
not to be FUD, it should be easy to fix.

When both bonnie++ processes are in their "Rewriting" phase, the system
hangs within a few seconds. Both bonnie++ processes are in state "nbufkv".
bufdaemon takes about 40% CPU time and is in state "qsleep" when not
active.

You got the buffer cache fragmentation that you asked for.

Looking at vfs_bio.c, I see that it has defrag code in it. Should I
try adding some debug output to this code to get some insight into why
it does not work, or is not as effective as it should be?

Don't start there, since it is complicated and timing-dependent.  Maybe
add some printfs to make it easy to see when it enters and leaves defrag
mode.
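
For example (a sketch only; getnewbuf() in kern/vfs_bio.c keeps a local
"defrag" flag, but the exact places where it is set and cleared depend on
your source tree, so treat the surrounding lines as placeholders):

	/* where getnewbuf() finds the buffer kva too fragmented: */
	if (defrag == 0)
		printf("getnewbuf: entering defrag mode, bufspace %ld\n",
		    (long)bufspace);
	defrag = 1;

	/* where the allocation finally succeeds and the flag is reset: */
	if (defrag != 0)
		printf("getnewbuf: leaving defrag mode\n");
	defrag = 0;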

Apparently you found a way to reproduce the serious fragmentation
problems.

A key factor seems to be the "Rewriting" operation. I see no problem
during the "normal" writing, nor could I reproduce it with concurrent
dd runs.

I don't know exactly what bonnie's rewrite mode does.  Is it just read/
[modify]/write of sequential blocks with a fairly small block size?
Old bonnie docs say that the block size is always 8K.  One reason I
don't like bonnie.  Clustering should work fairly normally with that.
Anything with random seeks would break clustering.
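
If it is just that, the rewrite loop would have roughly this shape (my
guess at what it does, not bonnie's actual source; the 8K chunk size is
from the old docs):

	#include <err.h>
	#include <fcntl.h>
	#include <unistd.h>

	#define CHUNK	8192	/* old bonnie docs say a fixed 8K block size */

	/*
	 * Guess at a "rewrite" pass: read a chunk, dirty one byte, seek back
	 * and write it out again, sequentially over the whole file.
	 */
	int
	main(int argc, char **argv)
	{
		char buf[CHUNK];
		ssize_t n;
		int fd;

		if (argc != 2)
			errx(1, "usage: rewrite file");
		if ((fd = open(argv[1], O_RDWR)) == -1)
			err(1, "open");
		while ((n = read(fd, buf, sizeof(buf))) > 0) {
			buf[0]++;				/* modify */
			if (lseek(fd, -n, SEEK_CUR) == -1)	/* seek back */
				err(1, "lseek");
			if (write(fd, buf, n) != n)		/* rewrite */
				err(1, "write");
		}
		close(fd);
		return (0);
	}

If that is the shape, every block is read and then dirtied almost
immediately, so read and write pressure land on the same buffers at the
same time.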

Increasing BKVASIZE would take more work than this, since although it
was intended to be a system parameter which could be changed to reduce
the fragmentation problem, one of the many bugs in it is that it was
never converted into a "new" kernel option.  Another of the bugs in
it is that doubling it halves the number of buffers, so doubling it
does more than use twice as much kva.  This severely limited the number
of buffers back when memory sizes were 64MB.  It is not a very
significant limitation if the memory size is 1GB or larger.

Should I try to experiment with BKVASIZE of 65536? If so, can I
somehow up the number of buffers again? Also, after modifying
BKVASIZE, is it sufficient to compile and install a new kernel, or do
I have to build and install the entire world?

Just the kernel, but changing sys/param.h will make most of the world
want to recompile itself according to dependencies.  I don't like rebuilding
things, and often set timestamps in header files back to what they were
to avoid rebuilding (after rebuilding only the object files that actually
depend on the change).  Use this hack with caution, or rebuild kernels in
a separate tree that doesn't affect the world.
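
The change itself is just an edit in sys/sys/param.h (as noted above,
BKVASIZE never became a real kernel option), something like:

	#define BKVASIZE	65536	/* was 16384; must stay a power of 2 */

To get the buffer count back up, nbuf can be forced from the loader
(kern.nbuf in /boot/loader.conf, if I remember the tunable's name right),
since the autosized default shrinks as BKVASIZE grows.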

[second bonnie goes Rewriting as well]
00-04-24.log:vfs.numdirtybuffers: 11586
00-04-25.log:vfs.numdirtybuffers: 16325
00-04-26.log:vfs.numdirtybuffers: 24333
...
00-04-54.log:vfs.numdirtybuffers: 52096
00-04-57.log:vfs.numdirtybuffers: 52098
00-05-00.log:vfs.numdirtybuffers: 52096
[ etc. ]

This is a rather large buildup and may indicate a problem.  Try reducing
the dirty buffer watermarks.  Their default values are mostly historical
nonsense.

You mean the vfs.(hi|lo)dirtybuffers? Will do. What would be
reasonable starting values for experimenting? 800/200?

1000 or 10000 (if nbuf is 50000).  1000 is probably too conservative, but
I think it is plenty for most loads.
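
Both are ordinary read-write sysctls, so no rebuild is needed; e.g.
"sysctl vfs.hidirtybuffers=10000 vfs.lodirtybuffers=5000" from the shell,
or the same thing from C (a minimal sketch; the 2:1 hi:lo split is only
my assumption for illustration):

	#include <sys/types.h>
	#include <sys/sysctl.h>
	#include <err.h>
	#include <stdio.h>

	int
	main(void)
	{
		int hi = 10000, lo = 5000;

		/* sysctlbyname(3): NULL old value, just set the new one. */
		if (sysctlbyname("vfs.hidirtybuffers", NULL, NULL, &hi,
		    sizeof(hi)) == -1)
			err(1, "vfs.hidirtybuffers");
		if (sysctlbyname("vfs.lodirtybuffers", NULL, NULL, &lo,
		    sizeof(lo)) == -1)
			err(1, "vfs.lodirtybuffers");
		printf("dirty buffer watermarks now %d/%d\n", hi, lo);
		return (0);
	}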

Bruce
_______________________________________________
freebsd-bugs@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-bugs
To unsubscribe, send any mail to "freebsd-bugs-unsubscr...@freebsd.org"
