On Mon, Jan 26, 2009 at 10:40 PM, Brent Jones <br...@servuhome.net> wrote:
> While doing some performance testing on a pair of X4540's running
> snv_105, I noticed some odd behavior while using CIFS.
> I am copying a 6TB database file (yes, a single file) over our GigE
> network to the X4540, then snapshotting that data to the secondary
> X4540.
> Writing said 6TB file can saturate our gigabit network, with about
> 95-100MB/sec going over the wire (can't ask for any more, really).
>
> However, the disk I/O on the X4540 appears unusual. I would expect the
> disks to be constantly writing 95-100MB/sec, but the system appears to
> buffer about 1GB worth of data before committing it to disk. This is
> in contrast to NFS write behavior: as I write a 1GB file to the NFS
> server from an NFS client, traffic on the wire correlates closely
> with the disk writes. For example, 60MB/sec on the wire via NFS will
> trigger 60MB/sec on disk. This is a single file in both cases.
>
> I wouldn't have a problem with this "buffer" by itself; it seems to be
> a rolling 10-second buffer. If I copy several small files at lower
> speeds, the buffer still "purges" to disk after roughly 10 seconds,
> not when a certain size is reached. The problem is the amount of data
> that accumulates in the buffer: flushing 1GB to disk can slow the
> system down substantially, and all network traffic pauses or drops to
> mere kilobytes per second while that buffer is written out.
>
> I would like to see smoother handling of this buffer, or a tunable
> to make it write out more often or fill more quickly.
>
> This is a 48TB unit with 64GB of RAM, and the arcstat Perl script
> reports my ARC is 55GB in size, with a near-0% miss rate on reads.
>
> Has anyone seen something similar, or does anyone know of any
> undocumented tunables to reduce the effects of this?
>
>
> Here is 'zpool iostat' output, at 1-second intervals, while this "write
> storm" occurs.
>
>
> # zpool iostat pdxfilu01 1
>               capacity     operations    bandwidth
> pool         used  avail   read  write   read  write
> ----------  -----  -----  -----  -----  -----  -----
> pdxfilu01   2.09T  36.0T      1     61   143K  7.30M
> pdxfilu01   2.09T  36.0T      0      0      0      0
> pdxfilu01   2.09T  36.0T      0      0      0      0
> pdxfilu01   2.09T  36.0T      0      0      0      0
> pdxfilu01   2.09T  36.0T      0     60      0  7.55M
> pdxfilu01   2.09T  36.0T      0  1.70K      0   211M
> pdxfilu01   2.09T  36.0T      0  2.56K      0   323M
> pdxfilu01   2.09T  36.0T      0  2.97K      0   375M
> pdxfilu01   2.09T  36.0T      0  3.15K      0   399M
> pdxfilu01   2.09T  36.0T      0  2.22K      0   244M
> pdxfilu01   2.09T  36.0T      0      0      0      0
> pdxfilu01   2.09T  36.0T      0      0      0      0
> pdxfilu01   2.09T  36.0T      0      0      0      0
> pdxfilu01   2.09T  36.0T      0      0      0      0
>
>
> Here is my 'zpool status' output.
>
> # zpool status
>  pool: pdxfilu01
>  state: ONLINE
>  scrub: none requested
> config:
>
>        NAME        STATE     READ WRITE CKSUM
>        pdxfilu01   ONLINE       0     0     0
>          raidz1    ONLINE       0     0     0
>            c5t0d0  ONLINE       0     0     0
>            c6t0d0  ONLINE       0     0     0
>            c7t0d0  ONLINE       0     0     0
>            c8t0d0  ONLINE       0     0     0
>            c9t0d0  ONLINE       0     0     0
>          raidz1    ONLINE       0     0     0
>            c4t1d0  ONLINE       0     0     0
>            c6t1d0  ONLINE       0     0     0
>            c7t1d0  ONLINE       0     0     0
>            c8t1d0  ONLINE       0     0     0
>            c9t1d0  ONLINE       0     0     0
>          raidz1    ONLINE       0     0     0
>            c4t2d0  ONLINE       0     0     0
>            c5t2d0  ONLINE       0     0     0
>            c7t2d0  ONLINE       0     0     0
>            c8t2d0  ONLINE       0     0     0
>            c9t2d0  ONLINE       0     0     0
>          raidz1    ONLINE       0     0     0
>            c4t3d0  ONLINE       0     0     0
>            c5t3d0  ONLINE       0     0     0
>            c6t3d0  ONLINE       0     0     0
>            c8t3d0  ONLINE       0     0     0
>            c9t3d0  ONLINE       0     0     0
>          raidz1    ONLINE       0     0     0
>            c4t4d0  ONLINE       0     0     0
>            c5t4d0  ONLINE       0     0     0
>            c6t4d0  ONLINE       0     0     0
>            c7t4d0  ONLINE       0     0     0
>            c9t4d0  ONLINE       0     0     0
>          raidz1    ONLINE       0     0     0
>            c4t5d0  ONLINE       0     0     0
>            c5t5d0  ONLINE       0     0     0
>            c6t5d0  ONLINE       0     0     0
>            c7t5d0  ONLINE       0     0     0
>            c8t5d0  ONLINE       0     0     0
>          raidz1    ONLINE       0     0     0
>            c4t6d0  ONLINE       0     0     0
>            c5t6d0  ONLINE       0     0     0
>            c6t6d0  ONLINE       0     0     0
>            c7t6d0  ONLINE       0     0     0
>            c8t6d0  ONLINE       0     0     0
>            c9t6d0  ONLINE       0     0     0
>          raidz1    ONLINE       0     0     0
>            c4t7d0  ONLINE       0     0     0
>            c5t7d0  ONLINE       0     0     0
>            c6t7d0  ONLINE       0     0     0
>            c7t7d0  ONLINE       0     0     0
>            c8t7d0  ONLINE       0     0     0
>            c9t7d0  ONLINE       0     0     0
>        spares
>          c6t2d0    AVAIL
>          c7t3d0    AVAIL
>          c8t4d0    AVAIL
>          c9t5d0    AVAIL
>
>
>
> --
> Brent Jones
> br...@servuhome.net
>

I found some insight into this behavior in this Sun blog post by Roch
Bourbonnais: http://blogs.sun.com/roch/date/20080514

An excerpt from the section describing the behavior I'm hitting:

"The new code keeps track of the amount of data accepted in a TXG and
the time it takes to sync. It dynamically adjusts that amount so that
each TXG sync takes about 5 seconds (txg_time variable). It also
clamps the limit to no more than 1/8th of physical memory."
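
Doing the back-of-envelope math for this box: the 1/8th-of-memory clamp
would be 64GB / 8 = 8GB, so that ceiling isn't what I'm hitting. At the
~100MB/sec I can push over GigE, a TXG that stays open for roughly 10
seconds only accumulates about 1GB (100MB/sec x 10 sec), which lines up
with the ~1GB bursts I see in 'zpool iostat'.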

So the 4-5 second "I/O burst" of several hundred megabytes per second
is what I see when that transaction group buffer fills up. He also
documents that the flush can, and does, issue delays to the writing
threads, which explains the momentary drops in throughput and the
sluggish system performance while the write buffer is flushed to disk.
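
If anyone wants to poke at this, the sync target can at least be
inspected (and nudged) on the live kernel with mdb. This is only a
sketch of what I have in mind, not something I've verified on snv_105;
the blog post calls the variable txg_time, but the name may differ
between builds, so check that the symbol actually exists first:

# echo 'txg_time/D' | mdb -k
# echo 'txg_time/W 0t2' | mdb -kw

The first line prints the current sync target (decimal), and the second
would drop it to 2 seconds (0t means decimal, and -w opens the kernel
writable).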

I wish there were a better way to handle that, but at the speed I'm
writing (and I'll be getting a 10GigE link soon), I don't see any
other graceful way of handling that much data in a buffer.
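
The only other knob I can think of trying is capping how much dirty
data a TXG will accept, persistently via /etc/system. Again, purely a
guess on my part; I'm assuming the zfs_write_limit_override tunable
from the new write-throttle code is present in snv_105, and the right
value would take some experimenting:

* Cap the dirty data accepted per transaction group at 512MB (bytes).
* Hypothetical until I confirm this tunable exists on this build.
set zfs:zfs_write_limit_override = 0x20000000

A smaller per-TXG limit should mean smaller, more frequent flushes
instead of one multi-hundred-MB/sec burst every few seconds.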

Loving these X4540's so far though...

-- 
Brent Jones
br...@servuhome.net
_______________________________________________
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
