comment far below...

Brent Jones wrote:
> On Mon, Jan 26, 2009 at 10:40 PM, Brent Jones <br...@servuhome.net> wrote:
>
>> While doing some performance testing on a pair of X4540's running
>> snv_105, I noticed some odd behavior while using CIFS.
>> I am copying a 6TB database file (yes, a single file) over our GigE
>> network to the X4540, then snapshotting that data to the secondary
>> X4540.
>> Writing said 6TB file can peak our gigabit network, with about
>> 95-100MB/sec going over the wire (can't ask for any more, really).
>>
>> However, the disk I/O on the X4540 appears unusual. I would expect the
>> disks to be constantly writing 95-100MB/sec, but the system appears to
>> buffer about 1GB worth of data before committing it to disk. This is in
>> contrast to NFS write behavior, where, as I write a 1GB file to the NFS
>> server from an NFS client, traffic on the wire correlates closely with
>> the disk writes. For example, 60MB/sec on the wire via NFS will trigger
>> 60MB/sec on disk. This is a single file in both cases.
>>
>> I wouldn't have a problem with this "buffer"; it seems to be a rolling
>> 10-second buffer, and if I am copying several small files at lower
>> speeds, the buffer still seems to "purge" after roughly 10 seconds, not
>> when a certain size is reached. The problem is the amount of data that
>> goes into the buffer: writing 1GB to disk can cause the system to slow
>> down substantially, and all network traffic pauses or drops to mere
>> kilobytes a second while this buffer is written out.
>>
>> I would like to see smoother handling of this buffer, or a tuneable
>> to make the buffer write more often or fill more quickly.
>>
>> This is a 48TB unit with 64GB of RAM, and the arcstat Perl script
>> reports my ARC is 55GB in size, with a near 0% miss rate on reads.
>>
>> Has anyone seen something similar, or does anyone know of any
>> undocumented tuneables to reduce the effects of this?
>>
>> Here is 'zpool iostat' output, in 1-second intervals, while this
>> "write storm" occurs:
>>
>> # zpool iostat pdxfilu01 1
>>                capacity     operations    bandwidth
>> pool         used  avail   read  write   read  write
>> ----------  -----  -----  -----  -----  -----  -----
>> pdxfilu01   2.09T  36.0T      1     61   143K  7.30M
>> pdxfilu01   2.09T  36.0T      0      0      0      0
>> pdxfilu01   2.09T  36.0T      0      0      0      0
>> pdxfilu01   2.09T  36.0T      0      0      0      0
>> pdxfilu01   2.09T  36.0T      0     60      0  7.55M
>> pdxfilu01   2.09T  36.0T      0  1.70K      0   211M
>> pdxfilu01   2.09T  36.0T      0  2.56K      0   323M
>> pdxfilu01   2.09T  36.0T      0  2.97K      0   375M
>> pdxfilu01   2.09T  36.0T      0  3.15K      0   399M
>> pdxfilu01   2.09T  36.0T      0  2.22K      0   244M
>> pdxfilu01   2.09T  36.0T      0      0      0      0
>> pdxfilu01   2.09T  36.0T      0      0      0      0
>> pdxfilu01   2.09T  36.0T      0      0      0      0
>> pdxfilu01   2.09T  36.0T      0      0      0      0
>>
>> Here is my 'zpool status' output:
>>
>> # zpool status
>>   pool: pdxfilu01
>>  state: ONLINE
>>  scrub: none requested
>> config:
>>
>>         NAME        STATE     READ WRITE CKSUM
>>         pdxfilu01   ONLINE       0     0     0
>>           raidz1    ONLINE       0     0     0
>>             c5t0d0  ONLINE       0     0     0
>>             c6t0d0  ONLINE       0     0     0
>>             c7t0d0  ONLINE       0     0     0
>>             c8t0d0  ONLINE       0     0     0
>>             c9t0d0  ONLINE       0     0     0
>>           raidz1    ONLINE       0     0     0
>>             c4t1d0  ONLINE       0     0     0
>>             c6t1d0  ONLINE       0     0     0
>>             c7t1d0  ONLINE       0     0     0
>>             c8t1d0  ONLINE       0     0     0
>>             c9t1d0  ONLINE       0     0     0
>>           raidz1    ONLINE       0     0     0
>>             c4t2d0  ONLINE       0     0     0
>>             c5t2d0  ONLINE       0     0     0
>>             c7t2d0  ONLINE       0     0     0
>>             c8t2d0  ONLINE       0     0     0
>>             c9t2d0  ONLINE       0     0     0
>>           raidz1    ONLINE       0     0     0
>>             c4t3d0  ONLINE       0     0     0
>>             c5t3d0  ONLINE       0     0     0
>>             c6t3d0  ONLINE       0     0     0
>>             c8t3d0  ONLINE       0     0     0
>>             c9t3d0  ONLINE       0     0     0
>>           raidz1    ONLINE       0     0     0
>>             c4t4d0  ONLINE       0     0     0
>>             c5t4d0  ONLINE       0     0     0
>>             c6t4d0  ONLINE       0     0     0
>>             c7t4d0  ONLINE       0     0     0
>>             c9t4d0  ONLINE       0     0     0
>>           raidz1    ONLINE       0     0     0
>>             c4t5d0  ONLINE       0     0     0
>>             c5t5d0  ONLINE       0     0     0
>>             c6t5d0  ONLINE       0     0     0
>>             c7t5d0  ONLINE       0     0     0
>>             c8t5d0  ONLINE       0     0     0
>>           raidz1    ONLINE       0     0     0
>>             c4t6d0  ONLINE       0     0     0
>>             c5t6d0  ONLINE       0     0     0
>>             c6t6d0  ONLINE       0     0     0
>>             c7t6d0  ONLINE       0     0     0
>>             c8t6d0  ONLINE       0     0     0
>>             c9t6d0  ONLINE       0     0     0
>>           raidz1    ONLINE       0     0     0
>>             c4t7d0  ONLINE       0     0     0
>>             c5t7d0  ONLINE       0     0     0
>>             c6t7d0  ONLINE       0     0     0
>>             c7t7d0  ONLINE       0     0     0
>>             c8t7d0  ONLINE       0     0     0
>>             c9t7d0  ONLINE       0     0     0
>>         spares
>>           c6t2d0    AVAIL
>>           c7t3d0    AVAIL
>>           c8t4d0    AVAIL
>>           c9t5d0    AVAIL
>>
>>
>> --
>> Brent Jones
>> br...@servuhome.net
>>
>
> I found some insight into this behavior in a Sun blog post by Roch
> Bourbonnais: http://blogs.sun.com/roch/date/20080514
>
> Excerpt from the section that I seem to have encountered:
>
> "The new code keeps track of the amount of data accepted in a TXG and
> the time it takes to sync. It dynamically adjusts that amount so that
> each TXG sync takes about 5 seconds (txg_time variable). It also
> clamps the limit to no more than 1/8th of physical memory."
>
> So, when I fill up that transaction group buffer, that is when I see
> the 4-5 second "I/O burst" of several hundred megabytes per second.
> He also documents that the buffer flush can, and does, issue delays to
> the writing threads, which is why I'm seeing those momentary drops in
> throughput and sluggish system performance while that write buffer is
> flushed to disk.
Yes, this tends to be more efficient. You can tune it by setting
zfs_txg_synctime, which is 5 by default. It is rare that we've seen
this be a win, which is why we don't mention it in the Evil Tuning
Guide.

> Wish there was a better way to handle that, but at the speed I'm
> writing (and I'll be getting a 10GigE link soon), I don't see any
> other graceful way of handling that much data in a buffer.
>

I think your workload might change dramatically when you get a faster
pipe, so unless you really feel compelled, I wouldn't suggest changing
it.
 -- richard

> Loving these X4540's so far, though...
>
> _______________________________________________
> zfs-discuss mailing list
> zfs-discuss@opensolaris.org
> http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
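
For reference, a minimal sketch of how one might confirm that the bursts
line up with the ~5-second transaction group syncs described in the quoted
blog excerpt above. It watches spa_sync() fire via the fbt provider; the
probe name and argument layout are the generic OpenSolaris ones and are
assumed, not verified against snv_105:

  # dtrace -qn 'fbt::spa_sync:entry { printf("%Y  syncing txg %d\n", walltimestamp, args[1]); }'

Under a steady CIFS write load, a new line should appear roughly every 5
seconds, matching the write bursts seen in the 'zpool iostat' output above.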
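
If, after measuring, the default still seems too bursty, the zfs_txg_synctime
tunable named in the reply above can be lowered. The variable name and its
default of 5 (seconds) come from the reply; the mdb and /etc/system syntax
below is the conventional Solaris form, offered as an assumption to check
against your own build rather than a tested recipe. Print the current value,
then set it to 2 seconds on the live kernel (the mdb change does not survive
a reboot):

  # echo zfs_txg_synctime/D | mdb -k
  # echo zfs_txg_synctime/W0t2 | mdb -kw

For a persistent setting, add a line of this form to /etc/system and reboot:

  set zfs:zfs_txg_synctime = 2

At the ~100MB/sec wire rate described above, a 2-second target would let each
transaction group accumulate on the order of 200MB instead of the ~1GB
observed now, trading the long pauses for smaller, more frequent flushes. As
noted in the reply, this is rarely a win, so measure before leaving it in
place.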