On Mon, Jan 26, 2009 at 10:40 PM, Brent Jones <br...@servuhome.net> wrote:
> While doing some performance testing on a pair of X4540's running
> snv_105, I noticed some odd behavior while using CIFS.
> I am copying a 6TB database file (yes, a single file) over our GigE
> network to the X4540, then snapshotting that data to the secondary
> X4540.
> Writing that 6TB file can saturate our gigabit network, with about
> 95-100MB/sec going over the wire (can't ask for any more, really).
>
> However, the disk I/O on the X4540 appears unusual. I would expect the
> disks to be writing a constant 95-100MB/sec, but instead the system
> buffers about 1GB worth of data before committing it to disk. This is
> in contrast to NFS write behavior, where, as I write a 1GB file to the
> NFS server from an NFS client, traffic on the wire correlates closely
> with the disk writes. For example, 60MB/sec on the wire via NFS will
> trigger 60MB/sec on disk. This is a single file in both cases.
>
> I wouldn't have a problem with this "buffer" by itself; it behaves like
> a rolling 10-second buffer. If I am copying several small files at
> lower speeds, the buffer still seems to "purge" after roughly 10
> seconds, not when a certain size is reached. The problem is the amount
> of data that accumulates in the buffer: flushing 1GB to disk can slow
> the system down substantially, and all network traffic pauses or drops
> to mere kilobytes per second while that buffer is being written out.
>
> I would like to see smoother handling of this buffer, or a tuneable to
> make it flush more often or at a smaller size.
>
> This is a 48TB unit with 64GB of RAM, and the arcstat Perl script
> reports my ARC is 55GB in size, with a near-0% miss rate on reads.
>
> Has anyone seen something similar, or does anyone know of any
> undocumented tuneables to reduce the effects of this?
>
>
> Here is 'zpool iostat' output, at 1-second intervals, while this
> "write storm" occurs:
>
>
> # zpool iostat pdxfilu01 1
>                capacity     operations    bandwidth
> pool         used  avail   read  write   read  write
> ----------  -----  -----  -----  -----  -----  -----
> pdxfilu01   2.09T  36.0T      1     61   143K  7.30M
> pdxfilu01   2.09T  36.0T      0      0      0      0
> pdxfilu01   2.09T  36.0T      0      0      0      0
> pdxfilu01   2.09T  36.0T      0      0      0      0
> pdxfilu01   2.09T  36.0T      0     60      0  7.55M
> pdxfilu01   2.09T  36.0T      0  1.70K      0   211M
> pdxfilu01   2.09T  36.0T      0  2.56K      0   323M
> pdxfilu01   2.09T  36.0T      0  2.97K      0   375M
> pdxfilu01   2.09T  36.0T      0  3.15K      0   399M
> pdxfilu01   2.09T  36.0T      0  2.22K      0   244M
> pdxfilu01   2.09T  36.0T      0      0      0      0
> pdxfilu01   2.09T  36.0T      0      0      0      0
> pdxfilu01   2.09T  36.0T      0      0      0      0
> pdxfilu01   2.09T  36.0T      0      0      0      0
>
>
> Here is my 'zpool status' output.
>
> # zpool status
>   pool: pdxfilu01
>  state: ONLINE
>  scrub: none requested
> config:
>
>         NAME        STATE     READ WRITE CKSUM
>         pdxfilu01   ONLINE       0     0     0
>           raidz1    ONLINE       0     0     0
>             c5t0d0  ONLINE       0     0     0
>             c6t0d0  ONLINE       0     0     0
>             c7t0d0  ONLINE       0     0     0
>             c8t0d0  ONLINE       0     0     0
>             c9t0d0  ONLINE       0     0     0
>           raidz1    ONLINE       0     0     0
>             c4t1d0  ONLINE       0     0     0
>             c6t1d0  ONLINE       0     0     0
>             c7t1d0  ONLINE       0     0     0
>             c8t1d0  ONLINE       0     0     0
>             c9t1d0  ONLINE       0     0     0
>           raidz1    ONLINE       0     0     0
>             c4t2d0  ONLINE       0     0     0
>             c5t2d0  ONLINE       0     0     0
>             c7t2d0  ONLINE       0     0     0
>             c8t2d0  ONLINE       0     0     0
>             c9t2d0  ONLINE       0     0     0
>           raidz1    ONLINE       0     0     0
>             c4t3d0  ONLINE       0     0     0
>             c5t3d0  ONLINE       0     0     0
>             c6t3d0  ONLINE       0     0     0
>             c8t3d0  ONLINE       0     0     0
>             c9t3d0  ONLINE       0     0     0
>           raidz1    ONLINE       0     0     0
>             c4t4d0  ONLINE       0     0     0
>             c5t4d0  ONLINE       0     0     0
>             c6t4d0  ONLINE       0     0     0
>             c7t4d0  ONLINE       0     0     0
>             c9t4d0  ONLINE       0     0     0
>           raidz1    ONLINE       0     0     0
>             c4t5d0  ONLINE       0     0     0
>             c5t5d0  ONLINE       0     0     0
>             c6t5d0  ONLINE       0     0     0
>             c7t5d0  ONLINE       0     0     0
>             c8t5d0  ONLINE       0     0     0
>           raidz1    ONLINE       0     0     0
>             c4t6d0  ONLINE       0     0     0
>             c5t6d0  ONLINE       0     0     0
>             c6t6d0  ONLINE       0     0     0
>             c7t6d0  ONLINE       0     0     0
>             c8t6d0  ONLINE       0     0     0
>             c9t6d0  ONLINE       0     0     0
>           raidz1    ONLINE       0     0     0
>             c4t7d0  ONLINE       0     0     0
>             c5t7d0  ONLINE       0     0     0
>             c6t7d0  ONLINE       0     0     0
>             c7t7d0  ONLINE       0     0     0
>             c8t7d0  ONLINE       0     0     0
>             c9t7d0  ONLINE       0     0     0
>         spares
>           c6t2d0    AVAIL
>           c7t3d0    AVAIL
>           c8t4d0    AVAIL
>           c9t5d0    AVAIL
>
>
> --
> Brent Jones
> br...@servuhome.net
>
I found some insight into this behavior in a Sun blog post by Roch Bourbonnais:
http://blogs.sun.com/roch/date/20080514

Excerpt from the section that describes what I seem to have encountered:

"The new code keeps track of the amount of data accepted in a TXG and the
time it takes to sync. It dynamically adjusts that amount so that each TXG
sync takes about 5 seconds (txg_time variable). It also clamps the limit to
no more than 1/8th of physical memory."

So, when I fill up that transaction group buffer, that is when I see that
4-5 second "I/O burst" of several hundred megabytes per second.

He also documents that the buffer flush can, and does, issue delays to the
writing threads, which is why I'm seeing those momentary drops in throughput
and sluggish system performance while that write buffer is flushed to disk.

I wish there were a better way to handle this, but at the speed I'm writing
(and I'll be getting a 10GigE link soon), I don't see any other graceful way
to handle that much data in a buffer.

Loving these X4540's so far, though...

--
Brent Jones
br...@servuhome.net

_______________________________________________
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
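
The throttle behavior quoted in that excerpt can be sketched roughly as
follows. This is only an illustrative model of the logic as the blog post
describes it, not the actual ZFS kernel code: the 5-second target (txg_time)
and the 1/8th-of-RAM clamp come from the excerpt, while the TxgThrottle
class, the pool bandwidth figure, and the chunk sizes are assumptions made
up for the example.

# Illustrative model of the TXG write throttle described in the blog excerpt
# above; this is NOT the actual ZFS kernel code.  The 5-second target
# (txg_time) and the 1/8th-of-RAM clamp come from the excerpt; everything
# else (class name, bandwidth, chunk sizes) is assumed for the example.

TXG_TIME = 5.0                      # target seconds per TXG sync
PHYS_MEM = 64 * 2**30               # 64GB of RAM, as on these X4540s
WRITE_LIMIT_MAX = PHYS_MEM // 8     # clamp: accept at most 1/8 of RAM per TXG


class TxgThrottle:
    def __init__(self, pool_write_bw):
        self.pool_write_bw = pool_write_bw  # bytes/sec the pool sinks during a sync
        self.write_limit = WRITE_LIMIT_MAX  # re-aimed after every sync
        self.accepted = 0                   # bytes accepted into the open TXG

    def write(self, nbytes):
        """Accept a write into the open TXG; force a sync once the limit is hit."""
        if self.accepted + nbytes > self.write_limit:
            self.sync()                     # writers are delayed here: the pause on the wire
        self.accepted += nbytes

    def sync(self):
        """Flush the TXG, then resize write_limit so the next sync takes ~TXG_TIME."""
        sync_secs = self.accepted / self.pool_write_bw
        if sync_secs > 0:
            observed_bw = self.accepted / sync_secs
            # Aim for TXG_TIME seconds' worth of data next time, clamped to 1/8 of RAM.
            self.write_limit = min(int(observed_bw * TXG_TIME), WRITE_LIMIT_MAX)
        self.accepted = 0


# Example: ~100MB/sec arriving over GigE, pool sinking ~400MB/sec during the burst.
throttle = TxgThrottle(pool_write_bw=400 * 2**20)
for _ in range(1000):
    throttle.write(10 * 2**20)              # 10MB chunks from the CIFS stream
print("per-TXG write limit: %.0f MB" % (throttle.write_limit / 2**20))

With 64GB of RAM the 1/8th clamp allows up to 8GB per TXG, so the roughly
1GB bursts in the iostat output above are well below the clamp; they
correspond to about 10 seconds of data arriving at ~100MB/sec over the wire.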