comment far below...

Brent Jones wrote:
> On Mon, Jan 26, 2009 at 10:40 PM, Brent Jones <br...@servuhome.net> wrote:
>   
>> While doing some performance testing on a pair of X4540's running
>> snv_105, I noticed some odd behavior while using CIFS.
>> I am copying a 6TB database file (yes, a single file) over our GigE
>> network to the X4540, then snapshotting that data to the secondary
>> X4540.
>> Writing said 6TB file can peak our gigabit network, with about
>> 95-100MB/sec going over the wire (can't ask for any more, really).
>>
>> However, the disk I/O on the X4540 appears unusual. I would expect the
>> disks to be writing a constant 95-100MB/sec, but the system appears to
>> buffer about 1GB of data before committing it to disk. This is in
>> contrast to NFS write behavior: as I write a 1GB file to the NFS server
>> from an NFS client, traffic on the wire correlates closely with the
>> disk writes. For example, 60MB/sec on the wire via NFS produces
>> 60MB/sec on disk. This is a single file in both cases.
>>
>> I wouldn't have a problem with this "buffer" by itself; it seems to be
>> a rolling 10-second buffer. If I copy several small files at lower
>> speeds, the buffer still "purges" after roughly 10 seconds rather than
>> when a certain size is reached. The problem is how much data
>> accumulates in the buffer: flushing 1GB to disk can slow the system
>> down substantially, and all network traffic pauses or drops to mere
>> kilobytes per second while the buffer is written out.
>>
>> I would like to see this buffer handled more smoothly, or a tunable to
>> make it write more often or fill more quickly.
>>
>> This is a 48TB unit with 64GB of RAM, and the arcstat Perl script
>> reports my ARC is 55GB in size, with a near-0% miss rate on reads.
>>
>> Has anyone seen something similar, or does anyone know of any
>> undocumented tunables to reduce the effects of this?
>>
>>
>> Here is 'zpool iostat' output, at 1-second intervals, while this "write
>> storm" occurs.
>>
>>
>> # zpool iostat pdxfilu01 1
>>               capacity     operations    bandwidth
>> pool         used  avail   read  write   read  write
>> ----------  -----  -----  -----  -----  -----  -----
>> pdxfilu01   2.09T  36.0T      1     61   143K  7.30M
>> pdxfilu01   2.09T  36.0T      0      0      0      0
>> pdxfilu01   2.09T  36.0T      0      0      0      0
>> pdxfilu01   2.09T  36.0T      0      0      0      0
>> pdxfilu01   2.09T  36.0T      0     60      0  7.55M
>> pdxfilu01   2.09T  36.0T      0  1.70K      0   211M
>> pdxfilu01   2.09T  36.0T      0  2.56K      0   323M
>> pdxfilu01   2.09T  36.0T      0  2.97K      0   375M
>> pdxfilu01   2.09T  36.0T      0  3.15K      0   399M
>> pdxfilu01   2.09T  36.0T      0  2.22K      0   244M
>> pdxfilu01   2.09T  36.0T      0      0      0      0
>> pdxfilu01   2.09T  36.0T      0      0      0      0
>> pdxfilu01   2.09T  36.0T      0      0      0      0
>> pdxfilu01   2.09T  36.0T      0      0      0      0
>>
>>
>> Here is my 'zpool status' output.
>>
>> # zpool status
>>  pool: pdxfilu01
>>  state: ONLINE
>>  scrub: none requested
>> config:
>>
>>        NAME        STATE     READ WRITE CKSUM
>>        pdxfilu01   ONLINE       0     0     0
>>          raidz1    ONLINE       0     0     0
>>            c5t0d0  ONLINE       0     0     0
>>            c6t0d0  ONLINE       0     0     0
>>            c7t0d0  ONLINE       0     0     0
>>            c8t0d0  ONLINE       0     0     0
>>            c9t0d0  ONLINE       0     0     0
>>          raidz1    ONLINE       0     0     0
>>            c4t1d0  ONLINE       0     0     0
>>            c6t1d0  ONLINE       0     0     0
>>            c7t1d0  ONLINE       0     0     0
>>            c8t1d0  ONLINE       0     0     0
>>            c9t1d0  ONLINE       0     0     0
>>          raidz1    ONLINE       0     0     0
>>            c4t2d0  ONLINE       0     0     0
>>            c5t2d0  ONLINE       0     0     0
>>            c7t2d0  ONLINE       0     0     0
>>            c8t2d0  ONLINE       0     0     0
>>            c9t2d0  ONLINE       0     0     0
>>          raidz1    ONLINE       0     0     0
>>            c4t3d0  ONLINE       0     0     0
>>            c5t3d0  ONLINE       0     0     0
>>            c6t3d0  ONLINE       0     0     0
>>            c8t3d0  ONLINE       0     0     0
>>            c9t3d0  ONLINE       0     0     0
>>          raidz1    ONLINE       0     0     0
>>            c4t4d0  ONLINE       0     0     0
>>            c5t4d0  ONLINE       0     0     0
>>            c6t4d0  ONLINE       0     0     0
>>            c7t4d0  ONLINE       0     0     0
>>            c9t4d0  ONLINE       0     0     0
>>          raidz1    ONLINE       0     0     0
>>            c4t5d0  ONLINE       0     0     0
>>            c5t5d0  ONLINE       0     0     0
>>            c6t5d0  ONLINE       0     0     0
>>            c7t5d0  ONLINE       0     0     0
>>            c8t5d0  ONLINE       0     0     0
>>          raidz1    ONLINE       0     0     0
>>            c4t6d0  ONLINE       0     0     0
>>            c5t6d0  ONLINE       0     0     0
>>            c6t6d0  ONLINE       0     0     0
>>            c7t6d0  ONLINE       0     0     0
>>            c8t6d0  ONLINE       0     0     0
>>            c9t6d0  ONLINE       0     0     0
>>          raidz1    ONLINE       0     0     0
>>            c4t7d0  ONLINE       0     0     0
>>            c5t7d0  ONLINE       0     0     0
>>            c6t7d0  ONLINE       0     0     0
>>            c7t7d0  ONLINE       0     0     0
>>            c8t7d0  ONLINE       0     0     0
>>            c9t7d0  ONLINE       0     0     0
>>        spares
>>          c6t2d0    AVAIL
>>          c7t3d0    AVAIL
>>          c8t4d0    AVAIL
>>          c9t5d0    AVAIL
>>
>>
>>
>> --
>> Brent Jones
>> br...@servuhome.net
>>
>>     
>
> I found some insight into this behavior in a Sun blog post by Roch
> Bourbonnais: http://blogs.sun.com/roch/date/20080514
>
> Excerpt from the section that seems to describe what I encountered:
>
> "The new code keeps track of the amount of data accepted in a TXG and
> the time it takes to sync. It dynamically adjusts that amount so that
> each TXG sync takes about 5 seconds (txg_time variable). It also
> clamps the limit to no more than 1/8th of physical memory. "
>
> So, when I fill up that transaction group, that is when I see the
> 4-5 second "I/O burst" of several hundred megabytes per second.
> He also documents that the flush can, and does, issue delays to the
> writing threads, which is why I see those momentary drops in
> throughput and sluggish system performance while the write buffer is
> flushed to disk.
>   

Yes, this tends to be more efficient. You can tune it by setting
zfs_txg_synctime, which defaults to 5 (seconds).  It is rare that we've
seen tuning it be a win, which is why we don't mention it in the Evil
Tuning Guide.
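
For scale: at ~100MB/sec inbound and a 5-10 second txg cycle, each txg
accepts on the order of 0.5-1GB, well under the 1/8-of-physical-memory
clamp (8GB on a 64GB box), so the bursts in your iostat output are about
what you'd expect. If you do decide to experiment, here is a rough
sketch of the usual ways to adjust it (the value of 1 second below is
purely an illustration, and it assumes a build recent enough to have
the zfs_txg_synctime variable).

In /etc/system (takes effect at the next boot):

  set zfs:zfs_txg_synctime = 1

Or live on a running system with mdb (reverts at reboot):

  # echo zfs_txg_synctime/D | mdb -k        <- read the current value
  # echo zfs_txg_synctime/W 0t1 | mdb -kw   <- write a new value (decimal 1)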

> I wish there were a better way to handle that, but at the speed I'm
> writing (and I'll be getting a 10GigE link soon), I don't see any
> other graceful way of handling that much data in a buffer.
>   

I think your workload might change dramatically when you get a
faster pipe, so unless you really feel compelled, I wouldn't
suggest changing it.
 -- richard

> Loving these X4540's so far though...
>
>   

_______________________________________________
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
