On Sat, 2009-08-08 at 15:05, Mike Gerdts wrote:
> On Sat, Aug 8, 2009 at 12:51 PM, Ed Spencer <ed_spen...@umanitoba.ca> wrote:
> >
> > On Sat, 2009-08-08 at 09:17, Bob Friesenhahn wrote:
> >> Many of us here have already tested our own systems and found that
> >> under some conditions ZFS was offering up only 30 MB/second for bulk
> >> data reads, regardless of how exotic our storage pool and hardware
> >> were.
> >
> > Just so we are using the same units of measurement: backup/copy
> > throughput on our development mail server is 8.5 MB/sec. The people
> > running our backups would be overjoyed with that performance.
> >
> > However, backup/copy throughput on our production mail server is
> > 2.25 MB/sec.
> >
> > The underlying disks are 15,000 RPM 146 GB FC drives.
> > Our performance may be hampered somewhat because the LUNs are on a
> > Network Appliance accessed via iSCSI, but not to the extent that we
> > are seeing, and it does not account for the throughput difference
> > between the development and production pools.
> 
> NetApp filers run WAFL - the Write Anywhere File Layout.  Even if ZFS
> arranged everything perfectly (however that is defined), WAFL would
> undo its hard work.
> 
> Since you are using iSCSI, I assume that you have disabled the Nagle
> algorithm and increased tcp_xmit_hiwat and tcp_recv_hiwat.  If not,
> go do that now.
We've tried many different iSCSI parameter changes on our development server:
Jumbo frames
Disabling the Nagle algorithm
I'll double check tcp_xmit_hiwat and tcp_recv_hiwat next week.
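For reference, here's how I plan to check them and, if they look low,
raise them (a sketch using the stock Solaris ndd tunables; the 1 MB
values are placeholders, not tested recommendations):

# ndd /dev/tcp tcp_xmit_hiwat              <- check current send window
# ndd /dev/tcp tcp_recv_hiwat              <- check current receive window
# ndd -set /dev/tcp tcp_xmit_hiwat 1048576
# ndd -set /dev/tcp tcp_recv_hiwat 1048576
# ndd -set /dev/tcp tcp_naglim_def 1       <- a 1-byte limit disables Nagle

(These ndd settings don't survive a reboot, so they'd have to go into
a startup script to stick.)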

Nothing has made any real difference.
We are only using about 5% of the bandwidth on our IP SAN.

We use two Cisco Ethernet switches on the IP SAN. The iSCSI initiators
use MPxIO in a round-robin configuration.
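If it's useful, the load-balance policy on a LUN can be confirmed with
mpathadm (a sketch; the device path is one of our LUNs from the iostat
output further down):

# mpathadm show lu /dev/rdsk/c4t60A98000433469764E4A2D456A644A74d0s2

and checking that "Current Load Balance" reports round-robin.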

> > When I talk about fragmentation it's not in the normal sense. I'm not
> > talking about blocks in a file not being sequential. I'm talking about
> > files in a single directory that end up spread across the entire
> > filesystem/pool.
> 
> It's tempting to think that if the files were in roughly the same area
> of the block device that ZFS sees, reading them sequentially would at
> least trigger a read-ahead at the filer.  I suspect that even a
> moderate amount of file creation and deletion would make the I/O
> pattern random enough (not purely sequential) that the back-end
> storage would not have a reasonable chance of recognizing it as a good
> time for read-ahead.  Further, the backup application is probably in a
> loop like:
> 
> while there are more files in the directory
>    if next file mtime > last backup time
>        open file
>        read file contents, send to backup stream
>        close file
>    end if
> end while
> 
> The effect is that other I/O operations are interspersed between the
> sequential data reads, some files are likely to be skipped, and there
> is latency introduced by writing to the data stream.  I would be
> surprised to see any file system do intelligent read-ahead here.  In
> short, lots of small file operations make backups, and especially
> restores, go slowly.  More backup and restore streams will almost
> certainly help.  Multiplex the streams so that you can keep your tapes
> moving at a constant speed.

We back up to disk first and then copy to tape later.
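For what it's worth, here's a runnable sketch of the per-file loop you
describe (ksh; the file list and the marker/stream paths are
hypothetical placeholders):

#!/bin/ksh
# One file at a time: open, read, close, then move on.
# filelist.txt, /var/backup/last-run and /var/backup/stream are
# hypothetical placeholders.
while IFS= read -r f; do
    # -nt: file modified more recently than the last-backup marker
    if [ "$f" -nt /var/backup/last-run ]; then
        cat "$f" >> /var/backup/stream   # append contents to the stream
    fi
done < filelist.txt

Each pass blocks until the previous file's read completes, which is
what caps a single stream near one file per disk round trip.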

> Do you have statistics on network utilization to ensure that you
> aren't stressing it?
> 
> Have you looked at iostat data to be sure that you are seeing asvc_t +
> wsvc_t that supports the number of operations that you need to
> perform?  That is, if asvc_t + wsvc_t for a device adds up to 10 ms, a
> workload that waits for the completion of one I/O before issuing the
> next will max out at 100 iops.  Presumably ZFS should hide some of
> this from you[1], but it does suggest that each backup stream would be
> limited to about 100 files per second[2].  This is because the read
> request for one file does not happen before the close of the previous
> file[3].  Since Cyrus stores each message as a separate file, this
> suggests that 2.5 MB/s corresponds to an average mail message size of
> 25 KB.
> 
> 1. via metadata caching, read-ahead on file data reads, etc.
> 2. Assuming wsvc_t + asvc_t = 10 ms
> 3. Assuming that NetWorker is about as smart as tar, zip, cpio, etc.
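Spelling out footnotes 2 and 3 with our production number (2.25
MB/sec) and the assumed 10 ms per I/O:

    1 file / 0.010 s          = ~100 files/sec per stream
    2.25 MB/s / 100 files/s   = ~23 KB per message

So if the one-file-at-a-time model holds, our average message size
works out to about 23 KB, close to your 25 KB figure.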

There is a backup of a single filesystem in the pool going on right now:
# zpool iostat 5 5
               capacity     operations    bandwidth
pool         used  avail   read  write   read  write
----------  -----  -----  -----  -----  -----  -----
space       1.05T   965G     97     69  5.24M  2.71M
space       1.05T   965G    113     10  6.41M   996K
space       1.05T   965G    100    112  2.87M  1.81M
space       1.05T   965G    112      8  2.35M  35.9K
space       1.05T   965G    106      3  1.76M  55.1K

Here are examples:
# iostat -xpn 5 5
                    extended device statistics
    r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
   17.1   29.2  746.7  317.1  0.0  0.6    0.0   12.5   0  27 c4t60A98000433469764E4A2D456A644A74d0
   25.0   11.9  991.9  277.0  0.0  0.6    0.0   16.1   0  36 c4t60A98000433469764E4A2D456A696579d0
   14.9   17.9  423.0  406.4  0.0  0.3    0.0   10.2   0  21 c4t60A98000433469764E4A476D2F664E4Fd0
   20.8   17.4  588.9  361.2  0.0  0.4    0.0   11.5   0  30 c4t60A98000433469764E4A476D2F6B385Ad0

and:
    r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
   11.9   43.0  528.9 1972.8  0.0  2.1    0.0   38.9   0  31 c4t60A98000433469764E4A2D456A644A74d0
   17.0   19.6  496.9 1499.0  0.0  1.4    0.0   38.8   0  39 c4t60A98000433469764E4A2D456A696579d0
   14.0   30.0  670.2 1971.3  0.0  1.7    0.0   38.0   0  34 c4t60A98000433469764E4A476D2F664E4Fd0
   19.7   28.7  985.2 1647.6  0.0  1.6    0.0   32.5   0  37 c4t60A98000433469764E4A476D2F6B385Ad0

and:
    r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
   22.7   41.3  973.7  423.5  0.0  0.8    0.0   11.8   0  34 c4t60A98000433469764E4A2D456A644A74d0
   27.9   20.0 1474.7  344.0  0.0  0.8    0.0   16.7   0  42 c4t60A98000433469764E4A2D456A696579d0
   15.1   17.9 1318.7  463.7  0.0  0.6    0.0   17.7   0  19 c4t60A98000433469764E4A476D2F664E4Fd0
   22.3   19.5 1801.7  406.7  0.0  0.8    0.0   20.0   0  29 c4t60A98000433469764E4A476D2F6B385Ad0

Read ops hover around 100 per second across the pool (per the zpool
iostat above) with asvc_t mostly in the 10-40 ms range, which fits
your estimate of roughly 100 files per second per serialized stream.

> > My problem right now is diagnosing the performance issues.  I can't
> > address them without understanding the underlying cause.  There is a
> > lack of tools to help in this area. There is also a lack of acceptance
> > that I'm actually having a problem with ZFS. It's frustrating.
> 
> This is a prime example of why Sun needs to sell Analytics[4][5] as an
> add-on to Solaris in general.  This problem is just as hard to figure
> out on Solaris as it is on Linux, Windows, etc.  If Analytics were
> bundled with Gold and above support contracts, it would be a very
> compelling reason to shell out a few extra bucks for a better support
> contract.
> 
> 4. http://blogs.sun.com/bmc/resource/cec_analytics.pdf
> 5. http://blogs.sun.com/brendan/category/Fishworks
> 

Oh definitely!
It will also give me the opportunity to yell at my drives!
Might help to relieve some stress.
http://sunbeltblog.blogspot.com/2009/01/yelling-at-your-hard-drive.html

> > Anyone know how to significantly increase the performance of a ZFS
> > filesystem without causing any downtime to an Enterprise email system
> > used by 30,000 intolerant people, when you don't really know what is
> > causing the performance issues in the first place? (Yeah, it sucks to
> > be me!)
> 
> Hopefully I've helped you find a couple of places to look...

Thanx

-- 
Ed 

