On Sat, Aug 8, 2009 at 12:51 PM, Ed Spencer<ed_spen...@umanitoba.ca> wrote:
>
> On Sat, 2009-08-08 at 09:17, Bob Friesenhahn wrote:
>> Many of us here already tested our own systems and found that under
>> some conditions ZFS was offering up only 30MB/second for bulk data
>> reads regardless of how exotic our storage pool and hardware was.
>
> Just so we are using the same units of measurement. Backup/copy
> throughput on our development mail server is 8.5MB/sec. The people
> running our backups would be overjoyed with that performance.
>
> However backup/copy throughput on our production mail server is 2.25
> MB/sec.
>
> The underlying disk is 15000 RPM 146GB FC drives.
> Our performance may be hampered somewhat because the luns are on a
> Network Appliance accessed via iSCSI, but not to the extent that we are
> seeing, and it does not account for the throughput difference in the
> development and production pools.

NetApp filers run WAFL - Write Anywhere File Layout.  Even if ZFS
arranged everything perfectly (however that is defined), WAFL would
undo its hard work.

Since you are using iSCSI, I assume that you have disabled the Nagle
algorithm and increased tcp_xmit_hiwat and tcp_recv_hiwat.  If not,
go do that now.
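
For reference, a rough sketch of how that might look with ndd (the
buffer sizes are only examples, and ndd settings do not survive a
reboot, so add them to an init script if they turn out to help):

  # setting the Nagle limit to 1 byte effectively disables the algorithm
  ndd -set /dev/tcp tcp_naglim_def 1
  # raise the default TCP send and receive buffers to 1 MB
  ndd -set /dev/tcp tcp_xmit_hiwat 1048576
  ndd -set /dev/tcp tcp_recv_hiwat 1048576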

> When I talk about fragmentation its not in the normal sense. I'm not
> talking about blocks in a file not being sequential. I'm talking about
> files in a single directory that end up spread across the entire
> filesytem/pool.

It's tempting to think that if the files were in roughly the same area
of the block device that ZFS sees, reading them sequentially would at
least trigger read-ahead at the filer.  I suspect that even a moderate
amount of file creation and deletion would make the I/O pattern random
enough (not purely sequential) that the back-end storage would not
have a reasonable chance of recognizing it as a good time for
read-ahead.  Further, the backup application is probably in a loop
like:

while there are more files in the directory
    if next file mtime > last backup time
        open file
        read file contents, send to backup stream
        close file
    end if
end while

As a result, other I/O operations are interspersed between the
sequential data reads, some files are likely to be skipped, and there
is latency introduced by writing to the data stream.  I would be
surprised to see any file system do intelligent read-ahead here.  In
short, lots of small file operations make backups, and especially
restores, go slowly.  More backup and restore streams will almost
certainly help.  Multiplex the streams so that you can keep your tapes
moving at a constant speed.

Do you have statistics on network utilization to ensure that you
aren't stressing it?
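
If not, a quick way to get some (the interface name below is only a
placeholder for whatever your iSCSI traffic actually rides on):

  # per-interface packet counts, errors, and collisions every 5 seconds
  netstat -i 5
  # byte counters for a specific NIC (swap in rbytes64 for the receive side)
  kstat -p -m e1000g -s obytes64 5

nicstat, if you have it installed, rolls the same counters up into
MB/s and a rough utilization percentage, which is easier to read at a
glance.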

Have you looked at iostat data to be sure that you are seeing asvc_t +
wsvc_t that supports the number of operations that you need to
perform?  That is, if asvc_t + wsvc_t for a device adds up to 10 ms, a
workload that waits for the completion of one I/O before issuing the
next will max out at 100 IOPS.  Presumably ZFS should hide some of
this from you[1], but it does suggest that each backup stream would be
limited to about 100 files per second[2].  This is because the read
request for one file does not happen before the close of the previous
file[3].  Since Cyrus stores each message as a separate file, this
suggests that 2.5 MB/s corresponds to an average mail message size of
25 KB.

1. via metadata caching, read-ahead on file data reads, etc.
2. Assuming wsvc_t + asvc_t = 10 ms
3. Assuming that networker is about as smart as tar, zip, cpio, etc.
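
As a sketch, the kind of check and back-of-the-envelope arithmetic I
have in mind (the 10 ms figure is only illustrative):

  # extended per-device statistics, non-idle devices only, 5-second samples
  iostat -xzn 5

  wsvc_t + asvc_t ~= 10 ms per I/O
  => 1 / 0.010 s ~= 100 serialized file reads per second per stream
  2.5 MB/s / 100 files per second ~= 25 KB per file (i.e. per message)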

> My problem right now is diagnosing the performance issues.  I can't
> address them without understanding the underlying cause.  There is a
> lack of tools to help in this area. There is also a lack of acceptance
> that I'm actually having a problem with zfs. Its frustrating.

This is a prime example of why Sun needs to sell Analytics[4][5] as an
add-on to Solaris in general.  This problem is just as hard to figure
out on Solaris as it is on Linux, Windows, etc.  If Analytics were
bundled with Gold and above support contracts, it would be a very
compelling reason to shell out a few extra bucks for a better support
contract.

4. http://blogs.sun.com/bmc/resource/cec_analytics.pdf
5. http://blogs.sun.com/brendan/category/Fishworks

> Anyone know how to significantly increase the performance of a zfs
> filesystem without causing any downtime to an Enterprise email system
> used by 30,000 intolerant people, when you don't really know what is
> causing the performance issues in the first place? (Yeah, it sucks to be
> me!)

Hopefully I've helped find a couple places to look...

-- 
Mike Gerdts
http://mgerdts.blogspot.com/
