Re: [zfs-discuss] zfs fragmentation

Mike Gerdts Tue, 11 Aug 2009 07:15:36 -0700

On Tue, Aug 11, 2009 at 7:33 AM, Ed Spencer<ed_spen...@umanitoba.ca> wrote:
> I've come up with a better name for the concept of file and directory
> fragmentation which is, "Filesystem Entropy". Where, over time, an
> active and volatile filesystem moves from an organized state to a
> disorganized state resulting in backup difficulties.
>
> Here are some stats which illustrate the issue:
>
> First the development mail server:
> ==================================
> (Jumbo frames, Nagle disabled and tcp_xmit_hiwat,tcp_recv_hiwat set to
> 2097152)
>
> Small file workload (copy from zfs on iscsi network to local ufs
> filesystem)
> # zpool iostat 10
>               capacity     operations    bandwidth
> pool         used  avail   read  write   read  write
> ----------  -----  -----  -----  -----  -----  -----
> space       70.5G  29.0G      3      0   247K  59.7K
> space       70.5G  29.0G    136      0  8.37M      0
> space       70.5G  29.0G    115      0  6.31M      0
> space       70.5G  29.0G    108      0  7.08M      0
> space       70.5G  29.0G    105      0  3.72M      0
> space       70.5G  29.0G    135      0  3.74M      0
> space       70.5G  29.0G    155      0  6.09M      0
> space       70.5G  29.0G    193      0  4.85M      0
> space       70.5G  29.0G    142      0  5.73M      0
> space       70.5G  29.0G    159      0  7.87M      0


So you are averaging about 6 MB/s on a small file workload.  The
average read size was about 44 KB.

This throughput could be limited by the file creation rate on UFS.
Perhaps a better command for judging how fast a single stream can
read is "tar cf /dev/null $dir".
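
For anyone who wants to check the averages, here is a minimal sketch of
the arithmetic, assuming the read columns of the small file samples
quoted above. The function names are made up for illustration. The
first sample is left out because the first line of zpool iostat output
is normally an average since the pool was imported rather than a
10 second interval.

```python
# Sketch: average read bandwidth and read size from "zpool iostat 10" rows.
# Assumes the operation counts are plain integers (true for these samples).

UNITS = {"K": 1024, "M": 1024 ** 2, "G": 1024 ** 3}


def to_bytes(field):
    """Convert a zpool iostat value such as '8.37M' or '0' to bytes."""
    if field[-1] in UNITS:
        return float(field[:-1]) * UNITS[field[-1]]
    return float(field)


def summarize_reads(lines):
    """Return (mean read MB/s, mean read size in KB) over the samples."""
    read_ops = 0.0
    read_bytes = 0.0
    samples = 0
    for line in lines:
        # Columns: pool, used, avail, read ops, write ops, read bw, write bw
        fields = line.split()
        read_ops += float(fields[3])
        read_bytes += to_bytes(fields[5])
        samples += 1
    mean_mb_per_s = read_bytes / samples / 1024 ** 2
    mean_kb_per_read = read_bytes / read_ops / 1024
    return mean_mb_per_s, mean_kb_per_read


small_file_samples = """\
space       70.5G  29.0G    136      0  8.37M      0
space       70.5G  29.0G    115      0  6.31M      0
space       70.5G  29.0G    108      0  7.08M      0
space       70.5G  29.0G    105      0  3.72M      0
space       70.5G  29.0G    135      0  3.74M      0
space       70.5G  29.0G    155      0  6.09M      0
space       70.5G  29.0G    193      0  4.85M      0
space       70.5G  29.0G    142      0  5.73M      0
space       70.5G  29.0G    159      0  7.87M      0
""".splitlines()

mb_per_s, kb_per_read = summarize_reads(small_file_samples)
print(f"{mb_per_s:.1f} MB/s, {kb_per_read:.0f} KB per read")
```

The read size is total bytes divided by total operations, so busy
intervals count for more than quiet ones.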

> Large File workload (cd and dvd iso's)
> # zpool iostat 10
>               capacity     operations    bandwidth
> pool         used  avail   read  write   read  write
> ----------  -----  -----  -----  -----  -----  -----
> space       70.5G  29.0G      3      0   224K  59.8K
> space       70.5G  29.0G    462      0  57.8M      0
> space       70.5G  29.0G    427      0  53.5M      0
> space       70.5G  29.0G    406      0  50.8M      0
> space       70.5G  29.0G    430      0  53.8M      0
> space       70.5G  29.0G    382      0  47.9M      0

Here the average throughput was about 53 MB/s, with the average read
size at 128 KB.  Note that 128 KB is not only the largest block size
that ZFS supports, it is also the default value of maxphys.  Tuning
maxphys to 1 MB may give you a performance boost, so long as the files
are contiguous.  Unless the files were trickled in very slowly with a
lot of other IO at the same time, they are probably mostly contiguous.

1 Gbit links, they are at about 25% capacity - good.  I assume you
have similar load balancing at the NetApp side too.

In a previous message you said that this server was seeing better
backup throughput than the production server.  How does the mixture of
large files vs. small files compare on the two systems?

> The production mail server:
> ===========================
> Mail system is running with 790 imap users logged in (low imap work
> load).
> Two backup streams are running.
> Not using jumbo frames, nagle enabled, tcp_xmit_hiwat,tcp_recv_hiwat set
> to 2097152
>    - we've never seen any effect of changing the iscsi transport
> parameters
>      under this small file workload.
>
> # zpool iostat 10
>               capacity     operations    bandwidth
> pool         used  avail   read  write   read  write
> ----------  -----  -----  -----  -----  -----  -----
> space       1.06T   955G     96     69  5.20M  2.69M
> space       1.06T   955G    175    105  8.96M  2.22M
> space       1.06T   955G    182     16  4.47M   546K
> space       1.06T   955G    170     16  4.82M  1.85M
> space       1.06T   955G    145    159  4.23M  3.19M
> space       1.06T   955G    138     15  4.97M  92.7K
> space       1.06T   955G    134     15  3.82M  1.71M
> space       1.06T   955G    109    123  3.07M  3.08M
> space       1.06T   955G    106     11  3.07M  1.34M
> space       1.06T   955G    120     17  3.69M  1.74M

Here your average read throughput is about 4.6 MB/s with an average
read size of 47 KB.  That looks a lot like the simulation in the
non-production environment.

I would guess that the average message size is somewhere in the
40 - 50 KB range.

>
> # prstat -mL
>   PID USERNAME USR SYS TRP TFL DFL LCK SLP LAT VCX ICX SCL SIG PROCESS/LWPID
>  12438 root      12 6.9 0.0 0.0 0.0 0.0  81 0.1 508  84  4K   0 save/1
>  27399 cyrus     15 0.5 0.0 0.0 0.0 0.0  85 0.0  18  10 297   0 imapd/1
>  20230 root     3.9 8.0 0.0 0.0 0.0 0.0  88 0.1 393  33  2K   0 save/1
[snip]

The "save" process is from Networker, right?  These processes do not
look CPU bound (less than 20% on CPU).
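
The "less than 20%" figure is just USR plus SYS for each thread.  A
minimal sketch of that check, with the values copied by hand from the
prstat -mL rows above (the dictionary layout and names are only for
illustration):

```python
# Sketch: time on CPU per thread is the sum of the USR and SYS
# microstate columns reported by prstat -mL (both are percentages).

threads = {
    "save/1 (pid 12438)": (12.0, 6.9),
    "imapd/1 (pid 27399)": (15.0, 0.5),
    "save/1 (pid 20230)": (3.9, 8.0),
}


def on_cpu_percent(usr, sys_pct):
    """Percent of wall-clock time the thread spent running on a CPU."""
    return usr + sys_pct


for name, (usr, sys_pct) in threads.items():
    print(f"{name}: {on_cpu_percent(usr, sys_pct):.1f}% on CPU")

busiest = max(on_cpu_percent(usr, sys_pct) for usr, sys_pct in threads.values())
print(f"busiest thread: {busiest:.1f}% on CPU")
```

Most of the remaining time is in the SLP column (81 to 88 percent),
which is consistent with threads that are waiting rather than
computing.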

In a previous message you showed iostat data at a time when backups
weren't running.  I've reproduced it below, removing the device column
for the sake of formatting.

> iostat -xpn 5 5
>                    extended device statistics
>    r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
>
>   17.1   29.2  746.7  317.1  0.0  0.6    0.0   12.5   0  27
>   25.0   11.9  991.9  277.0  0.0  0.6    0.0   16.1   0  36
>   14.9   17.9  423.0  406.4  0.0  0.3    0.0   10.2   0  21
>   20.8   17.4  588.9  361.2  0.0  0.4    0.0   11.5   0  30
>
> and:
>    r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
>   11.9   43.0  528.9 1972.8  0.0  2.1    0.0   38.9   0  31
>   17.0   19.6  496.9 1499.0  0.0  1.4    0.0   38.8   0  39
>   14.0   30.0  670.2 1971.3  0.0  1.7    0.0   38.0   0  34
>   19.7   28.7  985.2 1647.6  0.0  1.6    0.0   32.5   0  37
> and:
>    r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
>   22.7   41.3  973.7  423.5  0.0  0.8    0.0   11.8   0  34
>   27.9   20.0 1474.7  344.0  0.0  0.8    0.0   16.7   0  42
>   15.1   17.9 1318.7  463.7  0.0  0.6    0.0   17.7   0  19
>   22.3   19.5 1801.7  406.7  0.0  0.8    0.0   20.0   0  29

Service times are in the 10 - 39 ms range.  In the middle set, it
looks like there is some heavier than normal write activity (not more
writes, just bigger writes) and this seems to impact asvc_t.
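
One way to check the "bigger writes" reading is to divide kw/s by w/s
for each set.  A rough sketch, assuming the w/s, kw/s and asvc_t
columns copied by hand from the three iostat sets above (the set
labels and function name are made up):

```python
# Sketch: mean write size and mean asvc_t for each iostat sample set.
# Each tuple is (w/s, kw/s, asvc_t in ms) from one iostat row.

sets = {
    "first": [(29.2, 317.1, 12.5), (11.9, 277.0, 16.1),
              (17.9, 406.4, 10.2), (17.4, 361.2, 11.5)],
    "middle": [(43.0, 1972.8, 38.9), (19.6, 1499.0, 38.8),
               (30.0, 1971.3, 38.0), (28.7, 1647.6, 32.5)],
    "last": [(41.3, 423.5, 11.8), (20.0, 344.0, 16.7),
             (17.9, 463.7, 17.7), (19.5, 406.7, 20.0)],
}


def summarize(rows):
    """Return (mean w/s, mean KB per write, mean asvc_t in ms)."""
    total_writes = sum(row[0] for row in rows)
    total_kb = sum(row[1] for row in rows)
    mean_asvc = sum(row[2] for row in rows) / len(rows)
    return total_writes / len(rows), total_kb / total_writes, mean_asvc


summary = {name: summarize(rows) for name, rows in sets.items()}
for name, (writes, kb_per_write, asvc) in summary.items():
    print(f"{name:6s} {writes:5.1f} w/s  {kb_per_write:5.1f} KB/write  "
          f"{asvc:5.1f} ms asvc_t")
```

On these numbers the middle set has a somewhat higher write rate, but
the mean write size is more than three times that of the other two
sets, and the mean asvc_t rises along with it.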

Let's look back at something I said the other day...

| Have you looked at iostat data to be sure that you are seeing asvc_t
| + wsvc_t that supports the number of operations that you need to
| perform?  That is if asvc_t + wsvc_t for a device adds up to 10 ms,
| a workload that waits for the completion of one I/O before issuing
| the next will max out at 100 iops.  Presumably ZFS should hide some
| of this from you[1], but it does suggest that each backup stream
| would be limited to about 100 files per second[2].  This is because
| the read request for one file does not happen before the close of
| the previous file[3].  Since cyrus stores each message as a separate
| file, this suggests that 2.5 MB/s corresponds to average mail
| message size of 25 KB.

It seems reasonable based on the iostat data to say that the typical
asvc_t is no better than 15 ms.  Since the IO for one file does not
start until the previous one has completed, we can get no more than:

    1000 ms/sec
    -----------  = 67 sequential operations per second
      15 ms/io

By "sequential" I mean that one doesn't start until the other
finishes.  There is certainly a better word, but it escapes me at the
moment.

At an average file size of 45 KB, that translates to about 3 MB/sec.
As you run two data streams, you are seeing throughput that looks
kind of like 2 * 3 MB/sec.
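
The estimate above can be written out as a short sketch.  It assumes
one IO per file and that extra streams neither raise the service time
nor overlap with each other's IO, which is exactly the part that needs
measuring.  The function and parameter names are made up for
illustration.

```python
# Sketch: upper bound for a workload that waits for each IO to finish
# before issuing the next one.  Assumes one IO per file and linear
# scaling across streams; real numbers will differ.


def serialized_ceiling(service_time_ms, avg_file_kb, streams=1):
    """Return (files per second, MB per second) for the given streams."""
    ops_per_stream = 1000.0 / service_time_ms
    mb_per_stream = ops_per_stream * avg_file_kb / 1024.0
    return streams * ops_per_stream, streams * mb_per_stream


ops_1, mb_1 = serialized_ceiling(15, 45)
ops_2, mb_2 = serialized_ceiling(15, 45, streams=2)
ops_4, mb_4 = serialized_ceiling(15, 45, streams=4)

print(f"1 stream : {ops_1:.0f} files/s, {mb_1:.1f} MB/s")
print(f"2 streams: {ops_2:.0f} files/s, {mb_2:.1f} MB/s")
print(f"4 streams: {ops_4:.0f} files/s, {mb_4:.1f} MB/s")

# The earlier example: a 10 ms service time caps one stream at 100 iops.
ops_10ms, _ = serialized_ceiling(10, 25)
```

If the measured throughput with 4 streams falls well short of the
4 stream figure, that would suggest the extra streams are pushing
asvc_t up rather than adding parallelism.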

With 4 backup streams do you get something that looks like 4 * 3 MB/s?
How does that affect iostat output?

-- 
Mike Gerdts
http://mgerdts.blogspot.com/
_______________________________________________
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
