On Tue, Aug 11, 2009 at 7:33 AM, Ed Spencer<ed_spen...@umanitoba.ca> wrote:
> I've come up with a better name for the concept of file and directory
> fragmentation, which is "Filesystem Entropy": where, over time, an
> active and volatile filesystem moves from an organized state to a
> disorganized state, resulting in backup difficulties.
>
> Here are some stats which illustrate the issue:
>
> First the development mail server:
> ==================================
> (Jumbo frames, Nagle disabled and tcp_xmit_hiwat,tcp_recv_hiwat set
> to 2097152)
>
> Small file workload (copy from zfs on iscsi network to local ufs
> filesystem)
> # zpool iostat 10
>                capacity     operations    bandwidth
> pool         used  avail   read  write   read  write
> ----------  -----  -----  -----  -----  -----  -----
> space       70.5G  29.0G      3      0   247K  59.7K
> space       70.5G  29.0G    136      0  8.37M      0
> space       70.5G  29.0G    115      0  6.31M      0
> space       70.5G  29.0G    108      0  7.08M      0
> space       70.5G  29.0G    105      0  3.72M      0
> space       70.5G  29.0G    135      0  3.74M      0
> space       70.5G  29.0G    155      0  6.09M      0
> space       70.5G  29.0G    193      0  4.85M      0
> space       70.5G  29.0G    142      0  5.73M      0
> space       70.5G  29.0G    159      0  7.87M      0

So you are averaging about 6 MB/s on a small file workload.  The
average read size was about 44 KB.  This throughput could be limited
by the file creation rate on UFS.  Perhaps a better command for
judging how fast a single stream can read is "tar cf /dev/null $dir".
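(In case it is handy, here is roughly how I pull the average read size
out of a finite zpool iostat run -- just a sketch, with the pool name
"space" and the sample count of 6 as placeholders, and it assumes the
operations column stays below 1K so no suffix handling is needed
there.)

  # zpool iostat space 10 6 | nawk '
      $4 !~ /^[0-9.]+$/ { next }     # skip header and separator lines
      ++n == 1          { next }     # skip the since-boot summary line
      {
          # convert the read-bandwidth column (e.g. 8.37M, 247K) to bytes
          mult = index($6, "M") ? 1048576 : index($6, "K") ? 1024 : 1
          ops += $4; bytes += $6 * mult
      }
      END {
          if (ops) printf("average read size: %.0f KB\n", bytes / ops / 1024)
      }'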
> Large File workload (cd and dvd iso's)
> # zpool iostat 10
>                capacity     operations    bandwidth
> pool         used  avail   read  write   read  write
> ----------  -----  -----  -----  -----  -----  -----
> space       70.5G  29.0G      3      0   224K  59.8K
> space       70.5G  29.0G    462      0  57.8M      0
> space       70.5G  29.0G    427      0  53.5M      0
> space       70.5G  29.0G    406      0  50.8M      0
> space       70.5G  29.0G    430      0  53.8M      0
> space       70.5G  29.0G    382      0  47.9M      0

Here the average throughput was about 53 MB/s, with the average read
size at 128 KB.  Note that 128 KB is not only the largest block size
that ZFS supports, it is also the default value of maxphys.  Tuning
maxphys to 1 MB may give you a performance boost, so long as the files
are contiguous.  Unless the files were trickled in very slowly with a
lot of other IO at the same time, they are probably mostly contiguous.
Spread across your 1 Gbit links, they are at about 25% capacity -
good.  I assume you have similar load balancing at the NetApp side
too.
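(If you want to experiment with the maxphys suggestion, this is
roughly the usual procedure -- a sketch I have not verified on your
particular build, so try it on the development box first.)

  # Current value, in bytes, from the running kernel:
  echo 'maxphys/D' | mdb -k

  # Make 1 MB persistent across reboots by adding this line to
  # /etc/system and rebooting:
  #   set maxphys=1048576

  # Or patch the running kernel (test systems only):
  echo 'maxphys/W 0t1048576' | mdb -kw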
In a previous message you said that this server was seeing better
backup throughput than the production server.  How does the mixture of
large files vs. small files compare on the two systems?

> The production mail server:
> ===========================
> Mail system is running with 790 imap users logged in (low imap work
> load).
> Two backup streams are running.
> Not using jumbo frames, nagle enabled, tcp_xmit_hiwat,tcp_recv_hiwat
> set to 2097152
> - we've never seen any effect of changing the iscsi transport
>   parameters under this small file workload.
>
> # zpool iostat 10
>                capacity     operations    bandwidth
> pool         used  avail   read  write   read  write
> ----------  -----  -----  -----  -----  -----  -----
> space       1.06T   955G     96     69  5.20M  2.69M
> space       1.06T   955G    175    105  8.96M  2.22M
> space       1.06T   955G    182     16  4.47M   546K
> space       1.06T   955G    170     16  4.82M  1.85M
> space       1.06T   955G    145    159  4.23M  3.19M
> space       1.06T   955G    138     15  4.97M  92.7K
> space       1.06T   955G    134     15  3.82M  1.71M
> space       1.06T   955G    109    123  3.07M  3.08M
> space       1.06T   955G    106     11  3.07M  1.34M
> space       1.06T   955G    120     17  3.69M  1.74M

Here your average read throughput is about 4.6 MB/s with an average
read size of 47 KB.  That looks a lot like the simulation in the
non-production environment.  I would guess that the average message
size is somewhere in the 40 - 50 KB range.

> # prstat -mL
>    PID USERNAME  USR SYS TRP TFL DFL LCK SLP LAT VCX ICX SCL SIG PROCESS/LWPID
>  12438 root       12 6.9 0.0 0.0 0.0 0.0  81 0.1 508  84  4K   0 save/1
>  27399 cyrus      15 0.5 0.0 0.0 0.0 0.0  85 0.0  18  10 297   0 imapd/1
>  20230 root      3.9 8.0 0.0 0.0 0.0 0.0  88 0.1 393  33  2K   0 save/1
[snip]

The "save" process is from Networker, right?  These processes do not
look CPU-bound (less than 20% on CPU).

In a previous message you showed iostat data at a time when backups
weren't running.  I've reproduced it below, removing the device column
for the sake of formatting.

> iostat -xpn 5 5
>                     extended device statistics
>     r/s    w/s    kr/s    kw/s wait actv wsvc_t asvc_t  %w  %b device
>    17.1   29.2   746.7   317.1  0.0  0.6    0.0   12.5   0  27
>    25.0   11.9   991.9   277.0  0.0  0.6    0.0   16.1   0  36
>    14.9   17.9   423.0   406.4  0.0  0.3    0.0   10.2   0  21
>    20.8   17.4   588.9   361.2  0.0  0.4    0.0   11.5   0  30
>
> and:
>     r/s    w/s    kr/s    kw/s wait actv wsvc_t asvc_t  %w  %b device
>    11.9   43.0   528.9  1972.8  0.0  2.1    0.0   38.9   0  31
>    17.0   19.6   496.9  1499.0  0.0  1.4    0.0   38.8   0  39
>    14.0   30.0   670.2  1971.3  0.0  1.7    0.0   38.0   0  34
>    19.7   28.7   985.2  1647.6  0.0  1.6    0.0   32.5   0  37
>
> and:
>     r/s    w/s    kr/s    kw/s wait actv wsvc_t asvc_t  %w  %b device
>    22.7   41.3   973.7   423.5  0.0  0.8    0.0   11.8   0  34
>    27.9   20.0  1474.7   344.0  0.0  0.8    0.0   16.7   0  42
>    15.1   17.9  1318.7   463.7  0.0  0.6    0.0   17.7   0  19
>    22.3   19.5  1801.7   406.7  0.0  0.8    0.0   20.0   0  29

Service times are in the 10 - 39 ms range.  In the middle set it looks
like there is some heavier than normal write activity (not more
writes, just bigger writes), and this seems to impact asvc_t.

Let's look back at something I said the other day...

| Have you looked at iostat data to be sure that you are seeing asvc_t
| + wsvc_t that supports the number of operations that you need to
| perform?  That is, if asvc_t + wsvc_t for a device adds up to 10 ms,
| a workload that waits for the completion of one I/O before issuing
| the next will max out at 100 iops.  Presumably ZFS should hide some
| of this from you[1], but it does suggest that each backup stream
| would be limited to about 100 files per second[2].  This is because
| the read request for one file does not happen before the close of
| the previous file[3].  Since cyrus stores each message as a separate
| file, this suggests that 2.5 MB/s corresponds to an average mail
| message size of 25 KB.

It seems reasonable, based on the iostat data, to say that the typical
asvc_t is no better than 15 ms.  Since the IO for one file does not
start until the previous one has completed, we can get no more than:

   1000 ms/sec
   -----------  =  67 sequential operations per second
     15 ms/io

By "sequential" I mean that one doesn't start until the other
finishes.  There is certainly a better word, but it escapes me at the
moment.  At an average file size of 45 KB, that translates to about
3 MB/s.  Since you run two data streams, you are seeing throughput
that looks kinda like 2 * 3 MB/s.  With 4 backup streams do you get
something that looks like 4 * 3 MB/s?  How does that affect iostat
output?

-- 
Mike Gerdts
http://mgerdts.blogspot.com/
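P.S.  If you want to try the 4-stream experiment without tying up
Networker, something like the following rough sketch would do it.  The
directory names are hypothetical -- substitute four real cyrus
partitions -- and note that the stock Solaris tar really reads the
file data, while GNU tar skips the reads when the archive is
/dev/null.

  #!/bin/sh
  # Read four directories in parallel, one sequential tar stream each,
  # and capture iostat while they run.
  dirs="/space/dir1 /space/dir2 /space/dir3 /space/dir4"   # hypothetical paths

  iostat -xpn 5 > /tmp/iostat.out &
  iostat_pid=$!

  pids=""
  for d in $dirs; do
      tar cf /dev/null "$d" &        # one reader per backup-stream stand-in
      pids="$pids $!"
  done
  wait $pids                         # wait for the tar streams only
  kill $iostat_pid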