Hello Roch,

Monday, May 15, 2006, 3:23:14 PM, you wrote:
RBPE> The question put forth is whether the ZFS 128K blocksize is sufficient
RBPE> to saturate a regular disk. There is a great body of evidence showing
RBPE> that bigger write sizes and a matching large FS cluster size lead to
RBPE> more throughput. The counterpoint is that ZFS schedules its I/O like
RBPE> nothing else seen before and manages to saturate a single disk using
RBPE> enough concurrent 128K I/Os.

Nevertheless I get much more throughput using UFS and writing with large
blocks than using ZFS on the same disk. And the difference is actually
quite big in favor of UFS.

RBPE> <There are a few things I did here for the first time, so I may have
RBPE> erred in places. So I am proposing this for review by the community.>

RBPE> I first measured the throughput of a write(2) to the raw device using,
RBPE> for instance, this:

RBPE>   dd if=/dev/zero of=/dev/rdsk/c1t1d0s0 bs=8192k count=1024

RBPE> On Solaris we would see some overhead of reading the block from
RBPE> /dev/zero and then issuing the write call. The tightest function that
RBPE> fences the I/O is default_physio(). That function will issue the I/O to
RBPE> the device and then wait for it to complete. If we take the elapsed time
RBPE> spent in this function and count the bytes that are I/O-ed, this
RBPE> should give a good hint as to the throughput the device is
RBPE> providing. The above dd command will issue a single I/O at a time
RBPE> (d-script to measure is attached).

[A rough sketch of such a script is included at the end of this message.]

RBPE> Trying different blocksizes I see:

RBPE>   Bytes Sent   Elapse of phys IO   Avg I/O Size   Throughput
RBPE>      8 MB          3576 ms            16 KB          2 MB/s
RBPE>      9 MB          1861 ms            32 KB          4 MB/s
RBPE>     31 MB          3450 ms            64 KB          8 MB/s
RBPE>     78 MB          4932 ms           128 KB         15 MB/s
RBPE>    124 MB          4903 ms           256 KB         25 MB/s
RBPE>    178 MB          4868 ms           512 KB         36 MB/s
RBPE>    226 MB          4824 ms          1024 KB         46 MB/s
RBPE>    226 MB          4816 ms          2048 KB         46 MB/s
RBPE>     32 MB           686 ms          4096 KB         46 MB/s
RBPE>    224 MB          4741 ms          8192 KB         47 MB/s

Just to be sure - did you reconfigure the system to actually allow larger
I/O sizes?

RBPE> Now let's see what ZFS gets. I measure using a single dd process. ZFS
RBPE> will chunk up data in 128K blocks. Now the dd command interacts with
RBPE> memory, but the I/Os are scheduled under the control of spa_sync(). So
RBPE> in the d-script (attached) I check for the start of an spa_sync and
RBPE> time it based on elapsed time. At the same time I gather the number of
RBPE> bytes and keep a count of the I/Os (bdev_strategy) that are being
RBPE> issued. When the spa_sync completes we are sure that all of those are
RBPE> on stable storage. The script is a bit more complex because there are
RBPE> 2 threads that issue spa_sync, but only one of them actually becomes
RBPE> activated. So the script will print out some spurious lines of output
RBPE> at times. I measure I/O with the script while this runs:

RBPE>   dd if=/dev/zero of=/zfs2/roch/f1 bs=1024k count=8000

RBPE> And I see:

RBPE>   1431 MB; 23723 ms of spa_sync; avg sz : 127 KB; throughput 60 MB/s
RBPE>   1387 MB; 23044 ms of spa_sync; avg sz : 127 KB; throughput 60 MB/s
RBPE>   2680 MB; 44209 ms of spa_sync; avg sz : 127 KB; throughput 60 MB/s
RBPE>   1359 MB; 24223 ms of spa_sync; avg sz : 127 KB; throughput 56 MB/s
RBPE>   1143 MB; 19183 ms of spa_sync; avg sz : 126 KB; throughput 59 MB/s
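[A minimal sketch of the kind of d-script described above - not the
attached script itself - timing each spa_sync() and counting the bytes
handed to bdev_strategy() while it runs. The overall structure and the
use of global counters are my assumptions; with two threads calling
spa_sync() the counters can get mixed, which matches the "spurious
lines" caveat above.]

  #!/usr/sbin/dtrace -s
  #pragma D option quiet

  /* Note the start time and reset the counters when a txg sync begins. */
  fbt::spa_sync:entry
  /self->ts == 0/
  {
          self->ts = timestamp;
          in_sync++;
          bytes = 0;
          ios = 0;
  }

  /*
   * Count every buffer handed to the block driver while a sync is in
   * progress. bytes/ios are global because the ZIO threads issuing the
   * I/O are not the thread that called spa_sync().
   */
  fbt::bdev_strategy:entry
  /in_sync/
  {
          bytes += args[0]->b_bcount;
          ios++;
  }

  /* On completion, report elapsed time, average I/O size and MB/s. */
  fbt::spa_sync:return
  /self->ts/
  {
          this->ms = (timestamp - self->ts) / 1000000;
          printf("%d MB; %d ms of spa_sync; avg sz : %d KB; throughput %d MB/s\n",
              bytes / 1048576, this->ms,
              ios ? bytes / ios / 1024 : 0,
              this->ms ? (bytes / 1048576) * 1000 / this->ms : 0);
          in_sync--;
          self->ts = 0;
  }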
RBPE> OK, I cheated. Here, ZFS is given a full disk to play with. In this
RBPE> case ZFS enables the write cache. Note that even with the write cache
RBPE> enabled, when the spa_sync() completes, it will be after a flush of
RBPE> the cache has been executed. So the 60 MB/s does correspond to data
RBPE> set down on the platter. I just tried disabling the cache (with
RBPE> format -e) but I am not sure whether that is taken into account by
RBPE> ZFS; the results are the same 60 MB/s. This will have to be confirmed.

RBPE> With the write cache enabled, the physio test reaches 66 MB/s as soon
RBPE> as we are issuing 16 KB I/Os. Here, clearly though, data is not on the
RBPE> platter when the timed function completes.

RBPE> Another variable not fully controlled is the physical (cylinder)
RBPE> location of the I/Os. It could be that some of the differences come
RBPE> from that.

RBPE> What do I take away?

RBPE>   - a single 2 MB physical I/O will get 46 MB/s out of my disk;

RBPE>   - 35 concurrent 128K I/Os sustained, followed by metadata I/O,
RBPE>     followed by a flush of the write cache, allow ZFS to get 60 MB/s
RBPE>     out of the same disk.

RBPE> This is what underwrites my belief that a 128K blocksize is
RBPE> sufficiently large. Now, nothing here proves that 256K would not give
RBPE> more throughput, so nothing is really settled. But I hope this helps
RBPE> put us on common ground.

This is really interesting, because what I see here with a very similar
test is the opposite. What kind of disk do you use? (Mine is a 15K 73GB FC
disk, connected with dual paths to the host with MPxIO.) I use iostat to
see the actual throughput, you use DTrace - maybe we measure different
things?

--
Best regards,
 Robert                          mailto:[EMAIL PROTECTED]
                                 http://milek.blogspot.com
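[For completeness, a rough sketch of the default_physio() timing used for
the raw-device numbers earlier in the message - again not the attached
script, just the general idea: sum the elapsed time spent inside
default_physio() and the bytes it is asked to move, then print the totals
on exit. The assumption that the uio is the last argument follows the
standard physio(9F) signature and may need adjusting.]

  #!/usr/sbin/dtrace -s
  #pragma D option quiet

  /*
   * Record the entry time and the requested byte count. uio_resid at
   * entry is the number of bytes this physio call will try to move
   * (assumes the physio(9F)-style argument order, uio last).
   */
  fbt::default_physio:entry
  {
          self->ts = timestamp;
          self->bytes = args[5]->uio_resid;
  }

  /* Accumulate bytes and elapsed time once the call returns. */
  fbt::default_physio:return
  /self->ts/
  {
          @bytes = sum(self->bytes);
          @ms    = sum((timestamp - self->ts) / 1000000);
          self->ts = 0;
          self->bytes = 0;
  }

  /* Print the totals when the script is stopped (Ctrl-C). */
  END
  {
          printa("%@d bytes in %@d ms of phys I/O\n", @bytes, @ms);
  }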