The question put forth is whether the ZFS 128K blocksize is sufficient to saturate a regular disk. There is a great body of evidence showing that bigger write sizes, and a matching large FS cluster size, lead to more throughput. The counterpoint is that ZFS schedules its I/O like nothing else seen before and manages to saturate a single disk using enough concurrent 128K I/Os.
(There are a few things I did here for the first time, so I may have erred in places; I am proposing this for review by the community.)

I first measured the throughput of a write(2) to the raw device using, for instance:

    dd if=/dev/zero of=/dev/rdsk/c1t1d0s0 bs=8192k count=1024

On Solaris there is some overhead from reading the block from /dev/zero and then issuing the write call. The tightest function that fences the I/O is default_physio(): it issues the I/O to the device and then waits for it to complete. If we take the elapsed time spent in this function and count the bytes that are I/O-ed, we get a good hint of the throughput the device is providing. The above dd command issues a single I/O at a time (the d-script used to measure this is attached, and sketched below). Trying different blocksizes I see:

    Bytes sent   Elapsed in physio   Avg I/O size   Throughput
      8 MB          3576 ms              16 KB         2 MB/s
      9 MB          1861 ms              32 KB         4 MB/s
     31 MB          3450 ms              64 KB         8 MB/s
     78 MB          4932 ms             128 KB        15 MB/s
    124 MB          4903 ms             256 KB        25 MB/s
    178 MB          4868 ms             512 KB        36 MB/s
    226 MB          4824 ms            1024 KB        46 MB/s
    226 MB          4816 ms            2048 KB        46 MB/s
     32 MB           686 ms            4096 KB        46 MB/s
    224 MB          4741 ms            8192 KB        47 MB/s

Now let's see what ZFS gets. I measure using a single dd process; ZFS will chunk the data up into 128K blocks. The dd command interacts with memory, but the I/Os are scheduled under the control of spa_sync(). So in the d-script (attached, and sketched below) I watch for the start of a spa_sync and time it until it completes, and during that interval I gather the number of bytes and the count of I/Os (bdev_strategy) being issued. When the spa_sync completes we are sure that all of those bytes are on stable storage. The script is a bit more complex because there are two threads that issue spa_sync, but only one of them actually becomes active, so the script prints some spurious lines of output at times. I measure I/O with the script while this runs:

    dd if=/dev/zero of=/zfs2/roch/f1 bs=1024k count=8000

And I see:

    Bytes sent   Elapsed in spa_sync   Avg I/O size   Throughput
    1431 MB          23723 ms              127 KB       60 MB/s
    1387 MB          23044 ms              127 KB       60 MB/s
    2680 MB          44209 ms              127 KB       60 MB/s
    1359 MB          24223 ms              127 KB       56 MB/s
    1143 MB          19183 ms              126 KB       59 MB/s

OK, I cheated: here ZFS is given a full disk to play with, and in that case ZFS enables the write cache. Note that even with the write cache enabled, spa_sync() completes only after a flush of the cache has been executed, so the 60 MB/s does correspond to data that has reached the platter. I just tried disabling the cache (with format -e), but I am not sure whether ZFS takes that into account; results are the same 60 MB/s. This will have to be confirmed.

With the write cache enabled, the physio test reaches 66 MB/s as soon as we issue 16 KB I/Os, but there the data is clearly not on the platter when the timed function completes. Another variable not fully controlled is the physical (cylinder) location of the I/Os; some of the difference could come from that.

What do I take away? A single 2 MB physical I/O gets 46 MB/s out of my disk.
35 concurrent 128K I/Os sustained, followed by the metadata I/O and a flush of the write cache, allow ZFS to get 60 MB/s out of the same disk. This is what underwrites my belief that the 128K blocksize is sufficiently large. Now, nothing here proves that 256K would not give more throughput, so nothing is really settled; but I hope this helps put us on common ground.

-r

Attachment: phys.d (DTrace script)
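
A minimal sketch of the idea behind phys.d (not the script itself): it assumes fbt probes on default_physio() and that the requested byte count can be read from the uio argument, taken here as args[5].

    #!/usr/sbin/dtrace -s

    #pragma D option quiet

    /*
     * Sketch only: account the elapsed time spent in default_physio()
     * and the bytes each call is asked to move.  The byte count is read
     * from the uio argument (argument position assumed); the attached
     * phys.d is the version actually used for the numbers above.
     */

    fbt::default_physio:entry
    {
            self->start = timestamp;
            self->bytes = args[5]->uio_resid;   /* bytes requested */
    }

    fbt::default_physio:return
    /self->start/
    {
            @bytes   = sum(self->bytes);
            @elapsed = sum(timestamp - self->start);
            @calls   = count();
            self->start = 0;
            self->bytes = 0;
    }

    profile:::tick-5s
    {
            normalize(@elapsed, 1000000);       /* ns -> ms */
            printa("bytes in physio : %@d\n", @bytes);
            printa("ms of physio    : %@d\n", @elapsed);
            printa("physio calls    : %@d\n", @calls);
            trunc(@bytes);
            trunc(@elapsed);
            trunc(@calls);
    }

Throughput is then simply the bytes divided by the elapsed time.
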
Attachment: spa_sync.d (DTrace script)
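
Similarly, a minimal sketch of the idea behind spa_sync.d, assuming fbt probes on spa_sync() and bdev_strategy() and using rough global counters (again, not the attached script itself):

    #!/usr/sbin/dtrace -s

    #pragma D option quiet

    /*
     * Sketch only: time each spa_sync() and count the bytes pushed
     * through bdev_strategy() while it runs.  All block I/O on the box
     * is counted (fine on an otherwise idle system), and since a second
     * thread also enters spa_sync() without doing the work, some output
     * lines will be spurious, as noted above.
     */

    int64_t bytes;
    int64_t ios;

    fbt::spa_sync:entry
    {
            self->start = timestamp;
            bytes = 0;
            ios = 0;
    }

    fbt::bdev_strategy:entry
    {
            bytes += args[0]->b_bcount;
            ios++;
    }

    fbt::spa_sync:return
    /self->start/
    {
            this->ms = (timestamp - self->start) / 1000000;
            printf("%d MB; %d ms of spa_sync; %d I/Os; avg sz : %d KB\n",
                bytes / 1048576, this->ms, ios,
                ios > 0 ? bytes / ios / 1024 : 0);
            self->start = 0;
    }

Each completed spa_sync then prints one line in roughly the format of the table above.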