Robert says:
> Just to be sure - you did reconfigure the system to actually allow larger I/O sizes?

Sure enough, I messed up (I had done no tuning to get the above data), so 1 MB was my maximum transfer size. Using 8 MB I now see:

  Bytes sent; elapsed phys I/O time; avg I/O size; throughput

    8 MB; 3576 ms of phys; avg sz :    16 KB; throughput  2 MB/s
    9 MB; 1861 ms of phys; avg sz :    32 KB; throughput  4 MB/s
   31 MB; 3450 ms of phys; avg sz :    64 KB; throughput  8 MB/s
   78 MB; 4932 ms of phys; avg sz :   128 KB; throughput 15 MB/s
  124 MB; 4903 ms of phys; avg sz :   256 KB; throughput 25 MB/s
  178 MB; 4868 ms of phys; avg sz :   512 KB; throughput 36 MB/s
  226 MB; 4824 ms of phys; avg sz :  1024 KB; throughput 46 MB/s
  226 MB; 4816 ms of phys; avg sz :  2048 KB; throughput 54 MB/s (was 46 MB/s)
   32 MB;  686 ms of phys; avg sz :  4096 KB; throughput 58 MB/s (was 46 MB/s)
  224 MB; 4741 ms of phys; avg sz :  8192 KB; throughput 59 MB/s (was 47 MB/s)
  272 MB; 4336 ms of phys; avg sz : 16384 KB; throughput 58 MB/s (new data)
  288 MB; 4327 ms of phys; avg sz : 32768 KB; throughput 59 MB/s (new data)

The data was corrected after it was pointed out that physio will be throttled by maxphys. The new data was obtained after setting, in /etc/system:

  set maxphys=8388608

in /kernel/drv/sd.conf:

  sd_max_xfer_size=0x800000

and in /kernel/drv/ssd.conf:

  ssd_max_xfer_size=0x800000

and then setting un_max_xfer_size in "struct sd_lun". That address was figured out using dtrace and knowing that sdmin() calls ddi_get_soft_state (details available upon request). And of course the write cache was disabled (using format -e).

With this in place I verified that each sdwrite() of up to 8 MB led to a single biodone interrupt, using this:

  dtrace -n 'biodone:entry,sdwrite:entry{@[probefunc, stack(20)]=count()}'

Note that for 16 MB and 32 MB raw device writes, each default_physio() will issue a series of 8 MB I/Os, so we don't expect any more throughput from those sizes.
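The shape of that table is what you would expect from a simple fixed-cost-per-I/O model: each synchronous I/O pays a roughly constant overhead (seek, interrupt, wakeup) plus transfer time at the media rate. The sketch below is only illustrative; the ~7 ms per-I/O overhead and 60 MB/s media rate are assumed values fitted by eye to the table above, not measured ones.

```python
# Illustrative model: throughput of a stream of synchronous I/Os where
# each I/O pays a fixed overhead plus data transfer at the media rate.
# Both constants are assumptions fitted by eye to the table above.
FIXED_COST_S = 0.007      # assumed per-I/O overhead (seek + interrupt + wakeup)
MEDIA_RATE_MBS = 60.0     # assumed media transfer rate

def model_throughput(io_size_kb):
    """Predicted throughput (MB/s) for back-to-back I/Os of the given size."""
    size_mb = io_size_kb / 1024.0
    elapsed_s = FIXED_COST_S + size_mb / MEDIA_RATE_MBS
    return size_mb / elapsed_s

for kb in [16, 32, 64, 128, 256, 512, 1024, 2048, 4096, 8192]:
    print(f"{kb:5d} KB -> {model_throughput(kb):5.1f} MB/s")
```

With those two assumed constants the model reproduces the measured curve reasonably well: a couple of MB/s at 16 KB, climbing toward the high 50s by 8 MB, with diminishing returns once the transfer time dwarfs the fixed cost.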
The script used to measure the rates (phys.d) was also modified, since I was counting the bytes before the I/O had completed, and that made a big difference for the very large I/O sizes.

If you take the 8 MB case, the above rates correspond to the time it takes to issue and wait for a single 8 MB I/O to the sd driver. So this time certainly includes one seek and ~0.13 seconds of data transfer, then the time to respond to the interrupt, and finally the wakeup of the thread waiting in default_physio(). Given that the data transfer rate using 4 MB is very close to the one using 8 MB, I'd say that at 60 MB/s all the fixed-cost elements are well amortized. So I would conclude from this that the limiting factor is now the device itself, or the data channel between the disk and the host.

Now recall the throughput that ZFS gets during a spa_sync when driven by a single dd, knowing that ZFS will work with 128K I/Os:

  1431 MB; 23723 ms of spa_sync; avg sz : 127 KB; throughput 60 MB/s
  1387 MB; 23044 ms of spa_sync; avg sz : 127 KB; throughput 60 MB/s
  2680 MB; 44209 ms of spa_sync; avg sz : 127 KB; throughput 60 MB/s
  1359 MB; 24223 ms of spa_sync; avg sz : 127 KB; throughput 56 MB/s
  1143 MB; 19183 ms of spa_sync; avg sz : 126 KB; throughput 59 MB/s

My disk is <HITACHI-DK32EJ36NSUN36G-PQ08-33.92GB>.

As you say, we don't measure things the same way. At the dd-to-raw level I think our data, with my mistake corrected, will now be similar. At the ZFS level we cannot use iostat quite _yet_, because of:

  6415647 Sequential writing is jumping

With iostat, the 1-second average will at times see periods in which we don't issue any I/O, so it's not a good measure of the capacity of a disk. This is why I reverted to my script, which times the I/O rate but only "when it counts". When we fix 6415647, the expectation is that we will sustain that throughput for whatever time is necessary.
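As a cross-check on the arithmetic above, the sketch below (plain Python, nothing Solaris-specific) recomputes the 8 MB transfer time at the measured ~60 MB/s media rate, and the throughput implied by each spa_sync sample; everything here is straight division on figures quoted in the message.

```python
# Recompute two figures quoted above, using only numbers from the message.

# 1) Pure data-transfer time for a single 8 MB I/O at the ~60 MB/s media rate.
media_rate_mbs = 60.0
transfer_s = 8.0 / media_rate_mbs
print(f"8 MB at {media_rate_mbs:.0f} MB/s: {transfer_s:.2f} s of transfer")

# 2) Throughput implied by the spa_sync samples (MB written, ms in spa_sync).
samples = [(1431, 23723), (1387, 23044), (2680, 44209), (1359, 24223), (1143, 19183)]
for mb, ms in samples:
    print(f"{mb:4d} MB / {ms:5d} ms -> {mb / (ms / 1000.0):4.1f} MB/s")
```

The transfer-time figure comes out at ~0.13 s, and the samples all land in the 56-61 MB/s range, consistent with the quoted throughput column.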
At that point, I expect the throughput as seen from iostat and the throughput from a ptime of dd itself will all converge. And so, after a moment of doubt, I am still inclined to believe that 128K I/Os, when issued properly, can lead to, if not saturation, then very good throughput from a basic disk.

Now, Anton's demonstration is convincing in its own way. I concur that any seek time is unproductive and will degrade throughput at the device level. But if the weak link is the data transfer rate between the device and the host, then it can be argued that the seek time can actually be hidden behind some of the data transfer time. At 60 MB/s, a 128K data transfer takes 2 ms, which may be sufficient to get the head to the next block. My disk does reach >450 IOPS when controlled by ZFS, so it all adds up.

Bear in mind also that throughput is not the only consideration when setting the ZFS recordsize. The smaller the record size, the more manageable the disk blocks will be. So everything is a tradeoff, and at this point 128K appears sufficiently large ... at least for a while.

-r

____________________________________________________________________________________
Roch Bourbonnais
Senior Performance Analyst
Performance & Availability Engineering
Sun Microsystems, Icnc-Grenoble
180, Avenue De L'Europe, 38330, Montbonnot Saint Martin, France
http://icncweb.france/~rbourbon
http://blogs.sun.com/roller/page/roch
[EMAIL PROTECTED]
(+33).4.76.18.83.20

New scripts to measure dd to raw throughput:
Attachment: phys.d (binary data)
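One quick way to sanity-check the 128K arithmetic above: the 60 MB/s rate, 128 KB record size, and >450 IOPS figures all come from the message; the sketch below (plain Python) just multiplies them out.

```python
# Back-of-the-envelope check of the 128K figures quoted above.
RATE_MBS = 60.0   # measured throughput from the message
REC_KB = 128      # ZFS recordsize under discussion

# Time to move one 128 KB record at the bus/media rate.
transfer_ms = (REC_KB / 1024.0) / RATE_MBS * 1000.0
print(f"{REC_KB} KB at {RATE_MBS:.0f} MB/s: {transfer_ms:.1f} ms per transfer")

# Throughput implied by the observed >450 IOPS under ZFS.
iops = 450
mbs = iops * REC_KB / 1024.0
print(f"{iops} IOPS x {REC_KB} KB = {mbs:.2f} MB/s")
```

The per-record transfer time lands at ~2.1 ms and 450 IOPS of 128 KB records works out to ~56 MB/s, right next to the measured ~60 MB/s, which is the "it all adds up" point in the message.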
_______________________________________________
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss