Robert Says:

  Just to be sure - you did reconfigure the system to actually
  allow larger I/O sizes?


Sure enough, I messed up (I had done no tuning to get the above
data), so 1 MB was my maximum transfer size. Using 8 MB I now see:

   Bytes sent;  ms of phys;  avg sz;  throughput

   8 MB;   3576 ms of phys; avg sz : 16 KB; throughput 2 MB/s
   9 MB;   1861 ms of phys; avg sz : 32 KB; throughput 4 MB/s
   31 MB;  3450 ms of phys; avg sz : 64 KB; throughput 8 MB/s
   78 MB;  4932 ms of phys; avg sz : 128 KB; throughput 15 MB/s
   124 MB; 4903 ms of phys; avg sz : 256 KB; throughput 25 MB/s
   178 MB; 4868 ms of phys; avg sz : 512 KB; throughput 36 MB/s
   226 MB; 4824 ms of phys; avg sz : 1024 KB; throughput 46 MB/s
   226 MB; 4816 ms of phys; avg sz : 2048 KB; throughput 54 MB/s (was 46 MB/s)
   32 MB;   686 ms of phys; avg sz : 4096 KB; throughput 58 MB/s (was 46 MB/s)
   224 MB; 4741 ms of phys; avg sz : 8192 KB; throughput 59 MB/s (was 47 MB/s)
   272 MB; 4336 ms of phys; avg sz : 16384 KB; throughput 58 MB/s (new data)
   288 MB; 4327 ms of phys; avg sz : 32768 KB; throughput 59 MB/s (new data)

The data was corrected after it was pointed out that physio will
be throttled by maxphys. The new data was obtained after setting:

        /etc/system: set maxphys=8388608
        /kernel/drv/sd.conf sd_max_xfer_size=0x800000
        /kernel/drv/ssd.conf ssd_max_xfer_size=0x800000

        And setting un_max_xfer_size in "struct sd_lun".
        That address was figured out using dtrace and knowing
        that sdmin() calls ddi_get_soft_state() (details
        available upon request; a sketch of the probe idea
        appears below).
        
        And of course disabling the write cache (using format -e)

        With this in place I verified that each sdwrite() of up
        to 8 MB would lead to a single biodone interrupt, using
        this:

        dtrace -n 'biodone:entry,sdwrite:entry{@[probefunc, stack(20)]=count()}'

        Note that for 16 MB and 32 MB raw device writes, each
        default_physio() will issue a series of 8 MB I/Os, so we
        don't expect any more throughput from those.
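
        Here is the gist of the soft-state trick mentioned
        above. This is a rough sketch, not the exact script I
        ran; it assumes fbt probes on sdmin() and
        ddi_get_soft_state(), and simply prints the pointer so
        it can be patched afterwards (e.g. with mdb -kw):

        #!/usr/sbin/dtrace -qs

        /* Flag that this thread is inside sdmin(). */
        fbt::sdmin:entry
        {
                self->in_sdmin = 1;
        }

        /* arg1 of an fbt return probe is the return value; on
           this path it is the "struct sd_lun" pointer whose
           un_max_xfer_size we want to patch. */
        fbt::ddi_get_soft_state:return
        /self->in_sdmin/
        {
                printf("struct sd_lun at %p\n", (void *)arg1);
                self->in_sdmin = 0;
        }

        fbt::sdmin:return
        {
                self->in_sdmin = 0;
        }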


The script used to measure the rates (phys.d, attached) was also
modified, since I was counting the bytes before the I/O had
completed, and that made a big difference for the very large
I/O sizes.
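
The corrected accounting, in spirit, credits the bytes and the
elapsed time only at completion. This is a simplified sketch of
that idea (not the attached phys.d itself; no device filtering,
and fixed 5-second ticks):

        #!/usr/sbin/dtrace -qs

        /* Remember when each buf was issued; arg0 is the buf
           pointer, common to io:::start and io:::done. */
        io:::start
        {
                issued[arg0] = timestamp;
        }

        /* Credit bytes and time only once the I/O completes. */
        io:::done
        /issued[arg0]/
        {
                @bytes = sum(args[0]->b_bcount);
                @ms    = sum(timestamp - issued[arg0]);
                @avgsz = avg(args[0]->b_bcount);
                issued[arg0] = 0;
        }

        tick-5sec
        {
                normalize(@ms, 1000000);        /* ns -> ms */
                printa("%@d bytes; %@d ms of phys; avg sz %@d\n",
                    @bytes, @ms, @avgsz);
                trunc(@bytes); trunc(@ms); trunc(@avgsz);
        }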

If you take the 8 MB case, the above rates correspond to the
time it takes to issue and wait for a single 8 MB I/O to the
sd driver. So this time certainly does include one seek and
~0.13 seconds of data transfer (8 MB at ~60 MB/s), then the
time to respond to the interrupt, and finally the wakeup of
the thread waiting in default_physio(). Given that the data
transfer rate using 4 MB is very close to the one using 8 MB,
I'd say that at 60 MB/s all the fixed-cost elements are well
amortized: a few milliseconds of seek against ~130 ms of
transfer is noise. So I would conclude from this that the
limiting factor is now the device itself, or the data channel
between the disk and the host.


Now recall the throughput that ZFS gets during a spa_sync when
driven by a single dd, knowing that ZFS will work with 128K
I/Os:

   1431 MB; 23723 ms of spa_sync; avg sz : 127 KB; throughput 60 MB/s
   1387 MB; 23044 ms of spa_sync; avg sz : 127 KB; throughput 60 MB/s
   2680 MB; 44209 ms of spa_sync; avg sz : 127 KB; throughput 60 MB/s
   1359 MB; 24223 ms of spa_sync; avg sz : 127 KB; throughput 56 MB/s
   1143 MB; 19183 ms of spa_sync; avg sz : 126 KB; throughput 59 MB/s
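
The spa_sync times above come from timing spa_sync directly;
the timing part can be sketched as below, assuming an fbt probe
on spa_sync (not necessarily the exact script behind the
numbers above):

        #!/usr/sbin/dtrace -qs

        fbt::spa_sync:entry
        {
                self->ts = timestamp;
        }

        /* Report how long each spa_sync took, in ms. */
        fbt::spa_sync:return
        /self->ts/
        {
                printf("%d ms of spa_sync\n",
                    (timestamp - self->ts) / 1000000);
                self->ts = 0;
        }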


My disk is

       <HITACHI-DK32EJ36NSUN36G-PQ08-33.92GB>.

As you say, we don't measure things the same way. At the dd
to raw level I think our data, with my mistake corrected,
will now be similar. At the ZFS level, we cannot use iostat
quite _yet_ because of

        6415647 Sequential writing is jumping

With iostat, the 1-second average will see, at times, some
periods in which we don't issue any I/O, so it's not a good
measure of the capacity of the disk. This is why I reverted
to my script, which times the I/O rate but only "when it
counts". When we fix 6415647, the expectation is that we
will sustain that throughput for as long as necessary. At
that point, I expect the throughput as seen from iostat and
the throughput from a ptime of dd itself will all converge.

And so, after a moment of doubt, I am still inclined to
believe that 128K I/Os, when issued properly, can lead to,
if not saturation, then very good throughput from a basic
disk.

Now, Anton's demonstration is convincing in its own way. I
concur that any seek time is unproductive and will degrade
throughput at the device level. But if the weak link is the
data transfer rate between the device and the host, then it
can be argued that the seek time can actually be hidden
behind some data transfer time. At 60 MB/s, a 128K data
transfer takes about 2 ms, which may be enough time to get
the head to the next block. My disk does reach > 450 IOPS
when controlled by ZFS (450 x 128 KB is roughly 56 MB/s), so
it all adds up.


Bear in mind also that throughput is not the only
consideration when setting the ZFS recordsize. The smaller
the record size, the more manageable the disk blocks are.
So everything is a tradeoff, and at this point 128K appears
sufficiently large ... at least for a while.
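
For anyone wanting to experiment, recordsize is an ordinary
per-dataset ZFS property (the dataset name below is just an
example):

        zfs set recordsize=128k tank/fs
        zfs get recordsize tank/fs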


-r

____________________________________________________________________________________
Roch Bourbonnais                        Sun Microsystems, Icnc-Grenoble 
Senior Performance Analyst              180, Avenue De L'Europe, 38330, 
                                        Montbonnot Saint Martin, France
Performance & Availability Engineering  
http://icncweb.france/~rbourbon         http://blogs.sun.com/roller/page/roch
[EMAIL PROTECTED]               (+33).4.76.18.83.20



New script to measure dd to raw throughput:

Attachment: phys.d
