Hello Roch,

Monday, May 15, 2006, 3:23:14 PM, you wrote:

RBPE> The question put forth is whether the ZFS 128K blocksize is sufficient
RBPE> to saturate a regular disk. There is a great body of evidence showing
RBPE> that bigger write sizes and a matching large FS cluster size lead
RBPE> to more throughput. The counterpoint is that ZFS schedules its I/O
RBPE> like nothing else seen before and manages to saturate a single disk
RBPE> using enough concurrent 128K I/Os.

Nevertheless, I get much more throughput using UFS and writing with
large blocks than using ZFS on the same disk, and the difference is
actually quite big in favor of UFS.


RBPE> <There are a few things I did here for the first time, so I may have
RBPE> erred in places. I am proposing this for review by the community.>

RBPE> I first measured the throughput of a write(2) to the raw device using,
RBPE> for instance, this:

RBPE>         dd if=/dev/zero of=/dev/rdsk/c1t1d0s0 bs=8192k count=1024

RBPE> On Solaris we would see some overhead from reading the block from
RBPE> /dev/zero and then issuing the write call. The tightest function that
RBPE> fences the I/O is default_physio(). That function issues the I/O to
RBPE> the device and then waits for it to complete. If we take the elapsed
RBPE> time spent in this function and count the bytes that are I/O-ed, this
RBPE> should give a good hint as to the throughput the device is
RBPE> providing. The above dd command will issue a single I/O at a time
RBPE> (d-script to measure is attached).
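
(For anyone following along without the attachment, here is a minimal
sketch of how such a default_physio() timing could look. It is not the
attached script; it assumes the physio(9F)-style prototype, i.e. that the
sixth argument is the uio, and it simply assumes each request completes
in full:)

#!/usr/sbin/dtrace -s

#pragma D option quiet

long long bytes;
long long ios;
long long elapsed;

/* time each default_physio() call in the thread that issues it */
fbt::default_physio:entry
{
        self->ts = timestamp;
        /* assumption: args[5] is the uio, as in physio(9F) */
        self->resid = args[5]->uio_resid;
}

fbt::default_physio:return
/self->ts/
{
        elapsed += timestamp - self->ts;
        bytes += self->resid;    /* assume the whole request was done */
        ios++;
        self->ts = 0;
        self->resid = 0;
}

/* dump and reset the running totals every 5 seconds */
tick-5s
/ios != 0 && elapsed > 1000000/
{
        printf("%d MB; %d ms of phys; avg sz : %d KB; throughput %d MB/s\n",
            bytes / 1048576, elapsed / 1000000, bytes / ios / 1024,
            (bytes / 1048576) * 1000000000 / elapsed);
        bytes = 0;
        ios = 0;
        elapsed = 0;
}

Run it in one terminal while the dd to the raw device runs in another.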

RBPE> Trying different blocksizes I see:

RBPE>    Bytes sent; elapsed time in phys I/O; avg I/O size; throughput:
RBPE>
RBPE>    8 MB;   3576 ms of phys; avg sz : 16 KB; throughput 2 MB/s
RBPE>    9 MB;   1861 ms of phys; avg sz : 32 KB; throughput 4 MB/s
RBPE>    31 MB;  3450 ms of phys; avg sz : 64 KB; throughput 8 MB/s
RBPE>    78 MB;  4932 ms of phys; avg sz : 128 KB; throughput 15 MB/s
RBPE>    124 MB; 4903 ms of phys; avg sz : 256 KB; throughput 25 MB/s
RBPE>    178 MB; 4868 ms of phys; avg sz : 512 KB; throughput 36 MB/s
RBPE>    226 MB; 4824 ms of phys; avg sz : 1024 KB; throughput 46 MB/s
RBPE>    226 MB; 4816 ms of phys; avg sz : 2048 KB; throughput 46 MB/s
RBPE>     32 MB;  686 ms of phys; avg sz : 4096 KB; throughput 46 MB/s
RBPE>    224 MB; 4741 ms of phys; avg sz : 8192 KB; throughput 47 MB/s

Just to be sure - did you reconfigure the system to actually allow such
large I/O sizes?
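
If I remember correctly the default maxphys is much smaller than 8 MB,
so to get single physical I/Os that large one normally raises it (and
the per-driver transfer limit) in /etc/system and reboots. Roughly the
lines below - variable names from memory, so treat them as a sketch and
check the ones for your HBA stack:

        * /etc/system - allow physical I/Os up to 8 MB (reboot needed)
        set maxphys=0x800000
        * per-driver transfer limit; ssd for FC disks, sd for SCSI
        set ssd:ssd_max_xfer_size=0x800000
        set sd:sd_max_xfer_size=0x800000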

RBPE> Now let's see what ZFS gets. I measure using a single dd process. ZFS
RBPE> will chunk up data in 128K blocks. Now the dd command interacts with
RBPE> memory, but the I/Os are scheduled under the control of spa_sync(). So
RBPE> in the d-script (attached) I check for the start of an spa_sync and
RBPE> time it based on elapsed time. At the same time I gather the number of
RBPE> bytes and keep a count of the I/Os (bdev_strategy) that are being
RBPE> issued. When the spa_sync completes we are sure that all of those are
RBPE> on stable storage. The script is a bit more complex because there are
RBPE> 2 threads that issue spa_sync, but only one of them actually becomes
RBPE> activated, so the script will print out some spurious lines of output
RBPE> at times. I measure I/O with the script while this runs:


RBPE>         dd if=/dev/zero of=/zfs2/roch/f1 bs=1024k count=8000

RBPE> And I see:

RBPE>    1431 MB; 23723 ms of spa_sync; avg sz : 127 KB; throughput 60 MB/s
RBPE>    1387 MB; 23044 ms of spa_sync; avg sz : 127 KB; throughput 60 MB/s
RBPE>    2680 MB; 44209 ms of spa_sync; avg sz : 127 KB; throughput 60 MB/s
RBPE>    1359 MB; 24223 ms of spa_sync; avg sz : 127 KB; throughput 56 MB/s
RBPE>    1143 MB; 19183 ms of spa_sync; avg sz : 126 KB; throughput 59 MB/s
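
(Again for anyone following without the attachment: a minimal sketch of
the spa_sync() timing - not the actual script, it ignores the duplicate
spa_sync thread mentioned above and just counts bdev_strategy() bytes
while a sync is in flight:)

#!/usr/sbin/dtrace -s

#pragma D option quiet

int insync;
long long bytes;
long long ios;

/* mark the start of a txg sync; I/Os from any thread are counted below */
fbt::spa_sync:entry
/insync == 0/
{
        insync = 1;
        self->ts = timestamp;
        bytes = 0;
        ios = 0;
}

/* every block I/O issued while the sync is in flight */
fbt::bdev_strategy:entry
/insync/
{
        bytes += args[0]->b_bcount;
        ios++;
}

/* when spa_sync returns, everything counted above is on stable storage */
fbt::spa_sync:return
/self->ts/
{
        this->ms = (timestamp - self->ts) / 1000000;
        printf("%d MB; %d ms of spa_sync; avg sz : %d KB; throughput %d MB/s\n",
            bytes / 1048576, this->ms,
            ios ? bytes / ios / 1024 : 0,
            this->ms ? bytes * 1000 / this->ms / 1048576 : 0);
        insync = 0;
        self->ts = 0;
}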


RBPE> OK, I cheated. Here, ZFS is given a full disk to play with. In this
RBPE> case ZFS enables the write cache. Note that even with the write cache
RBPE> enabled, when the spa_sync() completes, it will be after a flush of
RBPE> the cache has been executed. So the 60 MB/sec does correspond to data
RBPE> committed to the platter. I just tried disabling the cache (with
RBPE> format -e) but I am not sure whether that is taken into account by
RBPE> ZFS; the results are the same 60 MB/sec. This will have to be
RBPE> confirmed.
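
(For reference, the write cache toggle lives under the cache menu of
format -e - roughly this sequence, menu names from memory:)

        # format -e
        ... select the disk ...
        format> cache
        cache> write_cache
        write_cache> display
        write_cache> disable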

RBPE> With the write cache enabled, the physio test reaches 66 MB/s as soon
RBPE> as we are issuing 16 KB I/Os. Here, though, the data is clearly not on
RBPE> the platter when the timed function completes.

RBPE> Another variable not fully controlled is the physical (cylinder)
RBPE> location of the I/O. It could be that some of the differences come
RBPE> from that.

RBPE> What do I take away?

RBPE>         A single 2 MB physical I/O gets 46 MB/sec out of my disk.

RBPE>         35 sustained concurrent 128K I/Os, followed by metadata I/O,
RBPE>         followed by a flush of the write cache, allow ZFS to get 60
RBPE>         MB/sec out of the same disk.


RBPE> This is what underwrites my belief that the 128K blocksize is
RBPE> sufficiently large. Now, nothing here proves that 256K would not give
RBPE> more throughput, so nothing is really settled. But I hope this helps
RBPE> put us on common ground.

This is really interesting, because what I see here with a very similar
test is the opposite. What kind of disk do you use? (Mine is a 15K RPM
73 GB FC disk, connected to the host over dual paths with MPxIO.)

I use iostat to see the actual throughput and you use DTrace - maybe we
are measuring different things?
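
For reference, while the test runs I typically just watch something like:

        # iostat -xnz 1

and read the kw/s column (extended stats, descriptive device names,
non-zero lines only).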





-- 
Best regards,
 Robert                            mailto:[EMAIL PROTECTED]
                                       http://milek.blogspot.com

_______________________________________________
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
