Jeffrey W. Baker wrote:
> # zfs set recordsize=2K tank/bench
> # randomio bigfile 10 .25 .01 2048 60 1
>
>   total |  read:          latency (ms)      |  write:         latency (ms)
>    iops |   iops   min    avg    max   sdev |   iops   min    avg    max   sdev
> --------+-----------------------------------+----------------------------------
>   463.9 |  346.8   0.0   21.6  761.9   33.7 |  117.1   0.0   21.3  883.9   33.5
>
> Roughly the same as when the RS was 128K.  But, if I set the RS to 2K
> before creating bigfile:
>
>   total |  read:          latency (ms)      |  write:         latency (ms)
>    iops |   iops   min    avg    max   sdev |   iops   min    avg    max   sdev
> --------+-----------------------------------+----------------------------------
>   614.7 |  460.4   0.0   18.5  249.3   14.2 |  154.4   0.0    9.6  989.0   27.6
>
> Much better!  Yay!  So I assume you would always set RS=8K when using
> PostgreSQL, etc?
I presume these are something like Seagate DB35.3 series SATA 400 GByte
drives?  If so, the spec'ed average read seek time is < 11 ms and the
rotational speed is 7,200 rpm, so the theoretical peak random read rate
per drive is ~66 iops.
http://www.seagate.com/ww/v/index.jsp?vgnextoid=01117ea70fafd010VgnVCM100000dd04090aRCRD&locale=en-US#

For an 8-disk mirrored set, the max theoretical random read rate is 527
iops.  I see you're getting 460, so you're at 87% of theoretical.  Not bad.

When writing, the max theoretical rate is a little smaller because of the
longer seek time (see the datasheet), so we get ~62 iops per disk.  Also,
the total is cut in half because we have to write to both sides of the
mirror, so the peak is 248 iops.  You see 154, or 62% of peak.  Not quite
so good.

But there is another behaviour here which is peculiar to ZFS.  All writes
are COW and allocated from free space, and that allocation is done in
1 MByte chunks.  For 2 kByte I/Os, that means you need to reach a very
high write rate before the workload is spread out across all of the disks
simultaneously.  You should be able to see this if you watch iostat with a
small interval.  With an 8 kByte recordsize it is easier to spread the
wealth across all 4 mirrored pairs.  On other RAID systems you can vary
the stripe interlace, usually to much smaller values, to help spread the
wealth.  It is difficult to predict how this will affect your application
performance, though.

For simultaneous reads and writes, 614 iops is pretty decent, but it makes
me wonder whether the spread is much smaller than the full disk.  If the
application only does 8 kByte iops, then I wouldn't even bother doing
large, sequential workload testing... you'll never be able to approach
that limit before you run out of some other resource, usually CPU or the
controller.
 -- richard
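For anyone who wants to check the arithmetic, here is a minimal sketch in
Python.  The 11 ms read seek and 7,200 rpm are the figures discussed above;
the 12 ms write seek is my guess at the datasheet value (not quoted from
it), and the 1 MByte-chunk arithmetic is the simplified allocation model
described above, not the actual ZFS allocator.

# Back-of-the-envelope check of the iops estimates above, assuming
# 11 ms average read seek, ~12 ms average write seek (assumed), 7,200 rpm,
# and 8 drives arranged as 4 mirrored pairs.

RPM = 7200
rot_ms = 0.5 * 60000.0 / RPM              # average rotational delay, ~4.2 ms

def iops_per_disk(seek_ms):
    # one random I/O costs roughly an average seek plus half a revolution
    return 1000.0 / (seek_ms + rot_ms)

read_peak  = 8 * iops_per_disk(11.0)      # reads can hit all 8 spindles
write_peak = 4 * iops_per_disk(12.0)      # writes go to both sides of each mirror

print("peak read  iops ~%.0f, measured 460 -> %.0f%%" % (read_peak, 100 * 460 / read_peak))
print("peak write iops ~%.0f, measured 154 -> %.0f%%" % (write_peak, 100 * 154 / write_peak))

# Why small records need a high write rate to spread across the pairs:
# consecutive records land in the same ~1 MByte allocation chunk (and so on
# the same mirror pair) until that chunk fills.
for rs_kb in (2, 8):
    print("%d kByte recordsize: %d records per 1 MByte allocation chunk"
          % (rs_kb, 1024 // rs_kb))

This prints roughly 527 and 247 peak iops, and shows that a 2 kByte
recordsize packs 512 records into each 1 MByte allocation versus 128 for
8 kByte, which is why the small-record workload needs a much higher write
rate before all four pairs are busy.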