> RL> why is sum of disks bandwidth from `zpool iostat -v 1`
> RL> less than the pool total while watching `du /zfs`
> RL> on opensol-20060605 bits?
>
> Due to raid-z implementation.  See last discussion on raid-z
> performance, etc.
It's an artifact of the way raidz and the vdev read cache interact.

Currently, when you read a block from disk, we always read at least 64k.
We keep the result in a per-disk cache -- like a software track buffer.
The idea is that if you do several small reads in a row, only the first
one goes to disk.  For some workloads, this is a huge win.  For others,
it's a net lose.  More tuning is needed, certainly.

Both the good and the bad aspects of vdev caching are amplified by RAID-Z.
When you write a 2k block to a 5-disk raidz vdev, it will be stored as a
single 512-byte sector on each disk (4 data + 1 parity).  When you read it
back, we'll issue 4 reads (to the data disks); each of those will become a
64k cache-fill read, so you're reading a total of 4*64k = 256k to fetch a
2k block.  If that block is the first in a series, you're golden: the next
127 reads will be free (no disk I/O).  On the other hand, if it's an
isolated random read, we just did 128 times more I/O than was actually
useful.

This is a rather extreme case, but it's real.  I'm hoping that by making
the higher-level prefetch logic in ZFS a little smarter, we can eliminate
the need for vdev-level caching altogether.  If not, we'll need to make
the vdev cache policy smarter.  I've filed this bug to track the issue:

6437054 vdev_cache: wise up or die

Jeff
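[Editor's note: the amplification arithmetic above can be sketched in a few
lines of C.  This is a standalone illustration, not the actual vdev_cache
code; the constants (512-byte sectors, 64k cache fill, 5-disk raidz, 2k
block) are taken from the example in the message, and all names are
hypothetical.]

#include <stdio.h>

#define SECTOR_SIZE     512             /* bytes per disk sector */
#define VDEV_CACHE_FILL (64 * 1024)     /* minimum per-disk read (cache fill) */

int
main(void)
{
	int raidz_width = 5;                    /* 4 data + 1 parity */
	int data_disks  = raidz_width - 1;
	int block_size  = 2 * 1024;             /* 2k logical block */

	/* A 2k block spreads across the data disks: one sector each here. */
	int sectors_per_disk = block_size / data_disks / SECTOR_SIZE;
	int reads_issued     = data_disks;      /* one read per data disk */

	/* Each read is inflated to a full 64k cache-fill. */
	long bytes_read = (long)reads_issued * VDEV_CACHE_FILL;

	printf("sectors per data disk: %d\n", sectors_per_disk);
	printf("disk reads issued:     %d\n", reads_issued);
	printf("bytes read from disk:  %ld (%ldx the %d-byte block)\n",
	    bytes_read, bytes_read / block_size, block_size);
	return (0);
}

[This prints 4 reads and 262144 bytes for a 2048-byte block, i.e. the 128x
figure above.  In the sequential case, subsequent 2k blocks fall inside the
already-cached 64k ranges, which is why the next 127 reads cost no disk I/O.]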