> RL> why is the sum of the disks' bandwidth from `zpool iostat -v 1`
> RL> less than the pool total while watching `du /zfs`
> RL> on opensol-20060605 bits?
> 
> Due to the raid-z implementation. See the last discussion on raid-z
> performance, etc.

It's an artifact of the way raidz and the vdev read cache interact.

Currently, when you read a block from disk, we always read at least 64k.
We keep the result in a per-disk cache -- like a software track buffer.
The idea is that if you do several small reads in a row, only the first
one goes to disk.  For some workloads, this is a huge win.  For others,
it's a net loss.  More tuning is needed, certainly.
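
To make the caching behavior concrete, here is a toy model of a per-disk
read cache (this is not the actual vdev_cache code -- the class names and
the disk interface are invented for illustration, and it assumes no read
crosses a 64k boundary; only the 64k fill size comes from the description
above):

    FILL = 64 * 1024    # minimum read issued to the device

    class FakeDisk:
        """Stand-in for a real device; counts physical reads."""
        def __init__(self):
            self.reads = 0
        def read(self, offset, size):
            self.reads += 1
            return bytes(size)

    class DiskCache:
        """Per-disk cache: every miss fills a 64k-aligned region."""
        def __init__(self, disk):
            self.disk = disk
            self.lines = {}                # aligned offset -> cached data

        def read(self, offset, size):
            base = offset - (offset % FILL)
            if base not in self.lines:
                # Cache miss: read the whole 64k line from the device.
                self.lines[base] = self.disk.read(base, FILL)
            start = offset - base
            return self.lines[base][start:start + size]

    disk = FakeDisk()
    cache = DiskCache(disk)
    for i in range(128):                   # 128 sequential 512-byte reads
        cache.read(i * 512, 512)
    print(disk.reads)                      # 1 -- only the first read hit the disk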

Both the good and the bad aspects of vdev caching are amplified by
RAID-Z.  When you write a 2k block to a 5-disk raidz vdev, it will be
stored as a single 512-byte sector on each disk (4 data + 1 parity).
When you read it back, we'll issue 4 reads (to the data disks);
each of those will become a 64k cache-fill read, so you're reading
a total of 4*64k = 256k to fetch a 2k block.  If that block is the
first in a series, you're golden: the next 127 reads will be free
(no disk I/O).  On the other hand, if it's an isolated random read,
we just did 128 times more I/O than was actually useful.
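
Just to spell out the arithmetic from that example (plain numbers, not
ZFS code):

    block  = 2 * 1024           # 2k logical block
    sector = 512                # one data sector per disk
    data   = 4                  # 5-disk raidz: 4 data + 1 parity
    fill   = 64 * 1024          # each disk read is rounded up to 64k

    print(data * fill)          # 262144 bytes (256k) read from disk
    print(data * fill // block) # 128x more I/O than the 2k we wanted

    # The flip side: each 64k fill caches 128 sectors per disk, so
    # the next 127 sequential 2k blocks are satisfied from memory.
    print(fill // sector - 1)   # 127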

This is a rather extreme case, but it's real.  I'm hoping that by
making the higher-level prefetch logic in ZFS a little smarter,
we can eliminate the need for vdev-level caching altogether.
If not, we'll need to make the vdev cache policy smarter.

I've filed this bug to track the issue:

        6437054 vdev_cache: wise up or die

Jeff
