2012-10-12 11:11, andy thomas wrote:
> Great, thanks for the explanation! I didn't realise you could have a sort of 'stacked pyramid' vdev/pool structure.
Well, you can - the layers are "pool" - "top-level VDEVs" - "leaf VDEVs", though on trivial pools like single-disk ones the layers kinda merge into one or two :) This should be described in the manpage in greater detail.

So the pool stripes over top-level VDEVs (tlvdevs), roughly by round-robining whole logical blocks upon write, and then each tlvdev, depending on its redundancy configuration, forms the sectors to be written onto its component leaf vdevs (low-level disks, partitions or slices, LUNs, files, etc.). Since ZFS does not require full-stripe writes, smaller blocks can consume fewer sectors than there are leaves (disks) in a tlvdev, but this results neither in lost-space "holes" nor in read-modify-write cycles like on full-stripe RAID systems. If there is a free "hole" of contiguous logical addressing (roughly, striped across the leaf vdevs within the tlvdev) where the userdata sectors (after optional compression) plus the redundancy sectors fit, it will be used. I guess it is because of this contiguous addressing that a raidzN tlvdev cannot (currently) change its number of component disks, and a pool cannot decrease its number of tlvdevs.

If you add new tlvdevs to an existing pool, the ZFS allocation algorithms will try to put more load on the emptier tlvdevs and balance the writes, although according to discussions this can still lead to imbalance and performance problems on particular installations. In fact, you can (although it is not recommended, for balancing reasons) have tlvdevs of mixed size (like in Freddie's example) and even of different structure (e.g. mixing raidz and mirrors, or even single LUNs) by forcing the disk attachment (zpool add -f). Note however that the loss of any one tlvdev kills your whole pool, so don't stripe important data over single disks/LUNs ;) And since you have no control over what gets written where, you would also get an averaged performance mix of raidz and mirrors, with unpredictable performance for any particular userdata block.

HTH,
//Jim Klimov
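
P.S. To make the space-accounting point above more concrete, here is a rough
back-of-the-envelope sketch in Python - an illustration of the idea only, not
the actual ZFS code; the function name and the 512-byte sector size are my
own assumptions:

    import math

    def raidz_alloc_sectors(block_bytes, ndisks, nparity, sector=512):
        """Approximate sectors one block consumes on a raidzN tlvdev
        with `ndisks` leaves and `nparity` parity disks."""
        data = math.ceil(block_bytes / sector)          # userdata sectors
        rows = math.ceil(data / (ndisks - nparity))     # (partial) stripe rows used
        total = data + rows * nparity                   # add parity per row
        # round up to a multiple of (nparity + 1) so no unusable
        # single-sector holes are left behind
        mult = nparity + 1
        return math.ceil(total / mult) * mult

    # 5-disk raidz1, 128K block: 256 data + 64 parity sectors = 320
    print(raidz_alloc_sectors(128 * 1024, ndisks=5, nparity=1))   # -> 320
    # same tlvdev, 1K block (e.g. after compression): just 4 sectors,
    # i.e. fewer than the 5 leaves - no full-stripe requirement
    print(raidz_alloc_sectors(1 * 1024, ndisks=5, nparity=1))     # -> 4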
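
P.P.S. And a tiny sketch of the "put more load on emptier tlvdevs" behaviour -
again just the gist, not the real metaslab allocator; the Tlvdev class and the
numbers are made up for illustration:

    class Tlvdev:
        def __init__(self, name, size_gb, used_gb):
            self.name, self.size_gb, self.used_gb = name, size_gb, used_gb

        @property
        def free_gb(self):
            return self.size_gb - self.used_gb

    def pick_tlvdev(tlvdevs):
        """Naively send the next block to the tlvdev with the most free space."""
        return max(tlvdevs, key=lambda t: t.free_gb)

    pool = [Tlvdev("raidz1-0", 6000, 5500),   # old, nearly full
            Tlvdev("raidz1-1", 6000, 100)]    # freshly added, almost empty

    # New writes gravitate to the emptier tlvdev, so reads of recently
    # written data mostly hit one tlvdev until usage evens out - hence
    # the imbalance some installations complain about.
    for _ in range(3):
        t = pick_tlvdev(pool)
        t.used_gb += 1
        print("block ->", t.name)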