2012-05-18 19:08, Edward Ned Harvey wrote:
From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-
boun...@opensolaris.org] On Behalf Of Jim Klimov

I'm reading the ZFS on-disk spec, and I get the idea that there's an
uberblock pointing to a self-balancing tree (some say b-tree, some say
avl-tree, some say nv-tree), where data is only contained in the nodes.  But
I haven't found one particularly important detail yet:

On which values does the balancing tree balance?  Is it balancing on the
logical block address?  This would make sense, as an application requests to
read/write some logical block, making it easy and fast to find the
corresponding physical blocks...

My memory fails me here, so I can't give a precise answer...
I think the on-disk data within a raidzN top-level VDEV
(mirrors are trivial) is laid out as follows, taking an
arbitrary 6-disk raidz2 TLVDEV as an example:

D1   D2   D3   D4   D5   D6
Ar1  Ar2  Ad1  Ad2  Ad3  Ad4
Br1  Br2  Bd1  Cr1  Cr2  Cd1
Cd2  Cd3  Cd4  Cr3  Cr4  Cd5
Cd6  Dr1  Dr2  Dd1  ...

In the example above, several blocks are laid out across
sectors of different disks, redundancy (parity) sectors
included. Sequential access on one disk progresses down a
column; accesses within a row are parallelized across the
disks.

The "A" block userdata is 4 sectors long, with 2 redundancy blocks.
The "B" block has just one userdata sector, and the "C" block has
6 userdata sectors with a redundancy started for each 4 sectors.
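
To make this table easier to check, here is a tiny throwaway
program that reproduces it under the same guessed rules (parity
sectors are written before every group of (ndisks - nparity)
data sectors, sectors are allocated contiguously, and a sector
at offset N lands on disk N modulo ndisks). It is my own
simplification for illustration only, not actual ZFS code:

/* raidz_layout_sketch.c - my own illustrative simplification,
 * NOT actual ZFS code.  Prints where each sector of a block
 * would land under the layout guessed above. */
#include <stdio.h>

static void lay_out_block(char name, int start, int ndata,
                          int ndisks, int nparity)
{
    int off = start;              /* running sector offset in the TLVDEV */
    int group = ndisks - nparity; /* data sectors covered per parity set */
    int d = 0, p = 0, i;

    while (d < ndata) {
        /* a fresh set of parity sectors for each group of data sectors */
        for (i = 0; i < nparity; i++, off++)
            printf("%cr%d -> disk D%d, row %d\n",
                   name, ++p, off % ndisks + 1, off / ndisks + 1);
        for (i = 0; i < group && d < ndata; i++, off++)
            printf("%cd%d -> disk D%d, row %d\n",
                   name, ++d, off % ndisks + 1, off / ndisks + 1);
    }
}

int main(void)
{
    /* 6-disk raidz2; blocks A (4 data), B (1 data), C (6 data) as above */
    lay_out_block('A', 0, 4, 6, 2);   /* fills row 1 exactly            */
    lay_out_block('B', 6, 1, 6, 2);   /* Br1 Br2 Bd1 at start of row 2  */
    lay_out_block('C', 9, 6, 6, 2);   /* continues row 2, spills onward */
    return 0;
}

Running it prints exactly the A/B/C placement from the table
above, and the next free offset (19) is D2 of row 4, where the
"D" block starts.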

AFAIK each ZFS block resides entirely within one TLVDEV (and
ditto copies lead their own separate lives in other TLVDEVs if
available), and striping over several TLVDEVs occurs at the
whole-block level. This, in particular, allows unbalanced pools
with TLVDEVs of different sizes and layouts.

IF this picture is correct (confirmation or correction is
kindly requested), then:

1) DVA to LBA translation should be fairly trivial, since
   the DVA is defined as "ID(tlvdev):offset:length" in 512-byte
   units (regardless of the pool's ashift value). I did not
   test this in practice or infer it from the code, though;
   a small sketch of this arithmetic follows at the end of
   this item.

   I don't know if there are any gaps to take into account
   (i.e. maybe between "metaslabs", of which there are supposed
   to be about 200 per vdev (or tlvdev, or pool?) in order to
   limit seeking between data written at roughly the same time).
   Even if there are gaps (i.e. to round allocations to on-disk
   tracks or to offsets at multiples of a given number), I'd not
   complicate things and would just leave the gaps as addressable
   but unreferenced free space.

   A poster on the list recently referenced "slabs", a term I
   don't think I had seen before - I guess it stands for the
   total allocation needed for a userdata block?
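
   For what it's worth, here is a minimal sketch of the
   arithmetic I mean in 1), assuming that the offset field
   really does count 512-byte units and that the allocatable
   area starts 4 MiB into each leaf (after the two front labels
   and the boot block area); both are assumptions I have not
   verified against the code:

/* dva_to_lba_sketch.c - a minimal sketch of the arithmetic in
 * point 1), under the assumptions stated above; not taken from
 * the actual ZFS code. */
#include <stdio.h>
#include <stdint.h>

#define SECTOR_SHIFT     9             /* DVA offsets assumed in 512-byte units */
#define LABEL_SKIP_BYTES (4ULL << 20)  /* assumed 4 MiB of front labels + boot  */

/* Byte offset within the top-level vdev for a given DVA offset field. */
static uint64_t dva_offset_to_bytes(uint64_t dva_offset_sectors)
{
    return (dva_offset_sectors << SECTOR_SHIFT) + LABEL_SKIP_BYTES;
}

/* 512-byte LBA on a single-disk or mirror leaf vdev (raidz would
 * further split this offset across its children). */
static uint64_t dva_offset_to_lba(uint64_t dva_offset_sectors)
{
    return dva_offset_to_bytes(dva_offset_sectors) >> SECTOR_SHIFT;
}

int main(void)
{
    uint64_t off = 0x2000;   /* hypothetical DVA offset field value */
    printf("DVA offset %llu -> byte %llu, LBA %llu on the leaf\n",
           (unsigned long long)off,
           (unsigned long long)dva_offset_to_bytes(off),
           (unsigned long long)dva_offset_to_lba(off));
    return 0;
}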

2) Addressing of blocks (or, conversely, saying that certain
   sectors belong to a particular block or are available) is
   impossible without knowing the (generally whole) blockpointer
   tree, and depending on the sizes of (re-)written objects, the
   same sector can at different times in its life belong to
   blocks (slabs?) of different lengths starting at different
   DVA offsets...

   Indeed, we also can not assume that sectors read in from the
   disks contain a valid part of the blockpointer tree (even if
   they match some magic number), not until we find a path
   through the known tree that leads to this block (I discussed
   this in my other post regarding vdev prefetch and defrag).
   However, since reads are free as long as the HDD head is in
   the right location, and if blkptr_t's leading to one another
   are colocated on the disk, clever use of the prefetch and
   timely inspection of the prefetch cache can hopefully boost
   the speed of walking the BP tree (see the sketch after this
   item).

   MAYBE I am wrong about this and there is also an allocation
   map in the large metaslabs or something? (I know there is
   some cleverness about finding available locations to write
   into, but I'm not ready to speak about it off the top of my
   head.)

   I am not sure whether this gives a clue to the question of
   "balancing on the logical block address", though :) AFAIK the
   balancing tries to keep the maximum tree depth as short as
   possible, yet there is one root block and no rewriting of
   existing unchanged stale blocks (tree nodes). I am puzzled
   too :)
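
   To illustrate what I mean by "timely inspection of the
   prefetch cache", here is a purely conceptual toy: it walks a
   made-up tree and hints every child's DVA before descending
   into any of them, so that a prefetcher would have something
   to chew on. Every type and function name here is hypothetical
   (a stand-in, not the real ZFS API); it only shows the shape
   of the idea:

/* bptree_walk_sketch.c - a conceptual toy for point 2).  All
 * types and functions are hypothetical stand-ins, not the real
 * ZFS API. */
#include <stdio.h>
#include <stdint.h>

struct blkptr {                 /* stand-in for blkptr_t            */
    uint64_t dva_offset;        /* where the pointed-to block lives */
    int      nchildren;         /* 0 for a leaf (userdata) block    */
    struct blkptr *children;    /* indirect block contents, if read */
};

/* Pretend to hint the vdev prefetch cache about an upcoming read. */
static void prefetch_hint(uint64_t dva_offset)
{
    printf("prefetch hint: DVA offset %llu\n",
           (unsigned long long)dva_offset);
}

/* Hint all children of a node before descending into any of them,
 * so that by the time each child is actually needed its sectors
 * may already sit in the prefetch cache. */
static void walk(const struct blkptr *bp)
{
    int i;
    for (i = 0; i < bp->nchildren; i++)
        prefetch_hint(bp->children[i].dva_offset);
    for (i = 0; i < bp->nchildren; i++) {
        printf("visiting block at DVA offset %llu\n",
               (unsigned long long)bp->children[i].dva_offset);
        walk(&bp->children[i]);
    }
}

int main(void)
{
    /* a toy two-level tree: one indirect block with two leaves */
    struct blkptr leaves[2] = { { 100, 0, NULL }, { 108, 0, NULL } };
    struct blkptr root = { 0, 2, leaves };
    walk(&root);
    return 0;
}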


3) The layout is fixed at TLVDEV creation time by its total
   number of disks, since that directly determines the
   calculation of "which disk does an offset'ed sector belong
   to": it is the offset modulo the number of disks for raidzN
   regardless of N (because of not-full stripes), and simply 0
   for single drives and mirrors. This is why resizing a raidz
   set is indeed hard, while converting single disks to mirrors
   and back is easy (see the sketch after this item).

   To a lesser extent the layout is limited by the vdev size
   (which can be increased easily, but can not be decreased
   without reallocation and BP rewrite [*1]), and somewhat by
   the number of redundancy disks, which influences individual
   blocks' on-disk representation and required length [*2].
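
   And a two-line illustration of the modulo argument in 3),
   again my own simplification rather than the real mapping
   code:

/* which_disk_sketch.c - illustrates point 3): under the layout
 * guessed above, the owning child disk is just a modulo of the
 * sector offset, so it is baked in by the disk count at TLVDEV
 * creation time. */
#include <stdio.h>
#include <stdint.h>

static unsigned which_disk(uint64_t sector_offset, unsigned ndisks)
{
    return (unsigned)(sector_offset % ndisks);   /* 0-based child index */
}

int main(void)
{
    uint64_t off = 14;                /* Cd4 in the 6-disk example above */
    printf("6 disks: sector %llu is on disk D%u\n",
           (unsigned long long)off, which_disk(off, 6) + 1);
    /* add a 7th disk and every existing mapping silently changes: */
    printf("7 disks: sector %llu would be on disk D%u\n",
           (unsigned long long)off, which_disk(off, 7) + 1);
    return 0;
}

Note how adding a seventh disk silently changes where sector 14
"belongs", which is exactly why resizing a raidz set needs more
than just attaching a disk.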

[*1]: This might be doable relatively easily by limiting the
   topmost writeable address and executing a routine similar to
   zfs-send and zfs-recv to relocate all blocks with a larger
   DVA offset on this TLVDEV to any accessible location in the
   pool. When no more referenced blocks remain above the
   watermark, the TLVDEV can be shrunk. This may also involve
   some magic with the TXG "birth" and "alloc" fields in
   blkptr's.

[*2]: As we know, Oracle ZFS has hybrid allocation, which in
   particular allows mirrored writes for metadata and raidz
   writes for userdata to coexist in a pool. I can only guess
   there is some new bit-flag in the blkptr_t for that? Anyhow,
   it seems that the number of redundancy disks and the layout
   algorithm can vary from block to block...

Thanks,
//Jim Klimov
_______________________________________________
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
