2012-05-18 19:08, Edward Ned Harvey wrote:
> From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-
> boun...@opensolaris.org] On Behalf Of Jim Klimov
>
> I'm reading the ZFS on-disk spec, and I get the idea that there's an
> uberblock pointing to a self-balancing tree (some say b-tree, some say
> avl-tree, some say nv-tree), where data is only contained in the nodes.
> But I haven't found one particular important detail yet:
> On which values does the balancing tree balance? Is it balancing on the
> logical block address? This would make sense, as an application requests
> to read/write some logical block, making it easy and fast to find the
> corresponding physical blocks...
My memory fails me here for a precise answer... I think that
the on-disk data within a raidzN top-level VDEV (mirrors are
trivial) is laid out as follows, for an example 6-disk raidz2
TLVDEV:
  D1   D2   D3   D4   D5   D6
  Ar1  Ar2  Ad1  Ad2  Ad3  Ad4
  Br1  Br2  Bd1  Cr1  Cr2  Cd1
  Cd2  Cd3  Cd4  Cr3  Cr4  Cd5
  Cd6  Dr1  Dr2  Dd1  ...
In the example above, several blocks are laid out across the
sectors of different disks, including their redundancy (parity)
sectors. Sequential access on one disk progresses down a column
from top to bottom, while accesses within a row are parallelized
across many disks.
The "A" block's userdata is 4 sectors long, plus 2 parity
sectors. The "B" block has just one userdata sector (still with
2 parity sectors), and the "C" block has 6 userdata sectors,
with a new set of parity sectors started for every 4 data
sectors.
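
To illustrate the model (and nothing more - this is just a Python
sketch of my mental picture, not the actual ZFS allocator; it
ignores any padding or rounding the real code may do), the
following reproduces the table above from the block sizes:

    # Sketch: lay sectors of several blocks out over a raidzN TLVDEV,
    # assuming consecutive TLVDEV-wide offsets map to disks round-robin
    # (disk = offset % NDISKS, row = offset // NDISKS) and each group of
    # up to (NDISKS - NPARITY) data sectors is preceded by NPARITY
    # parity sectors. No padding or gaps are modeled.

    NDISKS = 6
    NPARITY = 2                        # raidz2
    DATA_PER_GROUP = NDISKS - NPARITY

    def block_sectors(label, ndata):
        """Sector labels of one block, in allocation order."""
        out, d, p = [], 0, 0
        while d < ndata:
            for _ in range(NPARITY):
                p += 1
                out.append("%sr%d" % (label, p))
            for _ in range(min(DATA_PER_GROUP, ndata - d)):
                d += 1
                out.append("%sd%d" % (label, d))
        return out

    def layout(blocks):
        """Place blocks at consecutive offsets; return rows of the table."""
        sectors = []
        for label, ndata in blocks:
            sectors.extend(block_sectors(label, ndata))
        rows = []
        for off, s in enumerate(sectors):
            disk, row = off % NDISKS, off // NDISKS
            while len(rows) <= row:
                rows.append(["..."] * NDISKS)
            rows[row][disk] = s
        return rows

    print(" ".join("%-4s" % ("D%d" % (i + 1)) for i in range(NDISKS)))
    for row in layout([("A", 4), ("B", 1), ("C", 6), ("D", 1)]):
        print(" ".join("%-4s" % s for s in row))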
AFAIK each ZFS block resides fully within one TLVDEV (and ditto
copies have their own separate life in another TLVDEV, if one is
available), and striping over several TLVDEVs occurs at the
whole-block level. This, in particular, allows unbalanced pools
with TLVDEVs of different sizes and layouts.
IF this picture is correct (confirmation or correction is
kindly requested), then:
1) DVA-to-LBA translation should be fairly trivial, since
the DVA is defined as "ID(tlvdev):offset:length" in 512-byte
units (regardless of the ashift value of the pool). I did not
test this in practice or infer it from the code, though.
I don't know if there are any gaps to take into account
(e.g. between "metaslabs", of which there are supposed to be
about 200 per vdev (or TLVDEV, or pool?) in order to limit
seeking between data written at roughly the same time).
Even if there are gaps (e.g. to round allocations to on-disk
tracks or to offsets at multiples of a given number), I'd not
complicate things and would just leave the gaps as addressable
but unreferenced free space.
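
If that holds, the translation I have in mind boils down to a
modulo and an integer division (again only a sketch of my
assumption; the per-disk reserve for the labels/boot area and the
absence of metaslab gaps are guesses, not values I looked up):

    SECT = 512                        # DVA offsets are in 512-byte units
    NDISKS = 6                        # width of the example raidz2 TLVDEV
    LABEL_RESERVE = 4 * 1024 * 1024   # assumed per-disk area (labels, boot
                                      # block) that allocations never touch

    def dva_to_disk_lba(sector_index):
        """Map a TLVDEV-wide sector index (DVA offset in 512-byte units)
        to (child disk number, byte offset on that disk)."""
        disk = sector_index % NDISKS          # which column/disk
        row = sector_index // NDISKS          # how far down that disk
        return disk, LABEL_RESERVE + row * SECT

    # The first sector of block "C" above is at TLVDEV sector 9,
    # i.e. on D4 (index 3), one row into the data area:
    print(dva_to_disk_lba(9))                 # -> (3, 4194816)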
A poster on the list recently used the term "slabs"; I don't
think I had seen it before - but I guess it stands for the total
allocation needed for a userdata block?
2) Addressing of blocks (or the reverse - saying that certain
sectors belong to a particular block, or are free) is impossible
without knowing the (generally whole) blockpointer tree, and,
depending on the sizes of (re-)written objects, the same sector
can at different times in its life belong to blocks (slabs?) of
different lengths, starting at different DVA offsets...
Indeed, we also cannot assume that sectors read in from the
disks contain a valid part of the blockpointer tree (even if
they match some magic number), not until we find a path through
the known tree that leads to this block (I discussed this in my
other post regarding vdev prefetch and defrag).
However, since reads are nearly free as long as the HDD head is
already in the right location, and if blkptr_t's leading to one
another are colocated on the disk, clever use of the prefetch
and timely inspection of the prefetch cache can hopefully boost
the BP-tree walking speed.
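
Something like this walk is what I have in mind - a rough Python
sketch where read_dva(), checksum_ok() and child_blkptrs() are
hypothetical stand-ins for the real I/O, checksumming and
indirect-block parsing; the point is only that data is trusted
solely because a verified parent pointed to it:

    from collections import deque

    def walk_bptree(root_blkptr, read_dva, checksum_ok, child_blkptrs):
        """Breadth-first walk of the blockpointer tree: a sector run is
        accepted as tree data only after the checksum recorded in its
        parent blkptr has vouched for it."""
        queue = deque([root_blkptr])
        while queue:
            bp = queue.popleft()
            data = read_dva(bp)            # ideally served from prefetch cache
            if not checksum_ok(bp, data):  # stale/foreign sectors are skipped;
                continue                   # magic numbers alone prove nothing
            yield bp, data
            queue.extend(child_blkptrs(bp, data))  # indirects yield more bps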
MAYBE I am wrong in this and there is also an allocation map
within the large metaslabs or something? (I know there is some
cleverness about finding available locations to write into, but
I'm not ready to speak about it off the top of my head.)
I am not sure whether this gives a clue to the "balancing on
the logical block address?" question, though :) AFAIK the
balancing tries to keep the maximum tree depth as short as
possible, yet there is a single root block and existing
unchanged stale blocks (tree nodes) are never rewritten. I am
puzzled too :)
3) The layout is fixed at TLVDEV creation time by its total
number of disks, since that directly affects the calculation of
"on which disk does an offset'ed sector belong" - it would be
the offset modulo the number of disks for raidzN regardless of
N (because of non-full stripes), and simply 0 for single drives
and mirrors. This is why resizing a raidz set is indeed hard,
while converting single disks to mirrors and back is easy.
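
To restate why raidz resizing is hard in terms of the sketch
above: with disk = offset % ndisks, the very same TLVDEV-wide
offset lands on a different child as soon as the column count
changes, so every existing DVA would point at the wrong place:

    def column_of(sector_index, ndisks):
        return sector_index % ndisks

    for off in range(6, 12):
        print("offset %2d -> disk %d with 6 disks, disk %d with 7 disks"
              % (off, column_of(off, 6), column_of(off, 7)))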
To a lesser extent the layout is limited by the vdev size
(which can be increased easily, but cannot be decreased without
reallocation and BP rewrite [*1]), and somewhat by the number
of redundancy disks, which influences individual blocks' on-disk
representation and required length [*2].
[*1]: This might be doable relatively easily by limiting the
top writeable address and executing a routine similar to
zfs send/recv to relocate all blocks with a larger DVA offset
on this TLVDEV to any accessible location in the pool. When no
referenced blocks remain above the watermark, the TLVDEV can be
shrunk. This may involve some magic with the TXG "birth" and
"alloc" fields in blkptr's as well.
[*2]: As we know, in Oracle ZFS there is hybrid allocation,
which in particular allows mirrored writes for metadata and
raidz writes for userdata to coexist in one pool. I can only
guess there is some new bit-flag in the blkptr_t for that?
Anyhow, the number of redundancy disks and the layout algorithm
for a particular block can apparently be variable...
Thanks,
//Jim Klimov
_______________________________________________
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss