On 2012-12-20 18:25, sol wrote:
Hi

I know some of this has been discussed in the past but I can't quite
find the exact information I'm seeking
(and I'd check the ZFS wikis but the websites are down at the moment).

Firstly, which is correct, free space shown by "zfs list" or by "zpool
iostat" ? (...)
(That's a big difference, and the percentage doesn't agree)

I believe zpool iostat (and zpool list) report raw storage accounting -
basically, the number of HDD sectors available and consumed, including
redundancy and metadata (so the available space also counts the
not-yet-used redundancy overhead), as well as the reserved space (such
as the 1/64 of pool size kept for system use - partly to counter the
performance degradation on nearly-full pools discussed below).

zfs list displays user-data accounting - what is available after
redundancy and system reservations, and in general subject to the
"(ref)reservation" and "(ref)quota" settings on datasets in the pool.
When clones and dedup come into play, as well as compression, this
accounting becomes tricky. Overall, there is one number you can trust:
the used space of a dataset says how much user data (including
directory structures, and measured after compression) is referenced by
this filesystem - the end-user value of your service, if you limit or
bill by consumption. This does not mean that only this filesystem
references those blocks, though. The other numbers are more vague
(e.g. with good dedup+compression ratios you can sum up the used
spaces to much more than the raw pool size).
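
If you want to see the two accountings side by side, you can compare
the pool-level and dataset-level reports. A rough illustration only -
"tank" is a placeholder pool name and the exact columns vary between
releases:

# zpool list tank
# zpool iostat -v tank
# zfs list -r tank

The first two show the raw SIZE/ALLOC/FREE (iostat -v breaks it down
per top-level vdev); zfs list shows USED/AVAIL as seen by users, after
redundancy and reservations. On a raidzN pool the zpool FREE figure is
noticeably larger than the sum of the zfs AVAIL values, because it
still counts the parity sectors that future writes will consume.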


Secondly, there's 8 vdevs each of 11 disks.
6 vdevs show used 8.19 TB, free 1.81 TB, free = 18.1%
2 vdevs show used 6.39 TB, free 3.61 TB, free = 36.1%

How did you look that up? ;)
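
(I'd guess at something like "zpool iostat -v", which prints the
allocated and free space for every top-level vdev and its member
disks; "tank" below is a placeholder pool name:

# zpool iostat -v tank

The free percentage is then free/(alloc+free) for each vdev.)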


I've heard that
a) performance degrades when free space is below a certain amount

Basically, the "mechanics" of the degradation is that ZFS writes new
data into available space "bubbles" within a range called "metaslab".
It tries to make sequential writes to do stuff faster. If your pool
has seen lots of writes and deletions, its free spaces may have become
fragmented, so search for the "bubbles" takes longer, and they are too
small to fit the whole incoming transaction - leading to more HDD seeks
and thus more latency on write. In extreme, ZFS can't even find holes
big enough for a block, so it splits the block data into several pieces
and writes "gang blocks", using many tiny IOs with many mechanical HDD
seeks.

The numbers - how full a pool must be before it shows these problems -
are highly individual. Some pools saw it after filling to 60%, typical
is 80-90%, and on write-only pools you might never see the problem
because you don't delete anything (well, except maybe metadata during
updates, all of which usually amounts to 1-3% of total allocation).
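
If you want to gauge how fragmented the free space actually is, zdb
can dump per-metaslab statistics. This is read-only inspection and
only a suggestion - the exact output differs between releases, and
"tank" is again a placeholder pool name:

# zdb -m tank

shows the allocated and free space of each metaslab (repeating the
flag, "zdb -mm tank", also dumps the space maps and is very verbose).
Lots of nearly-full metaslabs, or free space scattered over many small
segments, means new writes will have to hunt for their "bubbles".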

b) data is written to different vdevs depending on free space

There are several rules which influence the choice of a top-level
vdev and of a metaslab region inside it; they probably include free
space, the known presence of large "bubbles" to write into, and the
location on the disk (faster outer vs. slower inner LBA tracks).
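
You can watch the effect of these rules live if you like - during a
large write, see how the bandwidth spreads across the top-level vdevs
(again "tank" is a placeholder; the trailing number is the sampling
interval in seconds):

# zpool iostat -v tank 5

If the two emptier vdevs receive a disproportionate share of the
write bandwidth, that matches the free-space preference described
above.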


So a) how do I determine the exact value when performance degrades and
how significant is it?
b) has that threshold been reached (or exceeded?) in the first six vdevs?
and if so are the two emptier vdevs being used exclusively to prevent
performance degrading
so it will only degrade when all vdevs reach the magic 18.1% free (or
whatever it is)?

Hopefully, this was answered above :)

Presumably there's no way to identify which files are on which vdevs in
order to delete them and recover the performance?

It is possible, though not simple, and it is not guaranteed to get
the result you want (but there is little harm in trying).

You can use "zdb" to extract the information about an inode in a
dataset as a listing of the block pointer entries which form the tree
of blocks for that file.

For example:
# ls -lani /lib/libnsl.so.1
9239 -rwxr-xr-x   1 0    2     649720 Jun  8  2012 /lib/libnsl.so.1

# df -k /lib/libnsl.so.1
Filesystem            kbytes    used   avail capacity  Mounted on
rpool/ROOT/oi_151a4  61415424  452128 24120824     2%    /

Here the first number from "ls -i" gives us the inode number of the
file, and "df" confirms the dataset name. So we can walk it with zdb:

# zdb -ddddd -bbbbbb rpool/ROOT/oi_151a4 9239
Dataset rpool/ROOT/oi_151a4 [ZPL], ID 5299, cr_txg 1349648, 442M,
8213 objects, rootbp DVA[0]=<0:a6921d600:200> DVA[1]=<0:2ffc7b400:200>
[L0 DMU objset] fletcher4 lzjb LE contiguous unique double
size=800L/200P birth=4682209L/4682209P fill=8213
cksum=16f122cb05:77d20eea7b8:155c69ed5a6ce:2b90104e19641f

    Object  lvl   iblk   dblk  dsize  lsize   %full  type
      9239    2    16K   128K   642K   640K  100.00  ZFS plain file
                                        168   bonus  System attributes
        dnode flags: USED_BYTES USERUSED_ACCOUNTED
        dnode maxblkid: 4
        path    /lib/libnsl.so.1
        uid     0
        gid     2
        atime   Fri Jun  8 00:22:17 2012
        mtime   Fri Jun  8 00:22:17 2012
        ctime   Fri Jun  8 00:22:17 2012
        crtime  Fri Jun  8 00:22:17 2012
        gen     1349746
        mode    100755
        size    649720
        parent  25
        links   1
        pflags  40800000104
Indirect blocks:
               0 L1  DVA[0]=<0:940298000:400> DVA[1]=<0:263234a00:400>
[L1 ZFS plain file] fletcher4 lzjb LE contiguous unique double
size=4000L/400P birth=1349746L/1349746P fill=5
cksum=682d4fda0b:3cc1aa306094:13ebb22837cf14:4c5c67e522dbca8

               0  L0 DVA[0]=<0:95f337000:20000> [L0 ZFS plain file]
fletcher4 uncompressed LE contiguous unique single size=20000L/20000P
birth=1349746L/1349746P fill=1
cksum=23fce6aa160b:5ab11e5fcbc6c2e:5b38f230e01d508d:12cf92941e4b2487

           20000  L0 DVA[0]=<0:95f357000:20000> [L0 ZFS plain file]
fletcher4 uncompressed LE contiguous unique single size=20000L/20000P
birth=1349746L/1349746P fill=1
cksum=3f0ac207affd:f8ed413113d6bdd:24e36c7682cfc297:2549c866ab61e464

           40000  L0 DVA[0]=<0:95f377000:20000> [L0 ZFS plain file]
fletcher4 uncompressed LE contiguous unique single size=20000L/20000P
birth=1349746L/1349746P fill=1
cksum=3d40bf3329f0:f459bc876303dd7:2230ee348b7b08c5:3a65d1ebbf52c9dc

           60000  L0 DVA[0]=<0:95f397000:20000> [L0 ZFS plain file]
fletcher4 uncompressed LE contiguous unique single size=20000L/20000P
birth=1349746L/1349746P fill=1
cksum=19e01b53eb67:956b52d1df6ecd4:38ff9bd1302bf879:e4661798dd1ae8a0

           80000  L0 DVA[0]=<0:95f3b7000:20000> [L0 ZFS plain file]
fletcher4 uncompressed LE contiguous unique single size=20000L/20000P
birth=1349746L/1349746P fill=1
cksum=361e6fd03d40:d0903e491fa09e9:7a2e453ed28baa92:28562c53af3c0495

                segment [0000000000000000, 00000000000a0000) size  640K

After several higher layers of pointers (just L1 in the example
above), you get the "L0" entries, whose DVA fields point to the
actual data blocks.

The example file above fits into five 128K blocks at level L0.

The first component of a DVA address is the top-level vdev ID,
followed by the offset and the allocation size (which includes raidzN
redundancy). Note, however, that depending on your pool's history,
larger files may have been striped over several top-level vdevs, so
relocating a file (copying it over and deleting the original) may or
may not free up a particular vdev - upon rewrite the data will be
striped again, although ZFS may make different decisions for the new
write and prefer the emptier devices.
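
To get a quick tally of which top-level vdevs hold a file's data, you
can parse the L0 lines out of the zdb listing. A rough sketch only -
it assumes the listing looks like the (unwrapped) example above and
counts just the first DVA copy of each block:

# zdb -ddddd -bbbbbb rpool/ROOT/oi_151a4 9239 | \
    grep ' L0 DVA' | \
    sed 's/.*DVA\[0\]=<\([0-9a-f]*\):.*/\1/' | \
    sort | uniq -c

Each output line is a count of L0 blocks followed by the vdev ID; for
the single-vdev rpool example above it would just report 5 blocks on
vdev 0.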

Also, if the file's blocks are referenced via snapshots, clones,
dedup or hardlinks, they won't actually be released when you delete
a particular copy of the file.

HTH,
//Jim Klimov
