On 2012-12-02 05:42, Jim Klimov wrote:
> So... here are some applied questions:
Well, I am ready to reply to a few of my own questions now :)

I've staged an experiment by taking a 128KB block from that file and
appending it to a new file in a test dataset, changing the compression
settings between the appendages. Thus I've got a ZDB dump of three
blocks with identical logical userdata and different physical data:

# zdb -ddddd -bbbbbb -e 1601233584937321596/test3 8 > /pool/test3/a.zdb
...
Indirect blocks:
     0 L1  DVA[0]=<0:59492a98000:3000> DVA[1]=<0:83e2f65000:3000>
        [L1 ZFS plain file] sha256 lzjb LE contiguous unique double
        size=4000L/400P birth=326381727L/326381727P fill=3
        cksum=2ebbfb189e7ce003:166a23fd39d583ed:f527884977645395:896a967526ea9cea
     0  L0 DVA[0]=<0:590002c1000:30000> [L0 ZFS plain file] sha256
        uncompressed LE contiguous unique single size=20000L/20000P
        birth=326381721L/326381721P fill=1
        cksum=3c691e8fc86de2ea:90a0b76f0d1fe3ff:46e055c32dfd116d:f2af276f0a6a96b9
 20000  L0 DVA[0]=<0:594928b8000:9000> [L0 ZFS plain file] sha256 lzjb
        LE contiguous unique single size=20000L/4800P
        birth=326381724L/326381724P fill=1
        cksum=57164faa0c1cbef4:23348aa9722f47d3:3b1b480dc731610b:7f62fce0cc18876f
 40000  L0 DVA[0]=<0:59492a92000:6000> [L0 ZFS plain file] sha256 gzip-9
        LE contiguous unique single size=20000L/2800P
        birth=326381727L/326381727P fill=1
        cksum=d68246ee846944c6:70e28f6c52e0c6ba:ea8f94fc93f8dbfd:c22ad491c1e78530

        segment [0000000000000000, 0000000000080000) size 512K

> 1) So... how DO I properly interpret this to select sector ranges to
> DD into my test area from each of the 6 disks in the raidz2 set?
>
> On one hand, the DVA states the block length is 0x9000, and this
> matches the offsets of neighboring blocks.
>
> On the other hand, compressed "physical" data size is 0x4c00 for
> this block, and ranges 0x4800-0x5000 for other blocks of the file.
> Even multiplied by 1.5 (for raidz2) this is about 0x7000 and way
> smaller than 0x9000. For uncompressed files I think I saw entries
> like "size=20000L/30000P", so I'm not sure even my multiplication
> by 1.5x above is valid, and the discrepancy between DVA size and
> interval, and "physical" allocation size reaches about 2x.

Apparently, my memory failed me. The values in the "size" field refer
to the userdata only (compressed, non-redundant). I had also forgotten
that this pool uses 4KB sectors (ashift=12).

So my userdata, which takes up about 0x4800 bytes, needs 4.5 (rounded
up: 5 whole) sectors, and that warrants 4 sectors of raidz2 redundancy
on a 6-disk set: 2 parity sectors for the first 4 data sectors, and
2 more for the remaining half-sector's worth of data. This does sum up
to 9*0x1000 bytes in whole-sector counting (as in the offsets).

However, the gzip-compressed block above, which has only 0x2800 bytes
of userdata and thus requires 3 data sectors plus 2 redundancy sectors,
still has a DVA size of six 4KB sectors (0x6000). This is strange to
me - I'd expect 5 sectors for this block altogether... does anyone
have an explanation? Also, what should the extra userdata sector
contain physically - zeroes?
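For my own bookkeeping, here is a small Python sketch of the sector
math I used above. It is just my naive counting under the stated
assumptions (ashift=12, 6 disks, raidz2 = 2 parity sectors per row of
up to 4 data sectors); it is not taken from the ZFS code, so the
mismatch for the gzip block may simply mean my model is incomplete:

#!/usr/bin/env python
# Naive raidz2 allocation estimate - my assumption, not the actual
# ZFS allocator logic. ashift=12 (4KB sectors), 6-disk raidz2 =>
# up to 4 data sectors + 2 parity sectors per stripe row.

SECTOR = 0x1000              # 4 KB, ashift=12
DISKS = 6
PARITY = 2
DATA_COLS = DISKS - PARITY   # 4 data sectors per full row

def naive_asize(psize):
    """Estimate allocated bytes for 'psize' bytes of compressed userdata."""
    data_sectors = -(-psize // SECTOR)      # ceil(psize / 4KB)
    rows = -(-data_sectors // DATA_COLS)    # ceil(data_sectors / 4)
    parity_sectors = rows * PARITY          # 2 parity sectors per row
    return (data_sectors + parity_sectors) * SECTOR

for name, psize, dva_asize in [("lzjb",   0x4800, 0x9000),
                               ("gzip-9", 0x2800, 0x6000)]:
    est = naive_asize(psize)
    print("%-7s psize=0x%04x  my estimate=0x%04x  DVA says=0x%04x"
          % (name, psize, est, dva_asize))

# Output:
#   lzjb    psize=0x4800  my estimate=0x9000  DVA says=0x9000
#   gzip-9  psize=0x2800  my estimate=0x5000  DVA says=0x6000
# i.e. the gzip-9 block gets one sector more than my naive counting.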
> 5) Is there any magic to the checksum algorithms? I.e. if I pass
> some 128KB block's logical (userdata) contents to the command-line
> "sha256" or "openssl sha256" - should I get the same checksum as
> ZFS provides and uses?

The original 128KB file's sha256 checksum matches the uncompressed
block's ZFS checksum, so in my further tests I can use the command-line
tools to verify the recombined results:

# sha256sum /tmp/b128
3c691e8fc86de2ea90a0b76f0d1fe3ff46e055c32dfd116df2af276f0a6a96b9  /tmp/b128

No magic, as long as there are usable command-line implementations of
the needed algorithms (sha256sum is there, fletcher[24] are not).

> 6) What exactly does a checksum apply to - the 128KB userdata block
> or a 15-20KB (lzjb-)compressed portion of data? I am sure it's the
> latter, but I ask just to make sure I'm not missing anything... :)

The ZFS parent block's checksum applies to the on-disk variant of the
userdata payload (compression included, redundancy excluded).

NEW QUESTIONS:

7) Is there a command-line tool to do lzjb compression and
decompression (in the same blocky manner as would be applicable to ZFS
compression)?

I've also tried to gzip-compress the original 128KB file, but none of
the compressed results (with varying gzip levels) yielded a checksum
matching the ZFS block's one. Zero-padding to 10240 bytes (psize=0x2800)
did not help either.

8) When should the decompression stop - as soon as it has extracted
the logical-size number of bytes (i.e. 0x20000)?

9) Physical sizes magically come out in whole 512b units, so it
seems... I doubt that the compressed data would always end at such a
boundary. How many bytes should be covered by the checksum? Are the
512b blocks involved zero-padded at the ends (on disk and/or in RAM)?
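To check my answer to question 6 (and probe question 9), here is the
test I plan to run once I have recombined the physical data - a
minimal Python sketch. The file name "recombined.bin" is hypothetical
(the data sectors dd'ed off the disks and concatenated in order), and
the assumption that the checksum covers exactly the first psize bytes
is mine, not confirmed:

#!/usr/bin/env python
# Sketch: compare a reassembled physical block against the cksum that
# zdb printed. Assumes (my guess) that the sha256 covers exactly the
# first 'psize' bytes of the compressed payload, with any sector
# padding excluded; the padded variant is printed too, just in case.

import hashlib

PSIZE = 0x4800                      # from "size=20000L/4800P" (lzjb block)
ZDB_CKSUM = ("57164faa0c1cbef4" "23348aa9722f47d3"
             "3b1b480dc731610b" "7f62fce0cc18876f")

with open("recombined.bin", "rb") as f:
    payload = f.read()

digest_psize = hashlib.sha256(payload[:PSIZE]).hexdigest()
digest_all   = hashlib.sha256(payload).hexdigest()

print("zdb cksum        : " + ZDB_CKSUM)
print("sha256(psize)    : " + digest_psize)
print("sha256(all bytes): " + digest_all)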
Some OLD questions remain open, just in case anyone can answer them:

2) Do I understand correctly that for the offset definition, sectors
in a top-level VDEV (which is all of my pool) are numbered in rows
across the component disks? Like this:

   disk0 disk1 disk2 disk3 disk4 disk5
     0     1     2     3     4     5
     6     7     8     9    10    11
    ...

That is, "offset % setsize = disknum"? If true, does such a numbering
scheme apply all over the TLVDEV, so that for my block on a 6-disk
raidz2 set its sectors start at (roughly rounded) "offset_from_DVA / 6"
on each disk, right?

3) Then, if I read the ZFS on-disk spec correctly, the sectors of the
first disk holding anything from this block would contain the
raid-algo1 permutations of the four data sectors, the sectors of the
second disk would contain the raid-algo2 for those 4 sectors, and the
remaining 4 disks would contain the data sectors? The redundancy algos
should in fact cover the other redundancy disks too (in order to
sustain the loss of any 2 disks), correct? (...)

4) Where are the redundancy algorithms specified? Is there any simple
tool that would recombine a given algo-N redundancy sector with some
other 4 sectors from a 6-sector stripe in order to try and recalculate
the sixth sector's contents? (Perhaps part of some unit tests?)
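For lack of a ready-made tool, here is the kind of thing I have in
mind for question 4 - a minimal Python sketch, assuming (as I gather,
but have not verified) that the first redundancy column ("raid-algo1")
is a plain XOR over the data sectors of the stripe row, so one missing
data sector can be rebuilt by XORing the parity sector with the
surviving data sectors. The algo-2 column uses different math, which
this sketch does not attempt. The *.bin file names are hypothetical
single-sector dumps made with dd:

#!/usr/bin/env python
# Sketch for question 4: rebuild one missing 4KB data sector from the
# XOR-style parity sector plus the surviving data sectors of the same
# stripe row. Assumes algo-1 parity is a plain XOR of the data columns
# (my working assumption, not a statement about the on-disk format).

SECTOR = 0x1000

def read_sector(path):
    """Read one 4KB sector dump, zero-padding a short read."""
    with open(path, "rb") as f:
        data = f.read(SECTOR)
    return bytearray(data.ljust(SECTOR, b"\0"))

def xor_combine(sectors):
    """XOR equally-sized sectors together byte by byte."""
    out = bytearray(SECTOR)
    for sec in sectors:
        for i in range(SECTOR):
            out[i] ^= sec[i]
    return out

# parity.bin    - the algo-1 (XOR) redundancy sector of the stripe row
# data0..2.bin  - the three surviving data sectors
# If the assumption holds, the result equals the missing fourth sector.
surviving = [read_sector(p) for p in
             ("parity.bin", "data0.bin", "data1.bin", "data2.bin")]
rebuilt = xor_combine(surviving)

with open("rebuilt.bin", "wb") as f:
    f.write(rebuilt)
print("wrote rebuilt.bin (%d bytes)" % len(rebuilt))

Given sector dumps dd'ed from the right offsets (question 2), the
rebuilt payload could then be checked against the zdb cksum with the
sha256 sketch above.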
I'm almost ready to go and test Q2 and Q3; however, the questions
regarding usable tools (and what data should be fed into such tools)
are still on the table.
Thanks a lot in advance for any info, ideas, insights, and just for
reading this long post to the end ;)

//Jim Klimov