On 2012-12-02 05:42, Jim Klimov wrote:
> So... here are some applied questions:
Well, I am ready to reply to a few of my own questions now :)

I've staged an experiment by taking a 128KB block from that file and
appending it to a new file in a test dataset, changing the compression
settings between the appendages. Thus I've got a ZDB dump of three
blocks with identical logical userdata and different physical data:

# zdb -ddddd -bbbbbb -e 1601233584937321596/test3 8 > /pool/test3/a.zdb
...
Indirect blocks:
     0 L1  DVA[0]=<0:59492a98000:3000> DVA[1]=<0:83e2f65000:3000>
        [L1 ZFS plain file] sha256 lzjb LE contiguous unique double
        size=4000L/400P birth=326381727L/326381727P fill=3
        cksum=2ebbfb189e7ce003:166a23fd39d583ed:f527884977645395:896a967526ea9cea
     0  L0 DVA[0]=<0:590002c1000:30000> [L0 ZFS plain file] sha256
        uncompressed LE contiguous unique single size=20000L/20000P
        birth=326381721L/326381721P fill=1
        cksum=3c691e8fc86de2ea:90a0b76f0d1fe3ff:46e055c32dfd116d:f2af276f0a6a96b9
 20000  L0 DVA[0]=<0:594928b8000:9000> [L0 ZFS plain file] sha256 lzjb
        LE contiguous unique single size=20000L/4800P
        birth=326381724L/326381724P fill=1
        cksum=57164faa0c1cbef4:23348aa9722f47d3:3b1b480dc731610b:7f62fce0cc18876f
 40000  L0 DVA[0]=<0:59492a92000:6000> [L0 ZFS plain file] sha256 gzip-9
        LE contiguous unique single size=20000L/2800P
        birth=326381727L/326381727P fill=1
        cksum=d68246ee846944c6:70e28f6c52e0c6ba:ea8f94fc93f8dbfd:c22ad491c1e78530

        segment [0000000000000000, 0000000000080000) size 512K

> 1) So... how DO I properly interpret this to select sector ranges to
> DD into my test area from each of the 6 disks in the raidz2 set?
>
> On one hand, the DVA states the block length is 0x9000, and this
> matches the offsets of neighboring blocks.
>
> On the other hand, compressed "physical" data size is 0x4c00 for
> this block, and ranges 0x4800-0x5000 for other blocks of the file.
> Even multiplied by 1.5 (for raidz2) this is about 0x7000 and way
> smaller than 0x9000. For uncompressed files I think I saw entries
> like "size=20000L/30000P", so I'm not sure even my multiplication
> by 1.5x above is valid, and the discrepancy between DVA size and
> interval, and "physical" allocation size reaches about 2x.

Apparently, my memory failed me. The values in the "size" field refer
to the userdata only (compressed, non-redundant). I had also forgotten
that this pool uses 4KB sectors (ashift=12).

So my userdata, which takes up about 0x4800 bytes, needs 4.5 (rounded
up: 5 whole) sectors, and that warrants 4 sectors of raidz2 redundancy
on a 6-disk set: 2 parity sectors for the first 4 data sectors, and
2 more for the remaining half-sector's worth of data. This does sum up
to 9*0x1000 bytes in whole-sector counting (as in the offsets).

However, the gzip-compressed block above, which has only 0x2800 bytes
of userdata and thus requires 3 data sectors plus 2 redundancy sectors,
still has a DVA size of six 4KB sectors (0x6000). This is strange to
me - I'd expect 5 sectors for this block altogether... does anyone
have an explanation? Also, what should the extra userdata sector
contain physically - zeroes?
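For my own bookkeeping, here is a small Python sketch of the sector
math I used above. It is just my naive counting under the stated
assumptions (ashift=12, 6 disks, raidz2 = 2 parity sectors per row of
up to 4 data sectors); it is not taken from the ZFS code, so the
mismatch for the gzip block may simply mean my model is incomplete:

#!/usr/bin/env python
# Naive raidz2 allocation estimate - my assumption, not the actual
# ZFS allocator logic. ashift=12 (4KB sectors), 6-disk raidz2 =>
# up to 4 data sectors + 2 parity sectors per stripe row.

SECTOR = 0x1000              # 4 KB, ashift=12
DISKS = 6
PARITY = 2
DATA_COLS = DISKS - PARITY   # 4 data sectors per full row

def naive_asize(psize):
    """Estimate allocated bytes for 'psize' bytes of compressed userdata."""
    data_sectors = -(-psize // SECTOR)      # ceil(psize / 4KB)
    rows = -(-data_sectors // DATA_COLS)    # ceil(data_sectors / 4)
    parity_sectors = rows * PARITY          # 2 parity sectors per row
    return (data_sectors + parity_sectors) * SECTOR

for name, psize, dva_asize in [("lzjb",   0x4800, 0x9000),
                               ("gzip-9", 0x2800, 0x6000)]:
    est = naive_asize(psize)
    print("%-7s psize=0x%04x  my estimate=0x%04x  DVA says=0x%04x"
          % (name, psize, est, dva_asize))

# Output:
#   lzjb    psize=0x4800  my estimate=0x9000  DVA says=0x9000
#   gzip-9  psize=0x2800  my estimate=0x5000  DVA says=0x6000
# i.e. the gzip-9 block gets one sector more than my naive counting.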
> 5) Is there any magic to the checksum algorithms? I.e. if I pass
> some 128KB block's logical (userdata) contents to the command-line
> "sha256" or "openssl sha256" - should I get the same checksum as
> ZFS provides and uses?

The original 128KB file's sha256 checksum matches the uncompressed
block's ZFS checksum, so in my further tests I can use the command-line
tools to verify the recombined results:

# sha256sum /tmp/b128
3c691e8fc86de2ea90a0b76f0d1fe3ff46e055c32dfd116df2af276f0a6a96b9  /tmp/b128

No magic, as long as there are usable command-line implementations of
the needed algorithms (sha256sum is there, fletcher[24] are not).

> 6) What exactly does a checksum apply to - the 128KB userdata block
> or a 15-20KB (lzjb-)compressed portion of data? I am sure it's the
> latter, but I ask just to make sure I'm not missing anything... :)

The ZFS parent block's checksum applies to the on-disk variant of the
userdata payload (compression included, redundancy excluded).

NEW QUESTIONS:

7) Is there a command-line tool to do lzjb compression and
decompression (in the same blocky manner as would be applicable to ZFS
compression)?

I've also tried to gzip-compress the original 128KB file, but none of
the compressed results (with varying gzip levels) yielded a checksum
matching the ZFS block's one. Zero-padding to 10240 bytes (psize=0x2800)
did not help either.

8) When should the decompression stop - as soon as it has extracted
the logical-size number of bytes (i.e. 0x20000)?

9) Physical sizes magically come out in whole 512b units, so it
seems... I doubt that the compressed data would always end at such a
boundary. How many bytes should be covered by the checksum? Are the
512b blocks involved zero-padded at the ends (on disk and/or in RAM)?
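To check my answer to question 6 (and probe question 9), here is the
test I plan to run once I have recombined the physical data - a
minimal Python sketch. The file name "recombined.bin" is hypothetical
(the data sectors dd'ed off the disks and concatenated in order), and
the assumption that the checksum covers exactly the first psize bytes
is mine, not confirmed:

#!/usr/bin/env python
# Sketch: compare a reassembled physical block against the cksum that
# zdb printed. Assumes (my guess) that the sha256 covers exactly the
# first 'psize' bytes of the compressed payload, with any sector
# padding excluded; the padded variant is printed too, just in case.

import hashlib

PSIZE = 0x4800                      # from "size=20000L/4800P" (lzjb block)
ZDB_CKSUM = ("57164faa0c1cbef4" "23348aa9722f47d3"
             "3b1b480dc731610b" "7f62fce0cc18876f")

with open("recombined.bin", "rb") as f:
    payload = f.read()

digest_psize = hashlib.sha256(payload[:PSIZE]).hexdigest()
digest_all   = hashlib.sha256(payload).hexdigest()

print("zdb cksum        : " + ZDB_CKSUM)
print("sha256(psize)    : " + digest_psize)
print("sha256(all bytes): " + digest_all)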
Some OLD questions remain open, just in case anyone can answer them:

2) Do I understand correctly that for the offset definition, sectors
in a top-level VDEV (which is all of my pool) are numbered in rows
across the component disks? Like this:

   disk0 disk1 disk2 disk3 disk4 disk5
     0     1     2     3     4     5
     6     7     8     9    10    11
    ...

That is, "offset % setsize = disknum"? If true, does such a numbering
scheme apply all over the TLVDEV, so that for my block on a 6-disk
raidz2 set its sectors start at (roughly rounded) "offset_from_DVA / 6"
on each disk, right?

3) Then, if I read the ZFS on-disk spec correctly, the sectors of the
first disk holding anything from this block would contain the
raid-algo1 permutations of the four data sectors, the sectors of the
second disk would contain the raid-algo2 for those 4 sectors, and the
remaining 4 disks would contain the data sectors? The redundancy algos
should in fact cover the other redundancy disks too (in order to
sustain the loss of any 2 disks), correct? (...)

4) Where are the redundancy algorithms specified? Is there any simple
tool that would recombine a given algo-N redundancy sector with some
other 4 sectors from a 6-sector stripe in order to try and recalculate
the sixth sector's contents? (Perhaps part of some unit tests?)
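For lack of a ready-made tool, here is the kind of thing I have in
mind for question 4 - a minimal Python sketch, assuming (as I gather,
but have not verified) that the first redundancy column ("raid-algo1")
is a plain XOR over the data sectors of the stripe row, so one missing
data sector can be rebuilt by XORing the parity sector with the
surviving data sectors. The algo-2 column uses different math, which
this sketch does not attempt. The *.bin file names are hypothetical
single-sector dumps made with dd:

#!/usr/bin/env python
# Sketch for question 4: rebuild one missing 4KB data sector from the
# XOR-style parity sector plus the surviving data sectors of the same
# stripe row. Assumes algo-1 parity is a plain XOR of the data columns
# (my working assumption, not a statement about the on-disk format).

SECTOR = 0x1000

def read_sector(path):
    """Read one 4KB sector dump, zero-padding a short read."""
    with open(path, "rb") as f:
        data = f.read(SECTOR)
    return bytearray(data.ljust(SECTOR, b"\0"))

def xor_combine(sectors):
    """XOR equally-sized sectors together byte by byte."""
    out = bytearray(SECTOR)
    for sec in sectors:
        for i in range(SECTOR):
            out[i] ^= sec[i]
    return out

# parity.bin    - the algo-1 (XOR) redundancy sector of the stripe row
# data0..2.bin  - the three surviving data sectors
# If the assumption holds, the result equals the missing fourth sector.
surviving = [read_sector(p) for p in
             ("parity.bin", "data0.bin", "data1.bin", "data2.bin")]
rebuilt = xor_combine(surviving)

with open("rebuilt.bin", "wb") as f:
    f.write(rebuilt)
print("wrote rebuilt.bin (%d bytes)" % len(rebuilt))

Given sector dumps dd'ed from the right offsets (question 2), the
rebuilt payload could then be checked against the zdb cksum with the
sha256 sketch above.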
I'm almost ready to go and test Q2 and Q3; however, the questions
regarding usable tools (and what data should be fed into such tools)
are still on the table.
Thanks a lot in advance for any info, ideas, insights, and just for
reading this long post to the end ;)

//Jim Klimov