I was wondering if someone could explain why the DDT is (from empirical observation, at least) kept in a huge number of individual blocks scattered randomly across the pool, rather than as one large contiguous chunk somewhere.
Having been the victim of the really long time it takes to destroy a dataset with dedup=on, I was wondering why that is. From memory, while the destroy was running, something like iopattern -r showed constant 99% random reads. That seems like a very wasteful way to allocate blocks for the DDT.

Having finally deleted the 900GB dataset, I now only have around 152GB (allocated PSIZE) left deduped on that pool:

# zdb -DD tank
DDT-sha256-zap-duplicate: 310684 entries, size 578 on disk, 380 in core
DDT-sha256-zap-unique: 1155817 entries, size 2438 on disk, 1783 in core

So 1466501 DDT entries. For 152GB of data, that's around 108KB per block on average, which seems sane. But to destroy the dataset holding the files which reference the DDT, I'm looking at up to 1.46 million random reads to complete the operation (less whatever is already in ARC or L2ARC). That's a lot of read operations for my poor spindles.

I've seen people say that DDT entries are around 270 bytes each, but does that really matter when the smallest block ZFS can read or write (for obvious reasons) is 512 bytes? Clearly 2x 270B > 512B, but couldn't the DDT elements be grouped together in, say, 1MB blocks?

Thoughts?

(Side note: can someone explain the "size xxx on disk, xxx in core" figures in that zdb output? The numbers never seem related to the number of entries or... anything.)
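For what it's worth, here is the back-of-the-envelope arithmetic behind the numbers above, as a rough Python sketch. It assumes the zdb entry counts quoted above, one random read per DDT entry during the destroy (worst case, ignoring ARC/L2ARC hits), and the ~270-byte per-entry figure I've seen quoted, so treat the "1MB chunk" comparison as purely hypothetical:

# Back-of-the-envelope numbers for the destroy described above.
# Assumptions (not taken from zdb): one random read per DDT entry,
# and ~270 bytes per entry for the hypothetical grouping comparison.

duplicate_entries = 310_684
unique_entries    = 1_155_817
total_entries     = duplicate_entries + unique_entries        # 1,466,501

deduped_bytes = 152 * 2**30                                   # ~152GB allocated PSIZE
avg_block_kb  = deduped_bytes / total_entries / 1024
print(f"average block size: ~{avg_block_kb:.0f} KB")          # ~108 KB

# Worst case: one random read per entry to find and update/free it.
print(f"random reads to walk the DDT: ~{total_entries:,}")

# Hypothetical: if ~270B entries were packed into 1MB chunks instead,
# how many chunk-sized reads would cover the same number of entries?
entry_bytes       = 270
chunk_bytes       = 1 * 2**20
entries_per_chunk = chunk_bytes // entry_bytes                # ~3883
chunk_reads       = -(-total_entries // entries_per_chunk)    # ceiling division
print(f"1MB chunks needed: ~{chunk_reads}")                   # ~378 mostly-sequential reads

That difference (1.46 million small random reads versus a few hundred larger ones) is what prompted the question.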