This is not a problem we're trying to solve, but part of a characterization study of the ZFS implementation. We're currently using the default 8KB blocksize for our zvol deployment, and we're performing tests using write block sizes as small as 4KB and as large as 1MB, as previously described (including an 8KB write aligned to logical zvol block zero, for a perfect match to the zvol blocksize). In all cases we see at least twice as much I/O to the disks as we generate from our test program, and it's much worse for the smaller write block sizes.

We're not exactly caught in read-modify-write hell (except when we write the 4KB blocks that are smaller than the zvol blocksize); it's more like modify-write hell, since the original metadata that maps the 2GB region we're writing is probably read just once and kept in cache for the duration of the test. The large amount of back-end I/O is almost entirely write operations, but these writes include the rewriting of metadata that has to change to reflect the relocation of newly written data (remember, no in-place writes ever occur for data or metadata).

Using the default zvol block size of 8KB, ZFS requires, in just block-pointer metadata, about 1.5% of the total 2GB write region (a large percentage compared with other file systems such as UFS, because ZFS uses a 128-byte block pointer versus an 8-byte UFS block pointer). As new data is written over the old data, the leaves of the metadata tree necessarily change to point to the new on-disk locations of the new data; but any new leaf block pointer requires that a new block of leaf pointers be allocated and written, which requires that the next indirect level up from these leaves point to this new set of leaf pointers, so it must be rewritten itself, and so on up the tree. And remember, metadata is subject to being written in up to 3 copies (the default is 2) any time any of it goes to disk.

The indirect pointer blocks closer to the root of the tree may see only a single pointer change over the course of a 5-second consolidation (depending on the size of the zvol, the size of the block allocation unit in the zvol, and the amount of data actually written to the zvol in 5 seconds), but a complete new indirect block must be created and written to disk (all the way back to the uberblock) on each transaction group write. This means that some of these metadata blocks are written to disk over and over again with only small changes from their previous contents. Consolidating for more than 5 seconds would help mitigate this, but longer consolidation periods put more data at risk of being lost in case of a power failure.

This is not particularly a problem, just a manifestation of the need to never write in place, a rather large block pointer, and the possible writing of multiple copies of metadata (of course that block pointer carries checksums and the addresses of up to 3 duplicate blocks, providing the excellent data and metadata protection ZFS is so well known for).
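To put rough numbers on the block-pointer overhead and the per-write worst case, here is a small back-of-the-envelope sketch in Python. The 16KB indirect block size (128 block pointers per indirect block) is an assumption I'm making for illustration, as is the "one isolated write" scenario; these are not values measured in our runs, and the sketch ignores dnode, space map and uberblock traffic entirely:

# Back-of-the-envelope model of zvol block-pointer overhead and the
# worst-case write amplification under copy-on-write.  The 16 KB
# indirect block size (128 pointers per indirect block) is an assumed
# layout detail, not something we measured.
BLKPTR_SIZE     = 128              # bytes per ZFS block pointer
VOLBLOCKSIZE    = 8 * 1024         # zvol block size used in our tests
INDIRECT_SIZE   = 16 * 1024        # assumed indirect block size
PTRS_PER_IND    = INDIRECT_SIZE // BLKPTR_SIZE   # 128 pointers per indirect block
METADATA_COPIES = 2                # default ditto copies for metadata
REGION          = 2 * 1024**3      # the 2 GB region we overwrite

# (a) leaf block-pointer metadata needed just to map the region
data_blocks = REGION // VOLBLOCKSIZE             # 262,144 zvol blocks
leaf_ptr_bytes = data_blocks * BLKPTR_SIZE       # 32 MiB of leaf pointers
print("leaf block pointers: %.2f%% of region"
      % (100.0 * leaf_ptr_bytes / REGION))       # ~1.56%

# (b) indirect levels needed to reach one pointer per data block
levels, covered = 0, 1
while covered < data_blocks:
    covered *= PTRS_PER_IND
    levels += 1
print("indirect levels:", levels)                # 3 for this zvol

# (c) worst case: one random 8 KB write that shares no dirty indirect
# blocks with any other write in the same transaction group
data_written = VOLBLOCKSIZE
meta_written = levels * INDIRECT_SIZE * METADATA_COPIES
print("worst-case bytes to disk for one 8 KB write: %d (%.1fx)"
      % (data_written + meta_written,
         (data_written + meta_written) / data_written))

The roughly 13x figure in (c) is the uncoalesced worst case for a single isolated 8KB write; in practice many writes in the same transaction group share indirect blocks, which is why our measured amplification sits closer to the "at least 2x" range, and why the smaller write sizes fare worse.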
The original thread that this reply addressed was about the characteristic 5-second delay in writes, which I tried to explain in the context of copy-on-write consolidation; but it's clear that even this delay cannot prevent the same basic metadata from being modified and rewritten many times with only small changes.
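Along the same lines, here is a rough model of how much indirect metadata a single 5-second transaction group has to rewrite for a batch of random 8KB writes. As before, the 16KB indirect block size, the 3-level tree and the 6,000-writes-per-txg rate are illustrative assumptions, not measurements:

# Rough model of indirect metadata rewritten by one 5-second txg for a
# batch of random 8 KB writes into the 2 GB region.  Same assumptions
# as the sketch above; the write count is a made-up example rate.
import random

PTRS_PER_IND = 128                               # assumed pointers per indirect block
IND_SIZE     = 16 * 1024                         # assumed indirect block size
COPIES       = 2                                 # default ditto copies for metadata
DATA_BLOCKS  = (2 * 1024**3) // (8 * 1024)       # 262,144 zvol blocks
LEVELS       = 3                                 # indirect levels for this zvol (see above)

def dirty_indirects(block_ids):
    """Count the unique indirect blocks dirtied at each level in one txg."""
    counts, ids = [], set(block_ids)
    for _ in range(LEVELS):
        ids = {i // PTRS_PER_IND for i in ids}   # parent indirect block ids
        counts.append(len(ids))
    return counts

# e.g. 6,000 random 8 KB writes landing in one 5-second txg
writes = random.sample(range(DATA_BLOCKS), 6000)
per_level = dirty_indirects(writes)
meta_bytes = sum(per_level) * IND_SIZE * COPIES
print("dirty indirect blocks per level:", per_level)
print("metadata rewritten: %.1f MB for %.1f MB of new data"
      % (meta_bytes / 1e6, len(writes) * 8192 / 1e6))

With random 8KB writes nearly every write dirties its own 16KB leaf-pointer block, so the metadata rewritten can exceed the new data itself, while the handful of upper-level indirect blocks (and the path back to the uberblock) are rewritten every transaction group no matter how few of their pointers actually changed; that is exactly the "same metadata written over and over with small modifications" effect described above.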