Thanks, Chris, for digging into this and sharing your results.  These seemingly stranded sectors are properly accounted for in terms of space utilization, since they really are unusable if we're to maintain integrity in the face of a single drive failure.
The way the RAID-Z space accounting works is this:

1) Take the size of your data block (4k in your example) and figure out
   how much parity you need to protect it.  With 512-byte sectors, a 4k
   block is 8 data sectors; on a 4-disk single-parity RAID-Z, each
   stripe row of up to 3 data sectors needs 1 parity sector, so that's
   3 parity sectors, for a total of 11 (5.5k).  See vdev_raidz_asize()
   for details.

2) For single-parity RAID-Z, round up to a multiple of 2 sectors, and
   for double-parity RAID-Z, round up to a multiple of 3 sectors.  This
   becomes ASIZE (6k in your case).  The reason for this is a bit
   complicated, but without this roundup you can end up with stranded
   sectors that are unallocated and unusable, leading to the question,
   "I still have free space, why can't I write a file?"  We simply
   account for these roundup sectors as part of the allocation that
   caused them.

3) Allocate space for ASIZE bytes from the RAID-Z space map.  With the
   first-fit allocator, this aligns the write to the greatest power of 2
   that evenly divides ASIZE (2k in this case).

(There's a small standalone sketch of this arithmetic at the bottom of
this message.)

With all this in mind, what winds up happening is exactly what Chris
surmised.  In this illustration, "A" represents a single sector of data
and "A." indicates its parity:

              Disk
           A    B    C    D
         --------------------
  LBA 0    A.   A    A    A
      1    A.   A    A    A
      2    A.   A    A    X
      3    B.   B    B    B
      4    B.   B    B    B
      5    B.   B    B    X

And so forth.  In this scenario, you wind up with the described
situation of non-contiguous writes on one of the disks, which will kill
the performance.  Sorry about that.

Jeff and I had actually talked at one point about how we could fix
this.  Basically, you could represent the "X" dead sector as an
opportunistic write that would only get sent to disk if it got
aggregated, and would get dropped on the floor otherwise.  I think it
wouldn't be too bad with some pipeline tricks.  If anyone is interested
enough to pick this up, let me know and we can discuss the details.

--Bill

On Tue, Sep 26, 2006 at 07:43:34PM -0500, Chris Csanady wrote:
> On 9/26/06, Richard Elling - PAE <[EMAIL PROTECTED]> wrote:
> >Chris Csanady wrote:
> >> What I have observed with the iosnoop dtrace script is that the
> >> first disks aggregate the single block writes, while the last disk(s)
> >> are forced to do numerous writes every other sector.  If you would
> >> like to reproduce this, simply copy a large file to a recordsize=4k
> >> filesystem on a 4 disk RAID-Z.
> >
> >Why would I want to set recordsize=4k if I'm using large files?
> >For that matter, why would I ever want to use a recordsize=4k, is
> >there a database which needs 4k record sizes?
>
> Sorry, I wasn't very clear about the reasoning for this.  It is not
> something that you would normally do, but it generates just
> the right combination of block size and stripe width to make the
> problem very apparent.
>
> It is also possible to encounter this on a filesystem with the
> default recordsize, and I have observed the effect while extracting
> a large archive of sources.  Still, it was never bad enough for my
> uses to be anything more than a curiosity.  However, while trying
> to rsync 100M ~1k files onto a 4 disk RAID-Z, Gino Ruopolo
> seemingly stumbled upon this worst case performance scenario.
> (Though, unlike my example, it is also possible to end up with
> holes in the second column.)
>
> Also, while it may be a small error, could these stranded sectors
> throw off the space accounting enough to cause problems when
> a pool is nearly full?
>
> Chris
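
P.S.  For anyone who wants to poke at the numbers, here is a rough,
self-contained sketch of the sizing arithmetic from steps 1-3 above.
This is not the actual vdev_raidz_asize() code; the function and macro
names are made up for illustration, and it assumes 512-byte sectors.

/*
 * Sketch only -- NOT the real vdev_raidz_asize() from the ZFS source.
 * Names are hypothetical; 512-byte sectors assumed.
 */
#include <stdio.h>
#include <stdint.h>

#define SECTOR	512ULL

static uint64_t
raidz_asize_sketch(uint64_t psize, uint64_t ndisks, uint64_t nparity)
{
	uint64_t dcols = ndisks - nparity;	/* data columns per stripe row */

	/* Step 1: data sectors plus nparity parity sectors per stripe row. */
	uint64_t dsect = (psize + SECTOR - 1) / SECTOR;
	uint64_t psect = nparity * ((dsect + dcols - 1) / dcols);
	uint64_t total = dsect + psect;		/* 8 + 3 = 11 for a 4k block */

	/* Step 2: round up to a multiple of (nparity + 1) sectors. */
	uint64_t round = nparity + 1;
	total = ((total + round - 1) / round) * round;	/* 11 -> 12 */

	/* Step 3: this many bytes come out of the RAID-Z space map. */
	return (total * SECTOR);
}

int
main(void)
{
	/* Chris's case: 4k blocks on a 4-disk single-parity RAID-Z. */
	printf("asize = %llu bytes\n",
	    (unsigned long long)raidz_asize_sketch(4096, 4, 1));
	return (0);
}

Running it prints "asize = 6144 bytes", i.e. the 12 sectors (6k) that
the roundup in step 2 produces for a 4k block.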