Thanks, Chris, for digging into this and sharing your results.  These
seemingly stranded sectors are in fact properly accounted for in terms
of space utilization, since they genuinely cannot be used while still
maintaining integrity in the face of a single drive failure.

The way the RAID-Z space accounting works is this:

    1) Take the size of your data block (4k, or 8 sectors, in your
       example) and figure out how much parity you need to protect it.
       This turns out to be 3 sectors (one per row of data), for a
       total of 11 sectors (5.5k).  See vdev_raidz_asize() for
       details.
    2) For single-parity RAID-Z, round up to a multiple of 2 sectors,
       and for double-parity RAID-Z, round up to a multiple of 3
       sectors.  This becomes ASIZE (6k in your case).  The reason
       for this is a bit complicated, but without this roundup, you can
       end up with stranded sectors that are unallocated and unusable,
       leading to the question, "I still have free space, why can't I
       write a file?"  We simply account for these roundup sectors
       as part of the allocation that caused them.
    3) Allocate space for ASIZE bytes from the RAID-Z space map.  With
       the first-fit allocator, this aligns the write to the greatest
       power of 2 that evenly divides ASIZE (2k in this case).  (A
       small sketch of this arithmetic follows the list.)

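To make the arithmetic concrete, here is a minimal C sketch of the
steps above.  raidz_asize_sketch() is a made-up name and this is not
the actual vdev_raidz_asize(); it just walks the same math, assuming
512-byte sectors:

    #include <stdio.h>

    /*
     * Simplified sketch of the RAID-Z space accounting described above,
     * assuming 512-byte sectors.  Illustrative only; not the real
     * vdev_raidz_asize().
     */
    static unsigned long long
    raidz_asize_sketch(unsigned long long psize, int ndisks, int nparity)
    {
        unsigned long long nsectors = (psize + 511) / 512;
        int ndata = ndisks - nparity;

        /* Step 1: one parity sector per row of data sectors. */
        unsigned long long asize =
            nsectors + ((nsectors + ndata - 1) / ndata) * nparity;

        /*
         * Step 2: round up to a multiple of (nparity + 1) so that no
         * stranded, unallocatable sectors are left behind.
         */
        asize = ((asize + nparity) / (nparity + 1)) * (nparity + 1);

        return (asize * 512);
    }

    int
    main(void)
    {
        /*
         * 4k block on a 4-disk single-parity RAID-Z: 8 data + 3 parity
         * = 11 sectors, rounded up to 12 sectors, i.e. 6k.
         */
        printf("%llu\n", raidz_asize_sketch(4096, 4, 1));
        return (0);
    }
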
With all this in mind, what winds up happening is exactly what Chris
surmised.  In this illustration, "A" represents a single sector of data
and "A." indicates its parity.

        Disk   A   B   C   D
        --------------------
     LBA   0   A.  A   A   A
           1   A.  A   A   A
           2   A.  A   A   X
           3   B.  B   B   B
           4   B.  B   B   B
           5   B.  B   B   X

And so forth.  In this scenario, you wind up with the described
situation of non-contiguous writes on one of the disks, which will kill
the performance.  Sorry about that.  Jeff and I had actually talked at
one point about how we could fix this.  Basically, you could represent
the "X" dead sector as an opportunistic write that would only get sent
to disk if it got aggregated, and would get dropped on the floor
otherwise.  I think it wouldn't be too bad with some pipeline tricks.
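
To make that a little more concrete, here is a toy C sketch of the
aggregation decision.  The types and names are made up for illustration
and are not from the actual zio pipeline:

    #include <stdbool.h>
    #include <stdint.h>
    #include <stdio.h>

    /*
     * Toy model of the "opportunistic write" idea; nothing here comes
     * from the ZFS source.
     */
    typedef struct pending_write {
        uint64_t offset;        /* byte offset on the child disk */
        uint64_t size;          /* length in bytes */
        bool     optional;      /* dead-sector filler write */
    } pending_write_t;

    /*
     * An optional filler write is worth issuing only if it bridges the
     * writes on either side of it, letting all three be aggregated into
     * a single contiguous I/O.  Otherwise it gets dropped on the floor
     * and never touches the disk.
     */
    static bool
    keep_optional_write(const pending_write_t *prev,
        const pending_write_t *opt, const pending_write_t *next)
    {
        bool glues_prev = (prev != NULL &&
            prev->offset + prev->size == opt->offset);
        bool glues_next = (next != NULL &&
            opt->offset + opt->size == next->offset);

        return (opt->optional && glues_prev && glues_next);
    }

    int
    main(void)
    {
        /*
         * Disk D from the illustration above: block A's data at LBAs
         * 0-1, the dead sector "X" at LBA 2, block B's data at LBAs 3-4.
         */
        pending_write_t a = { 0 * 512, 2 * 512, false };
        pending_write_t x = { 2 * 512, 1 * 512, true };
        pending_write_t b = { 3 * 512, 2 * 512, false };

        printf("issue filler write: %s\n",
            keep_optional_write(&a, &x, &b) ? "yes" : "no");
        return (0);
    }

Here the filler would be issued, because it glues block A's and block
B's writes on disk D into one contiguous write; on a disk where the
filler had no neighbor on one side, it would simply be discarded.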

If anyone is interested enough to pick this up, let me know and we can
discuss the details.


--Bill

On Tue, Sep 26, 2006 at 07:43:34PM -0500, Chris Csanady wrote:
> On 9/26/06, Richard Elling - PAE <[EMAIL PROTECTED]> wrote:
> >Chris Csanady wrote:
> >> What I have observed with the iosnoop dtrace script is that the
> >> first disks aggregate the single block writes, while the last disk(s)
> >> are forced to do numerous writes every other sector.  If you would
> >> like to reproduce this, simply copy a large file to a recordsize=4k
> >> filesystem on a 4 disk RAID-Z.
> >
> >Why would I want to set recordsize=4k if I'm using large files?
> >For that matter, why would I ever want to use a recordsize=4k, is
> >there a database which needs 4k record sizes?
> 
> Sorry, I wasn't very clear about the reasoning for this.  It is not
> something that you would normally do, but it generates just
> the right combination of block size and stripe width to make the
> problem very apparent.
> 
> It is also possible to encounter this on a filesystem with the
> default recordsize, and I have observed the effect while extracting
> a large archive of sources.  Still, it was never bad enough for my
> uses to be anything more than a curiosity.  However, while trying
> to rsync 100M ~1k files onto a 4 disk RAID-Z, Gino Ruopolo
> seemingly stumbled upon this worst-case performance scenario.
> (Though, unlike my example, it is also possible to end up with
> holes in the second column.)
> 
> Also, while it may be a small error, could these stranded sectors
> throw off the space accounting enough to cause problems when
> a pool is nearly full?
> 
> Chris