On 04/17/2013 12:08 AM, Richard Elling wrote:
> clarification below...
>
> On Apr 16, 2013, at 2:44 PM, Sašo Kiselkov <skiselkov...@gmail.com> wrote:
>
>> On 04/16/2013 11:37 PM, Timothy Coalson wrote:
>>> On Tue, Apr 16, 2013 at 4:29 PM, Sašo Kiselkov
>>> <skiselkov...@gmail.com> wrote:
>>>
>>>> If you are IOPS constrained, then yes, raid-zn will be slower, simply
>>>> because any read needs to hit all data drives in the stripe. This is
>>>> even worse on writes if the raidz has bad geometry (the number of data
>>>> drives isn't a power of 2).
>>>
>>> Slightly off topic, but I have always wondered about this - what exactly
>>> causes geometries whose data-drive count isn't a power of 2 to be slower,
>>> and by how much? I tested for this effect with some consumer drives,
>>> comparing 8+2 and 10+2, and didn't see much of a penalty (though the only
>>> random test I ran was reads; our workload is highly sequential, so it
>>> wasn't important).
>
> This makes sense, even for more random workloads.
>
>> Because a non-power-of-2 number of drives causes a read-modify-write
>> sequence on (almost) every write. HDDs are block devices and can only
>> ever write in increments of their sector size (512 bytes, or nowadays
>> often 4096 bytes). Using your example above: divide a 128k block by 8
>> and you get 8x 16k updates, all nicely aligned on 512-byte boundaries,
>> so your drives can write them in one go. Divide by 10 and you get an
>> ugly 12.8k, which means that if your drives are of the 512-byte-sector
>> variety, they write 25 full sectors and then, for the last partial
>> sector, they first have to fetch the sector from the platter, modify it
>> in memory and write it back out again.
>
> This is true for RAID-5/6, but it is not true for ZFS or raidz. Though it
> has been a few years, I did a bunch of tests and found no correlation
> between the number of disks in the set (within the boundaries described in
> the man page) and random performance for raidz. This is not the case for
> RAID-5/6, where pathologically bad performance is easy to create if you
> know the number of disks and stripe width.
>  -- richard
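
For reference, the arithmetic in the quoted 8-vs-10 example works out like
this (a standalone illustrative sketch, not ZFS or RAID code; the 128k block
and 512-byte sector sizes are simply the figures used in the quote):

    #include <stdio.h>

    /*
     * Illustrative sketch only, not ZFS or RAID code: per-disk chunk
     * sizes for a 128k block spread over 8 vs. 10 data disks, and
     * whether the split lands on 512-byte sector boundaries.
     */
    int
    main(void)
    {
        const unsigned block = 128 * 1024;      /* 128k block */
        const unsigned sector = 512;            /* drive sector size */
        const unsigned ndata[] = { 8, 10 };     /* data disks in the stripe */

        for (int i = 0; i < 2; i++) {
            unsigned n = ndata[i];
            printf("%2u data disks: %7.1f bytes per disk, %s\n",
                n, (double)block / n,
                block % (n * sector) == 0 ?
                "sector-aligned" : "NOT sector-aligned");
        }
        return (0);
    }

Only the 10-disk split leaves a per-disk remainder that isn't sector-aligned,
and that partial sector is what triggers a read-modify-write on a parity RAID
that updates stripes in place.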
You are right, and I think I already know where I went wrong, though I'll
need to check raidz_map_alloc() to confirm. If memory serves me right, raidz
actually splits the I/O up so that each stripe component is simply
length-aligned and padded out to a full sector (otherwise the
zio_vdev_child_io() calls would fail the block-alignment assertion in
zio_create() here):

    zio_create(zio_t *pio, spa_t *spa, ...
    {
        ...
        ASSERT(P2PHASE(size, SPA_MINBLOCKSIZE) == 0);
        ...

I was probably misremembering the power-of-2 rule from a discussion about
4k-sector drives. There the amount of wasted space can be significant,
especially for small-block data: e.g. with the default 8k volblocksize, a
block can never spread across more than 2 data drives plus parity.

Cheers,
--
Saso
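
P.S. To put a number on the 4k-sector waste: below is a rough standalone
sketch of the raidz space accounting as I remember it (modeled loosely on
the asize calculation in vdev_raidz.c; treat the exact formula and the
helper name raidz_sectors() as my assumptions, not the real code).

    #include <stdio.h>
    #include <stdint.h>

    /*
     * Assumed model, not the actual ZFS code: sectors allocated for one
     * psize-byte block on a raidz of ndata data drives plus nparity
     * parity drives, with (1 << ashift)-byte sectors.
     */
    static uint64_t
    raidz_sectors(uint64_t psize, int ashift, uint64_t ndata, uint64_t nparity)
    {
        uint64_t dcols = ((psize - 1) >> ashift) + 1;        /* data sectors */
        uint64_t asize = dcols +
            nparity * ((dcols + ndata - 1) / ndata);         /* plus parity */
        uint64_t r = nparity + 1;

        return ((asize + r - 1) / r * r);   /* pad to a multiple of nparity+1 */
    }

    int
    main(void)
    {
        const uint64_t volblock = 8192;          /* default zvol block size */
        const int ashift = 12;                   /* 4k-sector drives */
        const uint64_t widths[] = { 2, 4, 8 };   /* data drives in a raidz1 */

        for (int i = 0; i < 3; i++) {
            uint64_t bytes =
                raidz_sectors(volblock, ashift, widths[i], 1) << ashift;
            printf("%llu+1 raidz1: %llu bytes allocated per 8k block\n",
                (unsigned long long)widths[i], (unsigned long long)bytes);
        }
        return (0);
    }

All three geometries come out at 16k for an 8k block (2 data sectors, 1
parity sector, 1 pad sector), which is the scaling wall I meant above:
making the raidz1 wider doesn't let a small block use more data drives.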