Erik Trimble wrote:
 On 9/9/2010 2:15 AM, taemun wrote:
Erik: does that mean that keeping the number of data drives in a raidz(n) to a power of two is better? In the example you gave, you mentioned 14kb being written to each drive. That doesn't sound very efficient to me.

(when I say the above, I mean a five disk raidz or a ten disk raidz2, etc)

Cheers,


Well, since the size of a slab can vary (from 512 bytes to 128k), it's hard to say. Length (size) of the slab is likely the better determination. Remember each block on a hard drive is 512 bytes (for now). So, it's really not any more efficient to write 16k than 14k (or vice versa). Both are integer multiples of 512 bytes.

IIRC, there was something about using a power-of-two number of data drives in a RAIDZ, but I can't remember what that was. It may just be a phantom memory.

Not a phantom memory...

From Matt Ahrens in a thread titled 'Metaslab alignment on RAID-Z':
http://www.opensolaris.org/jive/thread.jspa?messageID=60241
'To eliminate the blank "round up" sectors for power-of-two blocksizes of 8k or larger, you should use a power-of-two plus 1 number of disks in your raid-z group -- that is, 3, 5, or 9 disks (for double-parity, use a power-of-two plus 2 -- that is, 4, 6, or 10). Smaller blocksizes are more constrained; for 4k, use 3 or 5 disks (for double parity, use 4 or 6) and for 2k, use 3 disks (for double parity, use 4).'


These round-up sectors are skipped and used as padding to simplify space accounting and improve performance. I may have referred to them as zero-padding sectors in other posts; however, they're not necessarily zeroed.

See the thread titled 'raidz stripe size (not stripe width)' http://opensolaris.org/jive/thread.jspa?messageID=495351


This looks to be the reasoning behind the optimization in the ZFS Best Practices Guide that says the number of devices in a vdev should be (N+P), with P = 1 (raidz), 2 (raidz2), or 3 (raidz3) and N equal to 2, 4, or 8.
In other words, 2^k + P devices, where k is 1, 2, or 3 and P is the RAIDZ parity level.

That is, the optimal sizes are:
RAIDZ1 vdevs should have 3, 5, or 9 devices in each vdev
RAIDZ2 vdevs should have 4, 6, or 10 devices in each vdev
RAIDZ3 vdevs should have 5, 7, or 11 devices in each vdev

The Best Practices Guide's general recommendation of 3-9 devices per vdev appears to be based on RAIDZ1's optimal sizes of 3, 5, or 9 devices (k = 1 to 3 in 2^k + P).
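
As a sanity check, here is a quick sketch (Python, assuming 512-byte sectors, and testing only the divisibility argument from Matt Ahrens' post above rather than the actual ZFS allocator) that reproduces those widths for a 128 KB block:

# Rough sketch, not the actual ZFS allocator: for a power-of-two block,
# list the vdev widths whose data columns divide it evenly.
SECTORS = (128 * 1024) // 512          # a 128 KB block in 512-byte sectors

for nparity, name in ((1, "raidz1"), (2, "raidz2"), (3, "raidz3")):
    widths = [n for n in range(nparity + 2, 13) if SECTORS % (n - nparity) == 0]
    print(name, widths)

# prints: raidz1 [3, 5, 9], raidz2 [4, 6, 10], raidz3 [5, 7, 11]

For blocksizes below 8 KB Matt's list is more restrictive, so this divisibility check alone doesn't tell the whole story.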

In a thread titled 'odd versus even', Victor Latushkin said the same thing, and Adam Leventhal noted in that thread that it has a 'very slight space-efficiency benefit'.
http://www.mail-archive.com/zfs-discuss@opensolaris.org/msg05460.html

---
That said, the Best Practices Guide's recommendations to start RAIDZ2 at 5 disks and RAIDZ3 at 8 disks do not match the (N+P) recommendation above. What is the reasoning behind 5 and 8? Reliability versus space?
Start a single-parity RAIDZ (raidz) configuration at 3 disks (2+1)
Start a double-parity RAIDZ (raidz2) configuration at 5 disks (3+2)
Start a triple-parity RAIDZ (raidz3) configuration at 8 disks (5+3)
(N+P) with P = 1 (raidz), 2 (raidz2), or 3 (raidz3) and N equals 2, 4, or 8


Perhaps the Best Practices Guide should also:
- recommend striped (multiple) vdevs to bring up the IOPS number, particularly when using enough hard drives to meet the capacity and reliability requirements
- recommend avoiding slow consumer-class drives (fast ones may be okay for some users)
- include more sample array configurations for common drive chassis capacities
- suggest considering a RAIDZ1 main pool with a RAIDZ1 backup pool rather than higher-level RAIDZ or mirroring (touch on the value of backups vs. stronger RAIDZ)
- warn about BIOS or firmware upgrades that change host protected area (HPA) settings on drives, making them appear smaller than before

The BPG should also resolve this discrepancy:
Storage Pools section: "For production systems, use whole disks rather than slices for storage pools for the following reasons"
Additional Cautions for Storage Pools: "Consider planning ahead and reserving some space by creating a slice which is smaller than the whole disk instead of the whole disk."

---


Other (somewhat) related threads:


From Darren Dunham in a thread titled 'ZFS raidz2 number of disks':
http://groups.google.com/group/comp.unix.solaris/browse_thread/thread/dd1b5997bede5265
'> 1 Why is the recommendation for a raidz2 3-9 disk, what are the cons for having 16 in a pool compared to 8?

Reads potentially have to pull data from all data columns to reconstruct a filesystem block for verification. For random read workloads, increasing the number of columns in the raidz does not increase the read iops. So limiting the column count usually makes sense (with a cost tradeoff). 16 is valid, but not recommended.'
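
To put rough numbers on that (a back-of-the-envelope model only; the per-drive IOPS figure is invented, and it ignores caching and prefetch): if each raidz vdev delivers roughly one drive's worth of random-read IOPS, as Darren's point suggests, then splitting a fixed number of drives into more, narrower vdevs is what raises random-read IOPS. That is also the reasoning behind the striped-vdev suggestion above.

# Crude model: assume each raidz vdev delivers about one drive's worth of
# random-read IOPS, regardless of its width.
DISK_IOPS = 100            # assumed per-drive random-read IOPS (illustrative only)
TOTAL_DISKS = 16

for vdevs in (1, 2, 4):
    width = TOTAL_DISKS // vdevs
    print(f"{vdevs} x {width}-disk raidz2: ~{vdevs * DISK_IOPS} random-read IOPS")

# prints ~100, ~200, and ~400 IOPS respectively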



From Richard Elling in a thread titled 'rethinking RaidZ and Record size':
http://opensolaris.org/jive/thread.jspa?threadID=121016
'The raidz pathological worst case is a random read from many-column raidz where files have records 128 KB in size. The inflated read problem is why it makes sense to match recordsize for fixed record workloads. This includes CIFS workloads which use 4 KB records. It is also why having many columns in the raidz for large records does not improve performance. Hence the 3 to 9 raidz disk limit recommendation in the zpool man page.'
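
To put a rough number on that inflated-read effect (an illustration only; it ignores parity, the ARC, and prefetch): a small random read still pulls the whole record off the vdev so the checksum can be verified.

# How much data comes off the vdev per small random read, relative to the
# request size, for a few recordsize settings. Illustration only.
REQUEST = 4 * 1024                       # e.g. a 4 KB CIFS-style record

for recordsize in (4 * 1024, 8 * 1024, 128 * 1024):
    print(f"recordsize {recordsize // 1024:>3} KB: {recordsize // REQUEST}x the requested data per read")

# prints 1x, 2x, and 32x respectively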



From Adam Leventhal in a thread titled 'Problem with RAID-Z in builds snv_120 - snv_123':
http://www.mail-archive.com/zfs-discuss@opensolaris.org/msg28907.html
'Basically, RAID-Z writes full stripes every time; note that without careful accounting it would be possible to effectively fragment the vdev such that single sectors were free but useless since single-parity RAID-Z requires two adjacent sectors to store data (one for data, one for parity). To address this, RAID-Z rounds up its allocation to the next multiple of (nparity + 1). This ensures that all space is accounted for. RAID-Z will thus skip sectors that are unused based on this rounding. For example, under raidz1 a write of 1024 bytes would result in 512 bytes of parity, 512 bytes of data on two devices, and 512 bytes skipped.

To improve performance, ZFS aggregates multiple adjacent IOs into a single large IO. Further, hard drives themselves can perform aggregation of adjacent IOs. We noted that these skipped sectors were inhibiting performance so added "optional" IOs that could be used to improve aggregation. This yielded a significant performance boost for all RAID-Z configurations.'
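
Here is a small sketch of that round-up (not the actual ZFS code; it assumes 512-byte sectors and one parity sector per parity level per row of data columns), which reproduces the 1024-byte raidz1 example:

import math

def raidz_alloc_sectors(psize, ndisks, nparity, sector=512):
    data = math.ceil(psize / sector)                 # data sectors
    rows = math.ceil(data / (ndisks - nparity))      # rows of data columns
    parity = rows * nparity                          # parity sectors (nparity per row)
    total = data + parity
    # round up to a multiple of (nparity + 1); the difference is the skipped sectors
    rounded = math.ceil(total / (nparity + 1)) * (nparity + 1)
    return data, parity, rounded - total             # (data, parity, skipped)

# Adam's example: a 1024-byte write on single-parity raidz (5-disk vdev assumed)
print(raidz_alloc_sectors(1024, ndisks=5, nparity=1))   # -> (2, 1, 1): one skipped sector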



From Adam Leventhal in a thread titled 'triple-parity: RAID-Z3':
http://opensolaris.org/jive/thread.jspa?threadID=108154
'> So I'm not sure what the 'RAID-Z should mind the gap on writes'
> comment is getting at either.
>
> Clarification?

I'm planning to write a blog post describing this, but the basic problem is that RAID-Z, by virtue of supporting variable stripe writes (the insight that allows us to avoid the RAID-5 write hole), must round the number of sectors up to a multiple of nparity+1. This means that we may have sectors that are effectively skipped. ZFS generally lays down data in large contiguous streams, but these skipped sectors can stymie both ZFS's write aggregation as well as the hard drive's ability to group I/Os and write them quickly.

Jeff Bonwick added some code to mind these gaps on reads. The key insight there is that if we're going to read 64K, say, with a 512 byte hole in the middle, we might as well do one big read rather than two smaller reads and just throw out the data that we don't care about.

Of course, doing this for writes is a bit trickier since we can't just blithely write over gaps as those might contain live data on the disk. To solve this we push the knowledge of those skipped sectors down to the I/O aggregation layer in the form of 'optional' I/Os purely for the purpose of coalescing writes into larger chunks.'
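
A toy version of the read-side 'mind the gap' idea (nothing like the real aggregation code; the 512-byte gap threshold here is arbitrary): merge read ranges separated by a small hole into one larger I/O and throw away the data from the hole.

def coalesce(ranges, max_gap=512):
    merged = [list(ranges[0])]
    for start, end in ranges[1:]:
        if start - merged[-1][1] <= max_gap:
            merged[-1][1] = end          # absorb the gap: one big I/O
        else:
            merged.append([start, end])
    return merged

# Reading 64 KB with a 512-byte hole in the middle: one 64.5 KB I/O, not two.
print(coalesce([(0, 32768), (33280, 66048)]))   # -> [[0, 66048]]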

_______________________________________________
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
