Erik Trimble wrote:
 On 9/9/2010 2:15 AM, taemun wrote:
Erik: does that mean that keeping the number of data drives in a raidz(n) to a power of two is better? In the example you gave, you mentioned 14kb being written to each drive. That doesn't sound very efficient to me.

(when I say the above, I mean a five disk raidz or a ten disk raidz2, etc)

Cheers,


Well, since the size of a slab can vary (from 512 bytes to 128k), it's hard to say. Length (size) of the slab is likely the better determination. Remember each block on a hard drive is 512 bytes (for now). So, it's really not any more efficient to write 16k than 14k (or vice versa). Both are integer multiples of 512 bytes.

IIRC, there was something about using a power-of-two number of data drives in a RAIDZ, but I can't remember what that was. It may just be a phantom memory.

Not a phantom memory...

From Matt Ahrens in a thread titled 'Metaslab alignment on RAID-Z':
http://www.opensolaris.org/jive/thread.jspa?messageID=60241
'To eliminate the blank "round up" sectors for power-of-two blocksizes of 8k or larger, you should use a power-of-two plus 1 number of disks in your raid-z group -- that is, 3, 5, or 9 disks (for double-parity, use a power-of-two plus 2 -- that is, 4, 6, or 10). Smaller blocksizes are more constrained; for 4k, use 3 or 5 disks (for double parity, use 4 or 6) and for 2k, use 3 disks (for double parity, use 4).'


These round-up sectors are skipped and used as padding to simplify space accounting and improve performance. I may have referred to them as zero-padding sectors in other posts; however, they're not necessarily zeroed.

See the thread titled 'raidz stripe size (not stripe width)' http://opensolaris.org/jive/thread.jspa?messageID=495351


This looks to be the reasoning behind the optimization in the ZFS Best Practices Guide that says the number of devices in a vdev should be (N+P), with P = 1 (raidz), 2 (raidz2), or 3 (raidz3) and N equal to 2, 4, or 8.
In other words, 2^k + P devices, where k is 1, 2, or 3 and P is the RAIDZ parity level.

That is, the optimal sizes are:
RAIDZ1 vdevs should have 3, 5, or 9 devices in each vdev
RAIDZ2 vdevs should have 4, 6, or 10 devices in each vdev
RAIDZ3 vdevs should have 5, 7, or 11 devices in each vdev

The Best Practices Guide's general recommendation of 3-9 devices per vdev appears to be based on RAIDZ1's optimal sizes of 3, 5, or 9 devices (k = 1 to 3 in 2^k + P).
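
As a sanity check, here is a quick sketch (Python, assuming 512-byte sectors, and testing only the divisibility argument from Matt Ahrens' post above rather than the actual ZFS allocator) that reproduces those widths for a 128 KB block:

# Rough sketch, not the actual ZFS allocator: for a power-of-two block,
# list the vdev widths whose data columns divide it evenly.
SECTORS = (128 * 1024) // 512          # a 128 KB block in 512-byte sectors

for nparity, name in ((1, "raidz1"), (2, "raidz2"), (3, "raidz3")):
    widths = [n for n in range(nparity + 2, 13) if SECTORS % (n - nparity) == 0]
    print(name, widths)

# prints: raidz1 [3, 5, 9], raidz2 [4, 6, 10], raidz3 [5, 7, 11]

For blocksizes below 8 KB Matt's list is more restrictive, so this divisibility check alone doesn't tell the whole story.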

In a thread titled 'odd versus even', Victor Latushkin said the same thing, and Adam Leventhal noted in that thread that it has a 'very slight space-efficiency benefit'.
http://www.mail-archive.com/zfs-discuss@opensolaris.org/msg05460.html

---
That said, the Best Practices Guide's recommendations to start RAIDZ2 at 5 disks and RAIDZ3 at 8 disks do not match the (N+P) recommendation above. What is the reasoning behind 5 and 8? Reliability versus space?
Start a single-parity RAIDZ (raidz) configuration at 3 disks (2+1)
Start a double-parity RAIDZ (raidz2) configuration at 5 disks (3+2)
Start a triple-parity RAIDZ (raidz3) configuration at 8 disks (5+3)
(N+P) with P = 1 (raidz), 2 (raidz2), or 3 (raidz3) and N equals 2, 4, or 8


Perhaps the Best Practices Guide should also:
- recommend striped (multiple) vdevs to bring up the IOPS number, particularly when using enough hard drives to meet the capacity and reliability requirements
- recommend avoiding slow consumer-class drives (fast ones may be okay for some users)
- include more sample array configurations for common drive chassis capacities
- suggest considering a RAIDZ1 main pool with a RAIDZ1 backup pool rather than higher-level RAIDZ or mirroring (touch on the value of backups vs. stronger RAIDZ)
- warn about BIOS or firmware upgrades that change host protected area (HPA) settings on drives, making them appear smaller than before

The BPG should also resolve this discrepancy:
Storage Pools section: "For production systems, use whole disks rather than slices for storage pools for the following reasons"
Additional Cautions for Storage Pools: "Consider planning ahead and reserving some space by creating a slice which is smaller than the whole disk instead of the whole disk."

---


Other (somewhat) related threads:


From Darren Dunham in a thread titled 'ZFS raidz2 number of disks':
http://groups.google.com/group/comp.unix.solaris/browse_thread/thread/dd1b5997bede5265
'> 1 Why is the recommendation for a raidz2 3-9 disk, what are the cons for having 16 in a pool compared to 8?

Reads potentially have to pull data from all data columns to reconstruct a filesystem block for verification. For random read workloads, increasing the number of columns in the raidz does not increase the read iops. So limiting the column count usually makes sense (with a cost tradeoff). 16 is valid, but not recommended.'
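
To put rough numbers on that (a back-of-the-envelope model only; the per-drive IOPS figure is invented, and it ignores caching and prefetch): if each raidz vdev delivers roughly one drive's worth of random-read IOPS, as Darren's point suggests, then splitting a fixed number of drives into more, narrower vdevs is what raises random-read IOPS. That is also the reasoning behind the striped-vdev suggestion above.

# Crude model: assume each raidz vdev delivers about one drive's worth of
# random-read IOPS, regardless of its width.
DISK_IOPS = 100            # assumed per-drive random-read IOPS (illustrative only)
TOTAL_DISKS = 16

for vdevs in (1, 2, 4):
    width = TOTAL_DISKS // vdevs
    print(f"{vdevs} x {width}-disk raidz2: ~{vdevs * DISK_IOPS} random-read IOPS")

# prints ~100, ~200, and ~400 IOPS respectively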



From Richard Elling in a thread titled 'rethinking RaidZ and Record size':
http://opensolaris.org/jive/thread.jspa?threadID=121016
'The raidz pathological worst case is a random read from many-column raidz where files have records 128 KB in size. The inflated read problem is why it makes sense to match recordsize for fixed record workloads. This includes CIFS workloads which use 4 KB records. It is also why having many columns in the raidz for large records does not improve performance. Hence the 3 to 9 raidz disk limit recommendation in the zpool man page.'
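
To put a rough number on that inflated-read effect (an illustration only; it ignores parity, the ARC, and prefetch): a small random read still pulls the whole record off the vdev so the checksum can be verified.

# How much data comes off the vdev per small random read, relative to the
# request size, for a few recordsize settings. Illustration only.
REQUEST = 4 * 1024                       # e.g. a 4 KB CIFS-style record

for recordsize in (4 * 1024, 8 * 1024, 128 * 1024):
    print(f"recordsize {recordsize // 1024:>3} KB: {recordsize // REQUEST}x the requested data per read")

# prints 1x, 2x, and 32x respectively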



From Adam Leventhal in a thread titled 'Problem with RAID-Z in builds snv_120 - snv_123':
http://www.mail-archive.com/zfs-discuss@opensolaris.org/msg28907.html
'Basically, RAID-Z writes full stripes every time; note that without careful accounting it would be possible to effectively fragment the vdev such that single sectors were free but useless since single-parity RAID-Z requires two adjacent sectors to store data (one for data, one for parity). To address this, RAID-Z rounds up its allocation to the next multiple of (nparity + 1). This ensures that all space is accounted for. RAID-Z will thus skip sectors that are unused based on this rounding. For example, under raidz1 a write of 1024 bytes would result in 512 bytes of parity, 512 bytes of data on two devices, and 512 bytes skipped.

To improve performance, ZFS aggregates multiple adjacent IOs into a single large IO. Further, hard drives themselves can perform aggregation of adjacent IOs. We noted that these skipped sectors were inhibiting performance so added "optional" IOs that could be used to improve aggregation. This yielded a significant performance boost for all RAID-Z configurations.'
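
Here is a small sketch of that round-up (not the actual ZFS code; it assumes 512-byte sectors and one parity sector per parity level per row of data columns), which reproduces the 1024-byte raidz1 example:

import math

def raidz_alloc_sectors(psize, ndisks, nparity, sector=512):
    data = math.ceil(psize / sector)                 # data sectors
    rows = math.ceil(data / (ndisks - nparity))      # rows of data columns
    parity = rows * nparity                          # parity sectors (nparity per row)
    total = data + parity
    # round up to a multiple of (nparity + 1); the difference is the skipped sectors
    rounded = math.ceil(total / (nparity + 1)) * (nparity + 1)
    return data, parity, rounded - total             # (data, parity, skipped)

# Adam's example: a 1024-byte write on single-parity raidz (5-disk vdev assumed)
print(raidz_alloc_sectors(1024, ndisks=5, nparity=1))   # -> (2, 1, 1): one skipped sector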



From Adam Leventhal in a thread titled 'triple-parity: RAID-Z3':
http://opensolaris.org/jive/thread.jspa?threadID=108154
'> So I'm not sure what the 'RAID-Z should mind the gap on writes'
> comment is getting at either.
>
> Clarification?

I'm planning to write a blog post describing this, but the basic problem is that RAID-Z, by virtue of supporting variable stripe writes (the insight that allows us to avoid the RAID-5 write hole), must round the number of sectors up to a multiple of nparity+1. This means that we may have sectors that are effectively skipped. ZFS generally lays down data in large contiguous streams, but these skipped sectors can stymie both ZFS's write aggregation as well as the hard drive's ability to group I/Os and write them quickly.

Jeff Bonwick added some code to mind these gaps on reads. The key insight there is that if we're going to read 64K, say, with a 512 byte hole in the middle, we might as well do one big read rather than two smaller reads and just throw out the data that we don't care about.

Of course, doing this for writes is a bit trickier since we can't just blithely write over gaps as those might contain live data on the disk. To solve this we push the knowledge of those skipped sectors down to the I/O aggregation layer in the form of 'optional' I/Os purely for the purpose of coalescing writes into larger chunks.'
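
A toy version of the read-side 'mind the gap' idea (nothing like the real aggregation code; the 512-byte gap threshold here is arbitrary): merge read ranges separated by a small hole into one larger I/O and throw away the data from the hole.

def coalesce(ranges, max_gap=512):
    merged = [list(ranges[0])]
    for start, end in ranges[1:]:
        if start - merged[-1][1] <= max_gap:
            merged[-1][1] = end          # absorb the gap: one big I/O
        else:
            merged.append([start, end])
    return merged

# Reading 64 KB with a 512-byte hole in the middle: one 64.5 KB I/O, not two.
print(coalesce([(0, 32768), (33280, 66048)]))   # -> [[0, 66048]]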

_______________________________________________
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
