On 21/12/2010 05:44, Richard Elling wrote:
On Dec 20, 2010, at 7:31 AM, Phil Harman <phil.har...@gmail.com> wrote:
On 20/12/2010 13:59, Richard Elling wrote:
On Dec 20, 2010, at 2:42 AM, Phil Harman <phil.har...@gmail.com> wrote:
Why does resilvering take so long in raidz anyway?
Because it's broken. There were some changes a while back that made
it more broken.
"broken" is the wrong term here. It functions as designed and correctly
resilvers devices. Disagreeing with the design is quite different than
proving a defect.
It might be the wrong term in general, but I think it does apply in
the budget home media server context of this thread.
If you only have a few slow drives, you don't have performance.
Like trying to win the Indianapolis 500 with a tricycle...
The context of this thread is a budget home media server (certainly not
the Indy 500, but perhaps not as humble as tricycle touring either). And
whilst it is a habit of the hardware advocate to blame the software ...
and vice versa ... that's not much help to those of us trying to build
"good enough" systems across the performance and availability spectrum.
I think we can agree that ZFS currently doesn't play well on cheap
disks. I think we can also agree that the performance of ZFS
resilvering is known to be suboptimal under certain conditions.
... and those conditions are also a strength. For example, most file
systems are nowhere near full. With ZFS you only resilver data. For those
who recall the resilver throttles in SVM or VXVM, you will appreciate not
having to resilver non-data.
I'd love to see the data and analysis for the assertion that "most file
systems are nowhere near full", discounting, of course, any trivial
cases. In my experience, in any cost conscious scenario, in the home or
the enterprise, the expectation is that I'll get to use the majority of
the space I've paid for (generally "through the nose" from the storage
silo team in the enterprise scenario). To borrow your illustration, even
Indy 500 teams care about fuel consumption.
What I don't appreciate is having to resilver significantly more data
than the drive can contain. But when it comes to the crunch, what I'd
really appreciate is a bounded resilver time measured in hours, not days
or weeks.
For a long time at Sun, the rule was "correctness is a constraint,
performance is a goal". However, in the real world, performance is
often also a constraint (just as a quick but erroneous answer is a
wrong answer, so also, a slow but correct answer can also be "wrong").
Then one brave soul at Sun ventured that "if Linux is faster,
it's a Solaris bug!" and to his surprise, the idea caught on. I later
went on to tell people that ZFS delivered RAID "where I =
inexpensive", so I'm just a little frustrated when that promise
becomes less respected over time. First it was USB drives (which I
agreed with), now it's SATA (and I'm not so sure).
"slow" doesn't begin with an "i" :-)
Both ZFS and RAID promised to play in the inexpensive space.
There has been a lot of discussion, anecdotes and some data on this
list.
"slow because I use devices with poor random write(!) performance"
is very different than "broken."
Again, context is everything. For example, if someone was building a
business critical NAS appliance from consumer grade parts, I'd be the
first to say "are you nuts?!"
Unfortunately, the math does not support your position...
Actually, the math (e.g. raw drive metrics) doesn't lead me to expect
such a disparity.
The resilver doesn't do a single pass of the drives, but uses a
"smarter" temporal algorithm based on metadata.
A design that only does a single pass does not handle the temporal
changes. Many RAID implementations use a mix of spatial and temporal
resilvering and suffer with that design decision.
Actually, it's easy to see how a combined spatial and temporal
approach could be implemented to advantage for mirrored vdevs.
However, the current implementation has difficulty finishing the job
if there's a steady flow of updates to the pool.
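To make the trade-off concrete, here's a toy sketch (not the ZFS code;
the block counts and random layout are invented) of why a temporal,
metadata-ordered walk turns into seek-heavy I/O on the disks, whereas a
plain spatial pass is sequential but has to touch non-data too:

# Toy illustration (not the ZFS code): a temporal (metadata-order)
# resilver produces scattered I/O, while a spatial (LBA-order) pass is
# sequential. Block counts and the random txg layout are assumptions.
import random

random.seed(0)
BLOCKS = 10000

# Each allocated block has a birth transaction group (txg) and an
# on-disk offset (LBA). Over the life of a pool the two orders diverge
# as blocks are freed and reallocated.
blocks = [{"txg": random.randint(1, 500), "lba": lba} for lba in range(BLOCKS)]

def seeks(order):
    """Count non-contiguous LBA transitions, a rough proxy for head seeks."""
    lbas = [b["lba"] for b in order]
    return sum(1 for a, b in zip(lbas, lbas[1:]) if b != a + 1)

spatial = sorted(blocks, key=lambda b: b["lba"])               # whole-device pass
temporal = sorted(blocks, key=lambda b: (b["txg"], b["lba"]))  # walk by birth time

print("spatial pass seeks: ", seeks(spatial))    # ~0: purely sequential
print("temporal pass seeks:", seeks(temporal))   # almost every transition is a seek

The upside of the temporal walk is that it only ever visits allocated
blocks; the downside is the random I/O pattern this thread is
complaining about.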
Please define current. There are many releases of ZFS, and
many improvements have been made over time. What has not
improved is the random write performance of consumer-grade
HDDs.
I was led to believe this was not yet fixed in Solaris 11, and that
there are therefore doubts about what Solaris 10 update may see the
fix, if any.
As far as I'm aware, the only way to get bounded resilver times is
to stop the workload until resilvering is completed.
I know of no RAID implementation that bounds resilver times
for HDDs. I believe it is not possible. OTOH, whether a resilver
takes 10 seconds or 10 hours makes little difference in data
availability. Indeed, this is why we often throttle resilvering
activity. See previous discussions on this forum regarding the
dueling RFEs.
I don't share your disbelief or "little difference" analysis. If it
is true that no current implementation succeeds, isn't that a great
opportunity to change the rules? Wasn't resilver time vs availability
a major factor in Adam Leventhal's paper introducing the need for
RAIDZ3?
No, it wasn't.
Maybe we weren't reading the same paper?
From http://dtrace.org/blogs/ahl/2009/12/21/acm_triple_parity_raid (a
pointer to Adam's ACM article)
The need for triple-parity RAID
...
The time to populate a drive is directly relevant for RAID rebuild. As
disks in RAID systems take longer to reconstruct, the reliability of
the total system decreases due to increased periods running in a
degraded state. Today that can be four hours or longer; that could
easily grow to days or weeks.
From http://queue.acm.org/detail.cfm?id=1670144 (Adam's ACM article)
While bit error rates have nearly kept pace with the growth in disk
capacity, throughput has not been given its due consideration when
determining RAID reliability.
Whilst Adam does discuss the lack of progress in bit error rates, his
focus (in the article, and in his pointer to it) seems to be on drive
capacity vs data rates, how this impacts recovery times, and the
consequential need to protect against multiple overlapping failures.
There are two failure modes we can model given the data
provided by disk vendors:
1. failures by time (MTBF)
2. failures by bits read (UER)
Over time, the MTBF has improved, but the failures by bits read has not
improved. Just a few years ago enterprise class HDDs had an MTBF
of around 1 million hours. Today, they are in the range of 1.6 million
hours. Just looking at the size of the numbers, the probability that a
drive will fail in any given hour is on the order of 1e-6.
By contrast, the failure rate by bits read has not improved much.
Consumer class HDDs are usually spec'ed at 1 error per 1e14
bits read. To put this in perspective, a 2TB disk has around 1.6e13
bits. So the probability of at least one unrecoverable read when
reading every bit of a 2TB drive is already above 10%, and growing
with drive capacity. Some of the better enterprise class
HDDs are rated two orders of magnitude better, but the only way to get
much better is to use more bits for ECC... hence the move towards
4KB sectors.
In other words, the probability of losing data while reading it back
can be larger than the probability of losing a drive in the next
year. That is the case for triple-parity RAID.
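A quick back-of-envelope check of those numbers (the rates are the
ones quoted above; the arithmetic is mine):

# Back-of-envelope check of the failure-rate numbers quoted above.
import math

MTBF_HOURS = 1.6e6                    # enterprise-class MTBF quoted above
p_fail_per_hour = 1 / MTBF_HOURS      # ~6e-7, i.e. on the order of 1e-6

bits_2tb = 2e12 * 8                   # a 2TB drive holds ~1.6e13 bits
for label, uer in [("consumer, 1 error per 1e14 bits", 1e-14),
                   ("enterprise, 1 error per 1e16 bits", 1e-16)]:
    # P(at least one unrecoverable read when reading every bit once)
    p_ure = -math.expm1(bits_2tb * math.log1p(-uer))
    print(f"{label}: P(URE over a full read) = {p_ure:.1%}")   # ~14.8% vs ~0.2%

print(f"P(drive failure in any given hour) = {p_fail_per_hour:.1e}")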
MTBF as quoted by HDD vendors has become pretty meaningless. [nit: when
a disk fails, it is not considered "repairable", so a better metric is
MTTF (because there are no repairable failures)]
1.6 million hours equates to about 180 years, so why do HDD vendors
guarantee their drives for considerably less (typically 3-5 years)? It's
because they base the figure on a constant failure rate expected during
the normal useful life of the drive (typically 5 years).
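For what it's worth, the arithmetic behind that (using the
constant-failure-rate model the quoted MTBF assumes):

# Arithmetic behind the MTBF-vs-warranty point above, assuming the
# constant failure rate that the quoted MTBF figure is based on.
import math

MTBF_HOURS = 1.6e6
HOURS_PER_YEAR = 8766                 # 365.25 days

print(MTBF_HOURS / HOURS_PER_YEAR)    # ~182 "years" -- clearly not a lifetime claim

# Annualised failure rate, and the chance of failing within a typical
# 5-year service life, under the constant-rate (exponential) model:
afr = 1 - math.exp(-HOURS_PER_YEAR / MTBF_HOURS)
p_fail_5yr = 1 - math.exp(-5 * HOURS_PER_YEAR / MTBF_HOURS)
print(f"AFR = {afr:.2%}, P(fail within 5 years) = {p_fail_5yr:.1%}")   # ~0.55%, ~2.7%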
However, quoting from http://www.asknumbers.com/WhatisReliability.aspx
Field failures do not generally occur at a uniform rate, but follow a
distribution in time commonly described as a "bathtub curve." The life
of a device can be divided into three regions: Infant Mortality
Period, where the failure rate progressively improves; Useful Life
Period, where the failure rate remains constant; and Wearout Period,
where failure rates begin to increase.
Crucially, the vendor's quoted MTBF figures do not take into account
"infant mortality" or early "wearout". Until every HDD is fitted with an
environmental tell-tale device for shock, vibration, temperature,
pressure, humidity, etc we can't even come close to predicting either
factor.
And this is just the HDD itself. In a system there are many ways to lose
access to an HDD. So I'm exposed when I lose the first drive in a RAIDZ1
(second drive in a RAIDZ2, or third drive in a RAIDZ3). And the longer
the resilver takes, the longer I'm exposed.
Add to the mix that Indy 500 drives can degrade to tricycle performance
before they fail utterly, and yes, low performing drives can still be an
issue, even for the elite.
The appropriateness or otherwise of resilver throttling depends on
the context. If I can tolerate further failures without data loss
(e.g. RAIDZ2 with one failed device, or RAIDZ3 with two failed
devices), or if I can recover business critical data in a timely
manner, then great. But there may come a point where I would rather
take a short term performance hit to close the window on total data loss.
I agree. Back in the bad old days, we were stuck with silly throttles
on SVM (10 IOPs, IIRC). The current ZFS throttle (b142, IIRC) is dependent
on the competing, non-scrub I/O. This works because, in ZFS, not all I/O
is created equal, unlike layered RAID implementations such as SVM or
RAID arrays. ZFS schedules the regular workload at a higher priority than
scrubs or resilvers. Add the new throttles and the scheduler is even more
effective. So you get your interactive performance at the cost of longer
resilver times. This is probably a good trade-off for most folks.
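As a toy illustration of that idea (this is not the actual zio
scheduler; the queue names, delay and "recently busy" window below are
all invented):

# Toy single-disk dispatcher (not ZFS's zio pipeline) illustrating the
# idea: regular I/O is always dispatched first, and resilver I/O backs
# off whenever competing non-scrub I/O has been seen recently.
import collections
import time

class ToyScheduler:
    RESILVER_DELAY = 0.01    # seconds to back off when the pool is busy (assumed)
    BUSY_WINDOW = 0.1        # "recently busy" = regular I/O in the last 100ms (assumed)

    def __init__(self):
        self.regular = collections.deque()
        self.resilver = collections.deque()
        self.last_regular_io = 0.0

    def submit(self, op, is_resilver=False):
        (self.resilver if is_resilver else self.regular).append(op)

    def dispatch_one(self):
        if self.regular:                           # regular workload wins outright
            self.last_regular_io = time.monotonic()
            return self.regular.popleft()
        if self.resilver:
            if time.monotonic() - self.last_regular_io < self.BUSY_WINDOW:
                time.sleep(self.RESILVER_DELAY)    # throttle while the pool is busy
            return self.resilver.popleft()
        return None

The resilver only soaks up otherwise idle time, which is exactly the
"better interactive latency, longer resilver" trade-off described above.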
The problem exists for mirrors too, but is not as marked because
mirror reconstruction is inherently simpler.
Resilver time is bounded by the random write performance of
the resilvering device. Mirroring or raidz make no difference.
This only holds in a quiesced system.
The effect will be worse for a mirror because you have direct
competition for the single, surviving HDD. For raidz*, we clearly
see the read workload spread out across the surviving disks at
approximately the 1/N ratio. In other words, if you have a 4+1 raidz,
then a resilver will keep the resilvering disk 100% busy writing, and
the data disks approximately 25% busy reading. Later releases of
ZFS will also prefetch the reads and the writes can be coalesced,
skewing the ratio a little bit, but the general case seems to be a
reasonable starting point.
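Putting rough numbers on that (the block size and IOPS below are
assumptions for illustration, not measurements):

# Illustrative numbers only: in a 4+1 raidz, the rebuilt disk's random
# write rate is the bottleneck and each surviving data disk carries
# roughly 1/4 of that load as reads. Block size and IOPS are assumed.
DATA_DISKS = 4
ALLOCATED_TB = 2.0
AVG_BLOCK_KB = 128              # assumed average block size on disk
RANDOM_WRITE_IOPS = 150         # assumed 7200rpm-class drive

blocks = ALLOCATED_TB * 1e9 / AVG_BLOCK_KB        # ~1.6e7 blocks to rewrite
hours = blocks / RANDOM_WRITE_IOPS / 3600
print(f"resilver bound by random writes: ~{hours:.0f} hours")
print(f"read load per surviving data disk: ~{1 / DATA_DISKS:.0%} of the rebuild rate")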
Mirrored systems need more drives to achieve the same capacity, and
mirrored volumes are generally striped by some means, so the equivalent
of your 4+1 RAIDZ1 is actually a 4+4. In such a configuration,
resilvering one drive at 100% would also result in a mean hit of 25%.
Obviously, a drive running at 100% has nothing more to give, so for fun
let's throttle the resilver to 25x1MB sequential reads per second (which
is about 25% of a good drive's sequential throughput). At this rate, a
2TB drive will resilver in under 24 hours, so let's make that the upper
bound.
It is highly desirable to throttle the resilver and regular I/O rates
according to required performance and availability metrics, so something
better than 24 hours should be the norm.
It should also be possible for the system to report an ETA based on
current and historic workload statistics. "You may say I'm a dreamer..."
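For instance, the arithmetic behind the 24-hour figure, plus a minimal
sketch of the kind of ETA reporting I have in mind (the moving-average
window is an arbitrary assumption):

# The 24-hour bound, and a minimal progress/ETA sketch based on a
# moving average of observed resilver throughput.
from collections import deque

DRIVE_TB = 2.0
THROTTLE_MB_S = 25                             # 25 x 1MB sequential reads/sec
print(DRIVE_TB * 1e6 / THROTTLE_MB_S / 3600)   # ~22.2 hours, i.e. under a day

class ResilverEta:
    def __init__(self, total_bytes, window=60):
        self.total = total_bytes
        self.done = 0
        self.samples = deque(maxlen=window)    # recent (seconds, bytes) samples

    def update(self, bytes_copied, seconds):
        self.done += bytes_copied
        self.samples.append((seconds, bytes_copied))

    def eta_hours(self):
        secs = sum(s for s, _ in self.samples)
        byts = sum(b for _, b in self.samples)
        if byts == 0 or secs == 0:
            return float("inf")
        return (self.total - self.done) / (byts / secs) / 3600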
For mirrored vdevs, ZFS could resilver using an efficient block level
copy, whilst keeping a record of progress, and considering copied blocks
as already mirrored and ready to be read and updated by normal activity.
Obviously, it's much harder to apply this approach for RAIDZ.
Since slabs are allocated sequentially, it should also be possible to
set a high water mark for the bulk copy, so that fresh pools with little
or no data could also be resilvered in minutes or seconds.
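A minimal sketch of what I mean (hypothetical device interfaces, and
certainly not how ZFS resilvers today):

# Minimal sketch of a cursor-based mirror rebuild: copy blocks
# sequentially up to the allocation high-water mark; blocks below the
# cursor are treated as fully mirrored, blocks above it are picked up
# when the copy reaches them. Device read/write interfaces are hypothetical.
class MirrorRebuild:
    def __init__(self, source_dev, target_dev, block_size, high_water_block):
        self.src = source_dev
        self.dst = target_dev
        self.bs = block_size
        self.cursor = 0                      # blocks below this are already copied
        self.high_water = high_water_block   # nothing allocated beyond this block

    def step(self):
        """Copy one block sequentially; returns False once the rebuild is done."""
        if self.cursor >= self.high_water:
            return False
        data = self.src.read(self.cursor * self.bs, self.bs)
        self.dst.write(self.cursor * self.bs, self.bs, data)
        self.cursor += 1
        return True

    def write(self, block, data):
        """A regular pool write arriving while the rebuild is in progress."""
        self.src.write(block * self.bs, self.bs, data)
        if block < self.cursor:
            # The already-copied region must be kept in sync on both sides.
            self.dst.write(block * self.bs, self.bs, data)
        # Blocks at or above the cursor are picked up when the copy reaches them.
        self.high_water = max(self.high_water, block + 1)

    def read(self, block):
        # Reads below the cursor can be served by either side of the mirror.
        dev = self.dst if block < self.cursor else self.src
        return dev.read(block * self.bs, self.bs)

On a freshly created pool the high-water mark is near zero, so the
rebuild completes in seconds, which is the second point above.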
I believe such an approach would benefit all ZFS users, not just the elite.
-- richard
Phil
p.s. just for the record, Nexenta's Hardware Supported List (HSL) is an
excellent resource for those wanting to build NAS appliances that
actually work...
http://www.nexenta.com/corp/supported-hardware/hardware-supported-list
... which includes Hitachi Ultrastar A7K2000 SATA 7200rpm HDDs
(enterprise class drives at near consumer prices)