On 21/12/2010 05:44, Richard Elling wrote:
On Dec 20, 2010, at 7:31 AM, Phil Harman <phil.har...@gmail.com> wrote:
On 20/12/2010 13:59, Richard Elling wrote:
On Dec 20, 2010, at 2:42 AM, Phil Harman <phil.har...@gmail.com> wrote:
Why does resilvering take so long in raidz anyway?
Because it's broken. There were some changes a while back that made it more broken.
"broken" is the wrong term here. It functions as designed and correctly
resilvers devices. Disagreeing with the design is quite different than
proving a defect.
It might be the wrong term in general, but I think it does apply in the budget home media server context of this thread.
If you only have a few slow drives, you don't have performance.
Like trying to win the Indianapolis 500 with a tricycle...

The context of this thread is a budget home media server (certainly not the Indy 500, but perhaps not as humble as tricycle touring either). And whilst it is a habit of the hardware advocate to blame the software ... and vice versa ... it's not much help to those of us trying to build "good enough" systems across the performance and availability spectrum.

I think we can agree that ZFS currently doesn't play well on cheap disks. I think we can also agree that the performance of ZFS resilvering is known to be suboptimal under certain conditions.
... and those conditions are also a strength. For example, most file
systems are nowhere near full. With ZFS you only resilver data. For those
who recall the resilver throttles in SVM or VXVM, you will appreciate not
having to resilver non-data.

I'd love to see the data and analysis for the assertion that "most file systems are nowhere near full", discounting, of course, any trivial cases. In my experience, in any cost-conscious scenario, in the home or the enterprise, the expectation is that I'll get to use the majority of the space I've paid for (generally "through the nose" from the storage silo team in the enterprise scenario). To borrow your illustration, even Indy 500 teams care about fuel consumption.

What I don't appreciate is having to resilver significantly more data than the drive can contain. But when it comes to the crunch, what I'd really appreciate was a bounded resilver time measured in hours not days or weeks.

For a long time at Sun, the rule was "correctness is a constraint, performance is a goal". However, in the real world, performance is often also a constraint (just as a quick but erroneous answer is a wrong answer, so a slow but correct answer can also be "wrong").

Then one brave soul at Sun once ventured that "if Linux is faster, it's a Solaris bug!" and to his surprise, the idea caught on. I later went on to tell people that ZFS delivered RAID "where I = inexpensive", so I'm just a little frustrated when that promise becomes less respected over time. First it was USB drives (which I agreed with), now it's SATA (and I'm not so sure).
"slow" doesn't begin with an "i" :-)

Both ZFS and RAID promised to play in the inexpensive space.

There has been a lot of discussion, anecdotes and some data on this list.
"slow because I use devices with poor random write(!) performance"
is very different than "broken."
Again, context is everything. For example, if someone was building a business critical NAS appliance from consumer grade parts, I'd be the first to say "are you nuts?!"
Unfortunately, the math does not support your position...

Actually, the math (e.g. raw drive metrics) doesn't lead me to expect such a disparity.

The resilver doesn't do a single pass of the drives, but uses a "smarter" temporal algorithm based on metadata.
A design that only does a single pass does not handle the temporal
changes. Many RAID implementations use a mix of spatial and temporal
resilvering and suffer with that design decision.
Actually, it's easy to see how a combined spatial and temporal approach could be implemented to an advantage for mirrored vdevs.
However, the current implementation has difficulty finishing the job if there's a steady flow of updates to the pool.
Please define current. There are many releases of ZFS, and
many improvements have been made over time. What has not
improved is the random write performance of consumer-grade
HDDs.
I was led to believe this was not yet fixed in Solaris 11, and that there are therefore doubts about what Solaris 10 update may see the fix, if any.
As far as I'm aware, the only way to get bounded resilver times is to stop the workload until resilvering is completed.
I know of no RAID implementation that bounds resilver times
for HDDs. I believe it is not possible. OTOH, whether a resilver
takes 10 seconds or 10 hours makes little difference in data
availability. Indeed, this is why we often throttle resilvering
activity. See previous discussions on this forum regarding the
dueling RFEs.
I don't share your disbelief or your "little difference" analysis. If it is true that no current implementation succeeds, isn't that a great opportunity to change the rules? Wasn't resilver time vs availability a major factor in Adam Leventhal's paper introducing the need for RAIDZ3?

No, it wasn't.

Maybe we weren't reading the same paper?

From http://dtrace.org/blogs/ahl/2009/12/21/acm_triple_parity_raid (a pointer to Adam's ACM article)
The need for triple-parity RAID
...
The time to populate a drive is directly relevant for RAID rebuild. As disks in RAID systems take longer to reconstruct, the reliability of the total system decreases due to increased periods running in a degraded state. Today that can be four hours or longer; that could easily grow to days or weeks.

From http://queue.acm.org/detail.cfm?id=1670144 (Adam's ACM article)
While bit error rates have nearly kept pace with the growth in disk capacity, throughput has not been given its due consideration when determining RAID reliability.

Whilst Adam does discuss the lack of progress in bit error rates, his focus (in the article, and in his pointer to it) seems to be on drive capacity vs data rates, how this impacts recovery times, and the consequent need to protect against multiple overlapping failures.

There are two failure modes we can model given the data
provided by disk vendors:
1. failures by time (MTBF)
2. failures by bits read (UER)

Over time, the MTBF has improved, but the failure rate by bits read has not
improved. Just a few years ago enterprise class HDDs had an MTBF
of around 1 million hours. Today, they are in the range of 1.6 million
hours. Just looking at the size of the numbers, the probability that a
drive will fail in one hour is on the order of 10^-6.
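
A quick back-of-envelope check of that number, in Python (assuming a constant failure rate, which is all a bare MTBF figure implies):

    # Per-hour failure probability implied by a quoted MTBF, assuming a
    # constant (exponential) failure rate over the drive's useful life.
    mtbf_hours = 1.6e6                   # vendor-quoted MTBF
    p_fail_per_hour = 1.0 / mtbf_hours
    print(p_fail_per_hour)               # ~6.3e-7, i.e. on the order of 10^-6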

By contrast, the failure rate by bits read has not improved much.
Consumer class HDDs are usually spec'ed at 1 error per 1e14
bits read.  To put this in perspective, a 2TB disk has around 1.6e13
bits, so the probability of hitting an unrecoverable read error while
reading every bit of a 2TB drive is well above 10%. Some of the better enterprise class
HDDs are rated two orders of magnitude better, but the only way to get
much better is to use more bits for ECC... hence the move towards
4KB sectors.
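
The arithmetic behind that, sketched in Python (assuming errors are independent per bit read):

    # Probability of at least one unrecoverable read error (URE) when
    # reading an entire 2TB drive, given a spec of 1 error per 1e14 bits.
    bits_read = 2e12 * 8                      # 2TB is about 1.6e13 bits
    uer = 1.0 / 1e14                          # unrecoverable error rate per bit
    p_ure = 1.0 - (1.0 - uer) ** bits_read
    print(p_ure)                              # ~0.15, i.e. well above 10%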

In other words, the probability of losing data in the course of reading data (e.g. while
reconstructing a failed drive) can be larger than the probability of losing data to another drive failure next year. This is the case made for triple-parity RAID.

MTBF as quoted by HDD vendors has become pretty meaningless. [nit: when a disk fails, it is not considered "repairable", so a better metric is MTTF (because there are no repairable failures)]

1.6 million hours equates to about 180 years, so why do HDD vendors guarantee their drives for considerably less (typically 3-5 years)? It's because they base the figure on a constant failure rate expected during the normal useful life of the drive (typically 5 years).
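
The arithmetic (purely illustrative):

    # A 1.6 million hour MTBF expressed in years, and what a constant
    # failure rate implies over a 5-year useful life.
    mtbf_hours = 1.6e6
    hours_per_year = 24 * 365.25
    print(mtbf_hours / hours_per_year)        # ~182 years
    print(5 * hours_per_year / mtbf_hours)    # ~2.7% chance of failing in 5 years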

However, quoting from http://www.asknumbers.com/WhatisReliability.aspx
Field failures do not generally occur at a uniform rate, but follow a distribution in time commonly described as a "bathtub curve." The life of a device can be divided into three regions: Infant Mortality Period, where the failure rate progressively improves; Useful Life Period, where the failure rate remains constant; and Wearout Period, where failure rates begin to increase.

Crucially, the vendor's quoted MTBF figures do not take into account "infant mortality" or early "wearout". Until every HDD is fitted with an environmental tell-tale device for shock, vibration, temperature, pressure, humidity, etc., we can't even come close to predicting either factor.

And this is just the HDD itself. In a system there are many ways to lose access to an HDD. So I'm exposed when I lose the first drive in a RAIDZ1 (second drive in a RAIDZ2, or third drive in a RAIDZ3). And the longer the resilver takes, the longer I'm exposed.

Add to the mix that Indy 500 drives can degrade to tricycle performance before they fail utterly, and yes, low performing drives can still be an issue, even for the elite.

The appropriateness or otherwise of resilver throttling depends on the context. If I can tolerate further failures without data loss (e.g. RAIDZ2 with one failed device, or RAIDZ3 with two failed devices), or if I can recover business critical data in a timely manner, then great. But there may come a point where I would rather take a short term performance hit to close the window on total data loss.

I agree. Back in the bad old days, we were stuck with silly throttles
on SVM (10 IOPs, IIRC). The current ZFS throttle (b142, IIRC) is dependent
on the competing, non-scrub I/O. This works because in ZFS not all I/O is
created equal, unlike the layered RAID implementations such as SVM or
RAID arrays. ZFS schedules the regular workload at a higher priority than
scrubs or resilvers. Add the new throttles and the scheduler is even more
effective. So you get your interactive performance at the cost of longer
resilver times. This is probably a good trade-off for most folks.
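
The principle can be illustrated with a trivial sketch (just the idea of priority-ordered I/O issue, not ZFS's actual scheduler):

    import heapq

    # Lower number = higher priority; regular workload outranks resilver I/O.
    PRIO_REGULAR, PRIO_RESILVER = 0, 10

    queue = []

    def submit(prio, io):
        heapq.heappush(queue, (prio, io))

    def issue_next():
        # Resilver I/O is only serviced once no higher-priority I/O is queued.
        return heapq.heappop(queue)[1] if queue else None

    submit(PRIO_RESILVER, "resilver block 1234")
    submit(PRIO_REGULAR, "read /tank/media/film.mkv")   # hypothetical workload
    print(issue_next())   # the regular read is issued first
    print(issue_next())   # resilver work proceeds when nothing else is waiting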

The problem exists for mirrors too, but is not as marked because mirror reconstruction is inherently simpler.

Resilver time is bounded by the random write performance of
the resilvering device. Mirroring or raidz make no difference.

This only holds in a quiesced system.

The effect will be worse for a mirror because you have direct
competition for the single, surviving HDD. For raidz*, we clearly
see the read workload spread out across the surviving disks at
approximately the 1/N ratio. In other words, if you have a 4+1 raidz,
then a resilver will keep the resilvering disk 100% busy writing, and
the data disks approximately 25% busy reading. Later releases of
ZFS will also prefetch the reads and the writes can be coalesced,
skewing the ratio a little bit, but the general case seems to be a
reasonable starting point.
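
A toy model of that ratio (just restating the 4+1 example, assuming reads spread evenly across the surviving data disks):

    # Approximate read load on each surviving data disk during a raidz
    # resilver, relative to the resilvering disk being 100% busy writing.
    def surviving_disk_load(n_data_disks, resilver_busy=1.0):
        return resilver_busy / n_data_disks

    print(surviving_disk_load(4))   # 0.25, i.e. ~25% busy for a 4+1 raidz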

Mirrored systems need more drives to achieve the same usable capacity, and mirrored volumes are generally striped by some means, so the equivalent of your 4+1 RAIDZ1 is actually a 4+4. In such a configuration resilvering one drive at 100% would also result in a mean hit of 25%.

Obviously, a drive running at 100% has nothing more to give, so for fun let's throttle the resilver to 25x1MB sequential reads per second (which is about 25% of a good drive's sequential throughput). At this rate, a 2TB drive will resilver in under 24 hours, so let's make that the upper bound.
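
The arithmetic for that bound (assuming the 25MB/s rate is actually sustained):

    # Time to copy a whole 2TB drive at a sustained 25 MB/s resilver rate.
    drive_bytes = 2e12
    resilver_rate = 25e6                        # 25 x 1MB reads per second
    print(drive_bytes / resilver_rate / 3600)   # ~22 hours, i.e. under a day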

It is highly desirable to throttle the resilver and regular I/O rates according to required performance and availability metrics, so something better than 24 hours should be the norm.

It should also be possible for the system to report an ETA based on current and historic workload statistics. "You may say I'm a dreamer..."
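
Even a crude estimate would do, e.g. remaining work divided by the resilver rate observed over a recent window (a hypothetical sketch, not something ZFS reports today):

    # Hypothetical ETA: bytes still to resilver / recent average resilver rate.
    def resilver_eta_hours(bytes_total, bytes_done, recent_rate_bytes_per_s):
        return (bytes_total - bytes_done) / recent_rate_bytes_per_s / 3600.0

    # e.g. 2TB drive, 600GB already resilvered, currently running at 40 MB/s
    print(resilver_eta_hours(2e12, 6e11, 40e6))   # ~9.7 hours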

For mirrored vdevs, ZFS could resilver using an efficient block level copy, whilst keeping a record of progress, and considering copied blocks as already mirrored and ready to be read and updated by normal activity. Obviously, it's much harder to apply this approach for RAIDZ.

Since slabs are allocated sequentially, it should also be possible to set a high water mark for the bulk copy, so that fresh pools with little or no data could also be resilvered in minutes or seconds.
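
A minimal sketch of the idea (hypothetical, and glossing over all the locking and failure handling a real implementation would need): copy blocks sequentially up to a high water mark, persist a cursor, and treat everything below the cursor as fully mirrored.

    # Hypothetical cursor-based mirror resilver (illustration only).
    # Blocks below 'cursor' have been copied and are treated as mirrored;
    # new writes below the cursor would go to both sides, writes above it
    # only to the source, to be picked up as the cursor advances.
    def resilver_mirror(source, target, high_water_mark, block_size=1 << 20):
        cursor = 0
        while cursor < high_water_mark:
            end = min(cursor + block_size, high_water_mark)
            target[cursor:end] = source[cursor:end]
            cursor = end                 # a real system would persist this
        return cursor

    disk_a = bytearray(8 << 20)          # toy 8MB "disks"
    disk_b = bytearray(8 << 20)
    resilver_mirror(disk_a, disk_b, high_water_mark=4 << 20)   # copy only used slabs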

I believe such an approach would benefit all ZFS users, not just the elite.

 -- richard

Phil

p.s. just for the record, Nexenta's Hardware Supported List (HSL) is an excellent resource for those wanting to build NAS appliances that actually work...

   http://www.nexenta.com/corp/supported-hardware/hardware-supported-list

... which includes Hitachi Ultrastar A7K2000 SATA 7200rpm HDDs (enterprise class drives at near consumer prices)
_______________________________________________
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
