On 21/12/2010 05:44, Richard Elling wrote:
On Dec 20, 2010, at 7:31 AM, Phil Harman <phil.har...@gmail.com> wrote:
On 20/12/2010 13:59, Richard Elling wrote:
On Dec 20, 2010, at 2:42 AM, Phil Harman <phil.har...@gmail.com> wrote:
Why does resilvering take so long in raidz anyway?
Because it's broken. There were some changes a while back that made
it more broken.
"broken" is the wrong term here. It functions as designed and correctly
resilvers devices. Disagreeing with the design is quite different than
proving a defect.
It might be the wrong term in general, but I think it does apply in
the budget home media server context of this thread.
If you only have a few slow drives, you don't have performance.
Like trying to win the Indianapolis 500 with a tricycle...
The context of this thread is a budget home media server (certainly not
the Indy 500, but perhaps not as humble as tricycle touring either). And
whilst it is a habit of the hardware advocate to blame the software ...
and vice versa ... that's not much help to those of us trying to build
"good enough" systems across the performance and availability spectrum.
I think we can agree that ZFS currently doesn't play well on cheap
disks. I think we can also agree that the performance of ZFS
resilvering is known to be suboptimal under certain conditions.
... and those conditions are also a strength. For example, most file
systems are nowhere near full. With ZFS you only resilver data. For those
who recall the resilver throttles in SVM or VXVM, you will appreciate not
having to resilver non-data.
I'd love to see the data and analysis for the assertion that "most file
systems are nowhere near full", discounting, of course, any trivial
cases. In my experience, in any cost conscious scenario, in the home or
the enterprise, the expectation is that I'll get to use the majority of
the space I've paid for (generally "through the nose" from the storage
silo team in the enterprise scenario). To borrow your illustration, even
Indy 500 teams care about fuel consumption.
What I don't appreciate is having to resilver significantly more data
than the drive can contain. But when it comes to the crunch, what I'd
really appreciate is a bounded resilver time measured in hours, not days
or weeks.
For a long time at Sun, the rule was "correctness is a constraint,
performance is a goal". However, in the real world, performance is
often also a constraint (just as a quick but erroneous answer is a
wrong answer, so also, a slow but correct answer can also be "wrong").
Then one brave soul at Sun ventured that "if Linux is faster,
it's a Solaris bug!" and to his surprise, the idea caught on. I later
went on to tell people that ZFS delivered RAID "where I =
inexpensive", so I'm just a little frustrated when that promise
becomes less respected over time. First it was USB drives (which I
agreed with), now it's SATA (and I'm not so sure).
"slow" doesn't begin with an "i" :-)
Both ZFS and RAID promised to play in the inexpensive space.
There has been a lot of discussion, anecdotes and some data on this
list.
"slow because I use devices with poor random write(!) performance"
is very different than "broken."
Again, context is everything. For example, if someone was building a
business critical NAS appliance from consumer grade parts, I'd be the
first to say "are you nuts?!"
Unfortunately, the math does not support your position...
Actually, the math (e.g. raw drive metrics) doesn't lead me to expect
such a disparity.
The resilver doesn't do a single pass of the drives, but uses a
"smarter" temporal algorithm based on metadata.
A design that only does a single pass does not handle the temporal
changes. Many RAID implementations use a mix of spatial and temporal
resilvering and suffer with that design decision.
Actually, it's easy to see how a combined spatial and temporal
approach could be implemented to advantage for mirrored vdevs.
However, the current implementation has difficulty finishing the job
if there's a steady flow of updates to the pool.
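To make the trade-off concrete, here's a toy sketch (not the ZFS code;
the block counts and random layout are invented) of why a temporal,
metadata-ordered walk turns into seek-heavy I/O on the disks, whereas a
plain spatial pass is sequential but has to touch non-data too:

# Toy illustration (not the ZFS code): a temporal (metadata-order)
# resilver produces scattered I/O, while a spatial (LBA-order) pass is
# sequential. Block counts and the random txg layout are assumptions.
import random

random.seed(0)
BLOCKS = 10000

# Each allocated block has a birth transaction group (txg) and an
# on-disk offset (LBA). Over the life of a pool the two orders diverge
# as blocks are freed and reallocated.
blocks = [{"txg": random.randint(1, 500), "lba": lba} for lba in range(BLOCKS)]

def seeks(order):
    """Count non-contiguous LBA transitions, a rough proxy for head seeks."""
    lbas = [b["lba"] for b in order]
    return sum(1 for a, b in zip(lbas, lbas[1:]) if b != a + 1)

spatial = sorted(blocks, key=lambda b: b["lba"])               # whole-device pass
temporal = sorted(blocks, key=lambda b: (b["txg"], b["lba"]))  # walk by birth time

print("spatial pass seeks: ", seeks(spatial))    # ~0: purely sequential
print("temporal pass seeks:", seeks(temporal))   # almost every transition is a seek

The upside of the temporal walk is that it only ever visits allocated
blocks; the downside is the random I/O pattern this thread is
complaining about.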
Please define current. There are many releases of ZFS, and
many improvements have been made over time. What has not
improved is the random write performance of consumer-grade
HDDs.
I was led to believe this was not yet fixed in Solaris 11, and that
there are therefore doubts about what Solaris 10 update may see the
fix, if any.
As far as I'm aware, the only way to get bounded resilver times is
to stop the workload until resilvering is completed.
I know of no RAID implementation that bounds resilver times
for HDDs. I believe it is not possible. OTOH, whether a resilver
takes 10 seconds or 10 hours makes little difference in data
availability. Indeed, this is why we often throttle resilvering
activity. See previous discussions on this forum regarding the
dueling RFEs.
I don't share your disbelief or "little difference" analysis. If it
is true that no current implementation succeeds, isn't that a great
opportunity to change the rules? Wasn't resilver time vs availability
a major factor in Adam Leventhal's paper introducing the need for
RAIDZ3?
No, it wasn't.
Maybe we weren't reading the same paper?
From http://dtrace.org/blogs/ahl/2009/12/21/acm_triple_parity_raid (a
pointer to Adam's ACM article)
The need for triple-parity RAID
...
The time to populate a drive is directly relevant for RAID rebuild. As
disks in RAID systems take longer to reconstruct, the reliability of
the total system decreases due to increased periods running in a
degraded state. Today that can be four hours or longer; that could
easily grow to days or weeks.
From http://queue.acm.org/detail.cfm?id=1670144 (Adam's ACM article)
While bit error rates have nearly kept pace with the growth in disk
capacity, throughput has not been given its due consideration when
determining RAID reliability.
Whilst Adam does discuss the lack of progress in bit error rates, his
focus (in the article, and in his pointer to it) seems to be on drive
capacity vs data rates, how this impacts recovery times, and the
consequential need to protect against multiple overlapping failures.
There are two failure modes we can model given the data
provided by disk vendors:
1. failures by time (MTBF)
2. failures by bits read (UER)
Over time, the MTBF has improved, but the failures by bits read has not
improved. Just a few years ago enterprise class HDDs had an MTBF
of around 1 million hours. Today, they are in the range of 1.6 million
hours. Just looking at the size of the numbers, the probability that a
drive will fail in any given hour is on the order of 1e-6.
By contrast, the failure rate by bits read has not improved much.
Consumer class HDDs are usually spec'ed at 1 error per 1e14
bits read. To put this in perspective, a 2TB disk has around 1.6e13
bits. So the probability of at least one unrecoverable read when
reading every bit of a 2TB drive is already above 10%, and growing
with drive capacity. Some of the better enterprise class
HDDs are rated two orders of magnitude better, but the only way to get
much better is to use more bits for ECC... hence the move towards
4KB sectors.
In other words, the probability of losing data while reading it back
can be larger than the probability of losing a drive in the next
year. That is the case for triple-parity RAID.
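A quick back-of-envelope check of those numbers (the rates are the
ones quoted above; the arithmetic is mine):

# Back-of-envelope check of the failure-rate numbers quoted above.
import math

MTBF_HOURS = 1.6e6                    # enterprise-class MTBF quoted above
p_fail_per_hour = 1 / MTBF_HOURS      # ~6e-7, i.e. on the order of 1e-6

bits_2tb = 2e12 * 8                   # a 2TB drive holds ~1.6e13 bits
for label, uer in [("consumer, 1 error per 1e14 bits", 1e-14),
                   ("enterprise, 1 error per 1e16 bits", 1e-16)]:
    # P(at least one unrecoverable read when reading every bit once)
    p_ure = -math.expm1(bits_2tb * math.log1p(-uer))
    print(f"{label}: P(URE over a full read) = {p_ure:.1%}")   # ~14.8% vs ~0.2%

print(f"P(drive failure in any given hour) = {p_fail_per_hour:.1e}")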
MTBF as quoted by HDD vendors has become pretty meaningless. [nit: when
a disk fails, it is not considered "repairable", so a better metric is
MTTF (because there are no repairable failures)]
1.6 million hours equates to about 180 years, so why do HDD vendors
guarantee their drives for considerably less (typically 3-5 years)? It's
because they base the figure on a constant failure rate expected during
the normal useful life of the drive (typically 5 years).
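For what it's worth, the arithmetic behind that (using the
constant-failure-rate model the quoted MTBF assumes):

# Arithmetic behind the MTBF-vs-warranty point above, assuming the
# constant failure rate that the quoted MTBF figure is based on.
import math

MTBF_HOURS = 1.6e6
HOURS_PER_YEAR = 8766                 # 365.25 days

print(MTBF_HOURS / HOURS_PER_YEAR)    # ~182 "years" -- clearly not a lifetime claim

# Annualised failure rate, and the chance of failing within a typical
# 5-year service life, under the constant-rate (exponential) model:
afr = 1 - math.exp(-HOURS_PER_YEAR / MTBF_HOURS)
p_fail_5yr = 1 - math.exp(-5 * HOURS_PER_YEAR / MTBF_HOURS)
print(f"AFR = {afr:.2%}, P(fail within 5 years) = {p_fail_5yr:.1%}")   # ~0.55%, ~2.7%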
However, quoting from http://www.asknumbers.com/WhatisReliability.aspx
Field failures do not generally occur at a uniform rate, but follow a
distribution in time commonly described as a "bathtub curve." The life
of a device can be divided into three regions: Infant Mortality
Period, where the failure rate progressively improves; Useful Life
Period, where the failure rate remains constant; and Wearout Period,
where failure rates begin to increase.
Crucially, the vendor's quoted MTBF figures do not take into account
"infant mortality" or early "wearout". Until every HDD is fitted with an
environmental tell-tale device for shock, vibration, temperature,
pressure, humidity, etc we can't even come close to predicting either
factor.
And this is just the HDD itself. In a system there are many ways to lose
access to an HDD. So I'm exposed when I lose the first drive in a RAIDZ1
(second drive in a RAIDZ2, or third drive in a RAIDZ3). And the longer
the resilver takes, the longer I'm exposed.
Add to the mix that Indy 500 drives can degrade to tricycle performance
before they fail utterly, and yes, low performing drives can still be an
issue, even for the elite.
The appropriateness or otherwise of resilver throttling depends on
the context. If I can tolerate further failures without data loss
(e.g. RAIDZ2 with one failed device, or RAIDZ3 with two failed
devices), or if I can recover business critical data in a timely
manner, then great. But there may come a point where I would rather
take a short term performance hit to close the window on total data loss.
I agree. Back in the bad old days, we were stuck with silly throttles
on SVM (10 IOPs, IIRC). The current ZFS throttle (b142, IIRC) is dependent
on the competing, non-scrub I/O. This works because, in ZFS, not all I/O
is created equal, unlike layered RAID implementations such as SVM or
RAID arrays. ZFS schedules the regular workload at a higher priority than
scrubs or resilvers. Add the new throttles and the scheduler is even more
effective. So you get your interactive performance at the cost of longer
resilver times. This is probably a good trade-off for most folks.
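As a toy illustration of that idea (this is not the actual zio
scheduler; the queue names, delay and "recently busy" window below are
all invented):

# Toy single-disk dispatcher (not ZFS's zio pipeline) illustrating the
# idea: regular I/O is always dispatched first, and resilver I/O backs
# off whenever competing non-scrub I/O has been seen recently.
import collections
import time

class ToyScheduler:
    RESILVER_DELAY = 0.01    # seconds to back off when the pool is busy (assumed)
    BUSY_WINDOW = 0.1        # "recently busy" = regular I/O in the last 100ms (assumed)

    def __init__(self):
        self.regular = collections.deque()
        self.resilver = collections.deque()
        self.last_regular_io = 0.0

    def submit(self, op, is_resilver=False):
        (self.resilver if is_resilver else self.regular).append(op)

    def dispatch_one(self):
        if self.regular:                           # regular workload wins outright
            self.last_regular_io = time.monotonic()
            return self.regular.popleft()
        if self.resilver:
            if time.monotonic() - self.last_regular_io < self.BUSY_WINDOW:
                time.sleep(self.RESILVER_DELAY)    # throttle while the pool is busy
            return self.resilver.popleft()
        return None

The resilver only soaks up otherwise idle time, which is exactly the
"better interactive latency, longer resilver" trade-off described above.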
The problem exists for mirrors too, but is not as marked because
mirror reconstruction is inherently simpler.
Resilver time is bounded by the random write performance of
the resilvering device. Mirroring or raidz make no difference.
This only holds in a quiesced system.
The effect will be worse for a mirror because you have direct
competition for the single, surviving HDD. For raidz*, we clearly
see the read workload spread out across the surviving disks at
approximately the 1/N ratio. In other words, if you have a 4+1 raidz,
then a resilver will keep the resilvering disk 100% busy writing, and
the data disks approximately 25% busy reading. Later releases of
ZFS will also prefetch the reads and the writes can be coalesced,
skewing the ratio a little bit, but the general case seems to be a
reasonable starting point.
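Putting rough numbers on that (the block size and IOPS below are
assumptions for illustration, not measurements):

# Illustrative numbers only: in a 4+1 raidz, the rebuilt disk's random
# write rate is the bottleneck and each surviving data disk carries
# roughly 1/4 of that load as reads. Block size and IOPS are assumed.
DATA_DISKS = 4
ALLOCATED_TB = 2.0
AVG_BLOCK_KB = 128              # assumed average block size on disk
RANDOM_WRITE_IOPS = 150         # assumed 7200rpm-class drive

blocks = ALLOCATED_TB * 1e9 / AVG_BLOCK_KB        # ~1.6e7 blocks to rewrite
hours = blocks / RANDOM_WRITE_IOPS / 3600
print(f"resilver bound by random writes: ~{hours:.0f} hours")
print(f"read load per surviving data disk: ~{1 / DATA_DISKS:.0%} of the rebuild rate")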
Mirrored systems need more drives to achieve the same capacity, and
mirrored volumes are generally striped by some means, so the equivalent
of your 4+1 RAIDZ1 is actually a 4+4. In such a configuration,
resilvering one drive at 100% would also result in a mean hit of 25%.
Obviously, a drive running at 100% has nothing more to give, so for fun
let's throttle the resilver to 25x1MB sequential reads per second (which
is about 25% of a good drive's sequential throughput). At this rate, a
2TB drive will resilver in under 24 hours, so let's make that the upper
bound.
It is highly desirable to throttle the resilver and regular I/O rates
according to required performance and availability metrics, so something
better than 24 hours should be the norm.
It should also be possible for the system to report an ETA based on
current and historic workload statistics. "You may say I'm a dreamer..."
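For instance, the arithmetic behind the 24-hour figure, plus a minimal
sketch of the kind of ETA reporting I have in mind (the moving-average
window is an arbitrary assumption):

# The 24-hour bound, and a minimal progress/ETA sketch based on a
# moving average of observed resilver throughput.
from collections import deque

DRIVE_TB = 2.0
THROTTLE_MB_S = 25                             # 25 x 1MB sequential reads/sec
print(DRIVE_TB * 1e6 / THROTTLE_MB_S / 3600)   # ~22.2 hours, i.e. under a day

class ResilverEta:
    def __init__(self, total_bytes, window=60):
        self.total = total_bytes
        self.done = 0
        self.samples = deque(maxlen=window)    # recent (seconds, bytes) samples

    def update(self, bytes_copied, seconds):
        self.done += bytes_copied
        self.samples.append((seconds, bytes_copied))

    def eta_hours(self):
        secs = sum(s for s, _ in self.samples)
        byts = sum(b for _, b in self.samples)
        if byts == 0 or secs == 0:
            return float("inf")
        return (self.total - self.done) / (byts / secs) / 3600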
For mirrored vdevs, ZFS could resilver using an efficient block level
copy, whilst keeping a record of progress, and considering copied blocks
as already mirrored and ready to be read and updated by normal activity.
Obviously, it's much harder to apply this approach for RAIDZ.
Since slabs are allocated sequentially, it should also be possible to
set a high water mark for the bulk copy, so that fresh pools with little
or no data could also be resilvered in minutes or seconds.
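A minimal sketch of what I mean (hypothetical device interfaces, and
certainly not how ZFS resilvers today):

# Minimal sketch of a cursor-based mirror rebuild: copy blocks
# sequentially up to the allocation high-water mark; blocks below the
# cursor are treated as fully mirrored, blocks above it are picked up
# when the copy reaches them. Device read/write interfaces are hypothetical.
class MirrorRebuild:
    def __init__(self, source_dev, target_dev, block_size, high_water_block):
        self.src = source_dev
        self.dst = target_dev
        self.bs = block_size
        self.cursor = 0                      # blocks below this are already copied
        self.high_water = high_water_block   # nothing allocated beyond this block

    def step(self):
        """Copy one block sequentially; returns False once the rebuild is done."""
        if self.cursor >= self.high_water:
            return False
        data = self.src.read(self.cursor * self.bs, self.bs)
        self.dst.write(self.cursor * self.bs, self.bs, data)
        self.cursor += 1
        return True

    def write(self, block, data):
        """A regular pool write arriving while the rebuild is in progress."""
        self.src.write(block * self.bs, self.bs, data)
        if block < self.cursor:
            # The already-copied region must be kept in sync on both sides.
            self.dst.write(block * self.bs, self.bs, data)
        # Blocks at or above the cursor are picked up when the copy reaches them.
        self.high_water = max(self.high_water, block + 1)

    def read(self, block):
        # Reads below the cursor can be served by either side of the mirror.
        dev = self.dst if block < self.cursor else self.src
        return dev.read(block * self.bs, self.bs)

On a freshly created pool the high-water mark is near zero, so the
rebuild completes in seconds, which is the second point above.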
I believe such an approach would benefit all ZFS users, not just the elite.
-- richard
Phil
p.s. just for the record, Nexenta's Hardware Supported List (HSL) is an
excellent resource for those wanting to build NAS appliances that
actually work...
http://www.nexenta.com/corp/supported-hardware/hardware-supported-list
... which includes Hitachi Ultrastar A7K2000 SATA 7200rpm HDDs
(enterprise class drives at near consumer prices)