On Dec 21, 2010, at 3:48 AM, Phil Harman wrote:
> On 21/12/2010 05:44, Richard Elling wrote:
>> 
>> On Dec 20, 2010, at 7:31 AM, Phil Harman <phil.har...@gmail.com> wrote:
>>> On 20/12/2010 13:59, Richard Elling wrote:
>>>> 
>>>> On Dec 20, 2010, at 2:42 AM, Phil Harman <phil.har...@gmail.com> wrote:
>>>>>> Why does resilvering take so long in raidz anyway?
>>>>> Because it's broken. There were some changes a while back that made it 
>>>>> more broken.
>>>> "broken" is the wrong term here. It functions as designed and correctly 
>>>> resilvers devices. Disagreeing with the design is quite different than
>>>> proving a defect.
>>> It might be the wrong term in general, but I think it does apply in the 
>>> budget home media server context of this thread.
>> If you only have a few slow drives, you don't have performance.
>> Like trying to win the Indianapolis 500 with a tricycle...
> 
> The context of this thread is a budget home media server (certainly not the 
> Indy 500, but perhaps not as humble as tricycle touring either). And whilst 
> it is a habit of the hardware advocate to blame the software ... and vice 
> versa ... it's not much help to those of us trying to build "good enough" 
> systems across the performance and availability spectrum.

It is all in how expectations are set. For the home user, waiting overnight
for a resilver might not impact their daily lives (switch night/day around for
developers :-)

>>> I think we can agree that ZFS currently doesn't play well on cheap disks. I 
>>> think we can also agree that the performance of ZFS resilvering is known to 
>>> be suboptimal under certain conditions.
>> ... and those conditions are also a strength. For example, most file
>> systems are nowhere near full. With ZFS you only resilver data. For those
>> who recall the resilver throttles in SVM or VXVM, you will appreciate not
>> having to resilver non-data.
> 
> I'd love to see the data and analysis for the assertion that "most files 
> systems are nowhere near full", discounting, of course, any trivial cases.

I wish I still had access to that data; since I left Sun, I'd be pleasantly
surprised if anyone keeps up with it any more. But yes, we did track file
system utilization on around 300,000 systems, clearly a statistically
significant sample, for Sun's market anyway. Average space utilization was
well below 50%.

> In my experience, in any cost conscious scenario, in the home or the 
> enterprise, the expectation is that I'll get to use the majority of the space 
> I've paid for (generally "through the nose" from the storage silo team in the 
> enterprise scenario). To borrow your illustration, even Indy 500 teams care 
> about fuel consumption.
> 
> What I don't appreciate is having to resilver significantly more data than 
> the drive can contain. But when it comes to the crunch, what I'd really 
> appreciate is a bounded resilver time measured in hours, not days or weeks.

For those following along, changeset 12296:7cf402a7f374 on May 3, 2010
brought a number of changes to scrubs and resilvers.

>>> For a long time at Sun, the rule was "correctness is a constraint, 
>>> performance is a goal". However, in the real world, performance is often 
>>> also a constraint (just as a quick but erroneous answer is a wrong answer, 
>>> so also, a slow but correct answer can also be "wrong").
>>> 
>>> Then one brave soul at Sun once ventured that "if Linux is faster, it's a 
>>> Solaris bug!" and to his surprise, the idea caught on. I later went on to 
>>> tell people that ZFS delivered RAID "where I = inexpensive", so I'm just 
>>> a little frustrated when that promise becomes less respected over time. 
>>> First it was USB drives (which I agreed with), now it's SATA (and I'm not 
>>> so sure).
>> "slow" doesn't begin with an "i" :-)
> 
> Both ZFS and RAID promised to play in the inexpensive space.

And tricycles are less expensive than Indy cars...

>>>>> There has been a lot of discussion, anecdotes and some data on this list. 
>>>> "slow because I use devices with poor random write(!) performance"
>>>> is very different than "broken."
>>> Again, context is everything. For example, if someone was building a 
>>> business critical NAS appliance from consumer grade parts, I'd be the first 
>>> to say "are you nuts?!"
>> Unfortunately, the math does not support your position...
> 
> Actually, the math (e.g. raw drive metrics) doesn't lead me to expect such a 
> disparity.
> 
>>>>> The resilver doesn't do a single pass of the drives, but uses a "smarter" 
>>>>> temporal algorithm based on metadata.
>>>> A design that only does a single pass does not handle the temporal
>>>> changes. Many RAID implementations use a mix of spatial and temporal
>>>> resilvering and suffer with that design decision.
>>> Actually, it's easy to see how a combined spatial and temporal approach 
>>> could be implemented to an advantage for mirrored vdevs.
>>>>> However, the current implementation has difficulty finishing the job if 
>>>>> there's a steady flow of updates to the pool.
>>>> Please define current. There are many releases of ZFS, and
>>>> many improvements have been made over time. What has not
>>>> improved is the random write performance of consumer-grade
>>>> HDDs.
>>> I was led to believe this was not yet fixed in Solaris 11, and that there 
>>> are therefore doubts about what Solaris 10 update may see the fix, if any.
>>>>> As far as I'm aware, the only way to get bounded resilver times is to 
>>>>> stop the workload until resilvering is completed.
>>>> I know of no RAID implementation that bounds resilver times
>>>> for HDDs. I believe it is not possible. OTOH, whether a resilver
>>>> takes 10 seconds or 10 hours makes little difference in data
>>>> availability. Indeed, this is why we often throttle resilvering
>>>> activity. See previous discussions on this forum regarding the
>>>> dueling RFEs.
>>> I don't share your disbelief or "little difference" analysis. If it is true 
>>> that no current implementation succeeds, isn't that a great opportunity to 
>>> change the rules? Wasn't resilver time vs availability a major factor 
>>> in Adam Leventhal's paper introducing the need for RAIDZ3?
>> 
>> No, it wasn't.
> 
> Maybe we weren't reading the same paper?
> 
> From http://dtrace.org/blogs/ahl/2009/12/21/acm_triple_parity_raid (a pointer 
> to Adam's ACM article)
>> The need for triple-parity RAID
>> ...
>> The time to populate a drive is directly relevant for RAID rebuild. As disks 
>> in RAID systems take longer to reconstruct, the reliability of the total 
>> system decreases due to increased periods running in a degraded state. Today 
>> that can be four hours or longer; that could easily grow to days or weeks. 
> 
> From http://queue.acm.org/detail.cfm?id=1670144 (Adam's ACM article)
>> While bit error rates have nearly kept pace with the growth in disk 
>> capacity, throughput has not been given its due consideration when 
>> determining RAID reliability.
> 
> Whilst Adam does discuss the lack of progress in bit error rates, his focus 
> (in the article, and in his pointer to it) seems to be on drive capacity vs 
> data rates, how this impacts recovery times, and the consequential need to 
> protect against multiple overlapping failures.
> 
>> There are two failure modes we can model given the data
>> provided by disk vendors:
>> 1. failures by time (MTBF)
>> 2. failures by bits read (UER)
>> 
>> Over time, the MTBF has improved, but the failures by bits read has not
>> improved. Just a few years ago enterprise class HDDs had an MTBF
>> of around 1 million hours. Today, they are in the range of 1.6 million
>> hours. Just looking at the size of the numbers, the probability that a
>> drive will fail in one hour is on the order of 10^-6.
>> 
>> By contrast, the failure rate by bits read has not improved much.
>> Consumer class HDDs are usually spec'ed at 1 error per 1e14
>> bits read.  To put this in perspective, a 2TB disk has around 1.6e13
>> bits. Or, the probability of an unrecoverable read if you read every bit 
>> on a 2TB drive is well above 10%. Some of the better enterprise class 
>> HDDs are rated two orders of magnitude better, but the only way to get
>> much better is to use more bits for ECC... hence the move towards
>> 4KB sectors.
>> 
>> In other words, the probability of losing data through a read error during
>> reconstruction can be larger than the probability of losing data to another
>> drive failure next year. This is the case for triple-parity RAID.
> 
> MTBF as quoted by HDD vendors has become pretty meaningless. [nit: when a 
> disk fails, it is not considered "repairable", so a better metric is MTTF 
> (because there are no repairable failures)]

They are the same in this context.

> 1.6 million hours equates to about 180 years, so why do HDD vendors guarantee 
> their drives for considerably less (typically 3-5 years)? It's because they 
> base the figure on a constant failure rate expected during the normal useful 
> life of the drive (typically 5 years).

MTBF has units of "hours between failures," but is often shortened to "hours."
It is often easier to do the math with Failures in Time (FITs), where Time is a
billion hours. There is a direct correlation:

FITs = 1,000,000,000 / MTBF

To put this in perspective, a modern CPU has an MTBF of around 4 million hours,
or 250 FITs. A simple PCI card can easily exceed 10 million hours, or fewer than
100 FITs.

Or, if you prefer, the annualized failure rate (AFR) gives a more intuitive
measure.

AFR = 8760 hours per year / MTBF

AFR is often expressed as a percentage; figures in the range of 0.6% to 4% are
typical for disks.

Remember, all of the wear-out failures described by a disk's MTBF are
mechanical failures.
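
As a back-of-the-envelope check (not from any product data, just the formulas
above applied to the MTBF figures already quoted), a few lines of Python show
how MTBF maps to FITs and AFR:

# Convert MTBF (hours) to FITs and to an approximate AFR.
# AFR ~= 8760 / MTBF is a good approximation while MTBF >> 8760 hours.
HOURS_PER_YEAR = 8760

def fits(mtbf_hours):
    return 1e9 / mtbf_hours

def afr(mtbf_hours):
    return HOURS_PER_YEAR / mtbf_hours

for label, mtbf in [("enterprise HDD", 1.6e6),
                    ("modern CPU", 4e6),
                    ("simple PCI card", 1e7)]:
    print("%-16s %.0f FITs, AFR %.2f%%" % (label, fits(mtbf), afr(mtbf) * 100))

which gives roughly 625, 250 and 100 FITs, and AFRs of about 0.55%, 0.22% and
0.09%, respectively.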

> However, quoting from http://www.asknumbers.com/WhatisReliability.aspx
>> Field failures do not generally occur at a uniform rate, but follow a 
>> distribution in time commonly described as a "bathtub curve." The life of a 
>> device can be divided into three regions: Infant Mortality Period, where the 
>> failure rate progressively improves; Useful Life Period, where the failure 
>> rate remains constant; and Wearout Period, where failure rates begin to 
>> increase.
> 
> Crucially, the vendor's quoted MTBF figures do not take into account "infant 
> mortality" or early "wearout". Until every HDD is fitted with an 
> environmental tell-tale device for shock, vibration, temperature, pressure, 
> humidity, etc we can't even come close to predicting either factor.

Yes we can, and yes we do. All you need is a large enough sample size.
In many cases, the changes in failure rates occur because of events not 
considered in MTBF calculations: factory defects, contamination, environmental
conditions, physical damage, firmware bugs, etc.  

> And this is just the HDD itself. In a system there are many ways to lose 
> access to an HDD. So I'm exposed when I lose the first drive in a RAIDZ1 
> (second drive in a RAIDZ2, or third drive in a RAIDZ3). And the longer the 
> resilver takes, the longer I'm exposed.

Indeed.  Let's look at the math. For the simple MTTDL[1] model, which does not
consider UER, we calculate the probability of a second failure during the
repair time:
        single parity :
                MTTDL[1] = MTBF^2 / (N*(N-1) * MTTR)

        double parity:
                MTTDL[1] = MTBF^3 / (N * (N-1) * (N-2) * MTTR^2)

Mean Time To Repair (MTTR) includes logistical replacement and resilvering
time, so this model can show the advantage of hot spares (by reducing the
logistical replacement time).

The practical use of this model makes sense where MTTR is on the order of 10s
or 100s of hours while the MTBF is on the order of 1 million hours.
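
To make the model concrete, here is a small Python sketch of the MTTDL[1]
formulas above; the MTBF, MTTR and N values below are only illustrative
assumptions, not measurements:

# MTTDL[1]: ignores unrecoverable read errors, models only whole-drive failures.
def mttdl1_single(mtbf, n, mttr):
    return mtbf**2 / (n * (n - 1) * mttr)

def mttdl1_double(mtbf, n, mttr):
    return mtbf**3 / (n * (n - 1) * (n - 2) * mttr**2)

HOURS_PER_YEAR = 8760
mtbf = 1.0e6   # hours, a typical vendor figure
n = 9          # e.g. an 8+1 raidz
mttr = 24      # hours of logistics plus resilver

print("single parity: %.2e years" % (mttdl1_single(mtbf, n, mttr) / HOURS_PER_YEAR))
print("double parity: %.2e years" % (mttdl1_double(mtbf, n, mttr) / HOURS_PER_YEAR))

Note that shrinking MTTR from 24 hours to 6 hours improves the single-parity
MTTDL[1] by 4x and the double-parity MTTDL[1] by 16x, which is why resilver
time matters so much in this model.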

But the more difficult problem arises with the UER spec.  A consumer-grade
disk typically has a UER rating of 1 error per 10^14 bits read, and a 2TB drive
holds around 1.6 x 10^13 bits, so the eight surviving drives of an 8+1 raidz
hold on the order of 10^14 bits. In other words, the probability of hitting an
UER during reconstruction of an 8+1 raidz using 2TB consumer-grade drives is
more like 63%, much higher than the MTTDL[1] model implies. We are just now
seeing enterprise-class drives with a UER rating of 1 error per 10^16 bits read.
http://www.seagate.com/staticfiles/support/disc/manuals/enterprise/cheetah/NS/Cheetah%20NS%2010K.2/100516228d.pdf
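
For what it's worth, here is the same arithmetic as a hedged Python sketch
(the drive sizes and UER ratings are the ones quoted above; the independence
assumption is mine). Using the exact bit counts (8 x 1.6 x 10^13 bits) pushes
the result a little above the ~63% back-of-the-envelope figure:

import math

BITS_PER_TB = 8e12  # 10^12 bytes per TB

def p_ure_during_rebuild(uer_bits, surviving_drives, drive_tb):
    # Probability of at least one unrecoverable read error while reading
    # every bit on the surviving drives, assuming independent errors.
    bits_read = surviving_drives * drive_tb * BITS_PER_TB
    return 1 - math.exp(-bits_read / uer_bits)

print("consumer,   1 per 10^14: %4.1f%%" % (100 * p_ure_during_rebuild(1e14, 8, 2)))
print("enterprise, 1 per 10^16: %4.1f%%" % (100 * p_ure_during_rebuild(1e16, 8, 2)))

which prints roughly 72% for the consumer-grade case and about 1.3% for the
10^16 enterprise-grade case.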


> Add to the mix that Indy 500 drives can degrade to tricycle performance before 
> they fail utterly, and yes, low performing drives can still be an issue, even 
> for the elite.

Yes. I feel this will become the dominant issue with HDDs and one where there
is plenty of room for improvement in ZFS.

>>> The appropriateness or otherwise of resilver throttling depends on the 
>>> context. If I can tolerate further failures without data loss (e.g. RAIDZ2 
>>> with one failed device, or RAIDZ3 with two failed devices), or if I can 
>>> recover business critical data in a timely manner, then great. But there 
>>> may come a point where I would rather take a short term performance hit to 
>>> close the window on total data loss.
>> 
>> I agree. Back in the bad old days, we were stuck with silly throttles
>> on SVM (10 IOPs, IIRC). The current ZFS throttle (b142, IIRC) is dependent
>> on the competing, non-scrub I/O. This works because in ZFS all I/O is not
>> created equal, unlike the layered RAID implementations such as SVM or
>> RAID arrays. ZFS schedules the regular workload at a higher priority than
>> scrubs or resilvers. Add the new throttles and the scheduler is even more
>> effective. So you get your interactive performance at the cost of longer
>> resilver times. This is probably a good trade-off for most folks.
>> 
>>>>> The problem exists for mirrors too, but is not as marked because mirror 
>>>>> reconstruction is inherently simpler.
>>>> 
>>>> Resilver time is bounded by the random write performance of
>>>> the resilvering device. Mirroring or raidz make no difference.
>>> 
>>> This only holds in a quiesced system.
>> 
>> The effect will be worse for a mirror because you have direct
>> competition for the single, surviving HDD. For raidz*, we clearly
>> see the read workload spread out across the surviving disks at
>> approximately the 1/N ratio. In other words, if you have a 4+1 raidz,
>> then a resilver will keep the resilvering disk 100% busy writing, and 
>> the data disks approximately 25% busy reading. Later releases of 
>> ZFS will also prefetch the reads and the writes can be coalesced,
>> skewing the ratio a little bit, but the general case seems to be a
>> reasonable starting point.
> 
> Mirrored systems need more drives to achieve the same capacity, so mirrored 
> volumes are generally striped by some means, so the equivalent of your 4+1 
> RAIDZ1 is actually a 4+4. In such a configuration resilvering one drive at 
> 100% would also result in a mean hit of 25%.

For HDDs, writes take longer than reads, so reality is much more difficult to
model.  This is further complicated by ZFS's I/O scheduler, track read buffers,
ZFS prefetching, and the async nature of resilvering writes.

> Obviously, a drive running at 100% has nothing more to give, so for fun let's 
> throttle the resilver to 25x1MB sequential reads per second (which is about 
> 25% of a good drive's capacity). At this rate, a 2TB drive will resilver in 
> under 24 hours, so let's make that the upper bound.

OK.  I think this is a fair goal.  It is certainly easier to achieve than the
4.5 hours you can expect for sustained writes to the media.
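
The arithmetic behind those two numbers, as a quick sketch (the 2TB size and
the 25MB/s throttle come from the paragraph above; the ~125MB/s sustained
write rate is my assumption, chosen to match the 4.5 hour figure):

# Rough lower and upper bounds on whole-drive resilver time for a 2TB drive.
DRIVE_BYTES = 2e12

for label, mb_per_s in [("throttled to 25 MB/s", 25),
                        ("sustained ~125 MB/s ", 125)]:
    hours = DRIVE_BYTES / (mb_per_s * 1e6) / 3600.0
    print("%s %.1f hours" % (label, hours))

i.e. about 22 hours at the throttled rate (comfortably inside the 24 hour
bound) and about 4.4 hours at sustained media speed.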

> It is highly desirable to throttle the resilver and regular I/O rates 
> according to required performance and availability metrics, so something 
> better than 24 hours should be the norm.
> 
> It should also be possible for the system to report an ETA based on current 
> and historic workload statistics. "You may say I'm a dreamer..."

That is what happens today, but the algorithm doesn't work well for devices
with widely varying random performance profiles (eg HDDs).  As the resilver
throttle kicks in, due to other I/O taking priority, the resilver time becomes
even more unpredictable.

An amusing CR is 6973953, where the "solution" is "do not print estimated 
time if hours_left is more than 30 days"
http://bugs.opensolaris.org/bugdatabase/view_bug.do?bug_id=6973953

> For mirrored vdevs, ZFS could resilver using an efficient block level copy, 
> whilst keeping a record of progress, and considering copied blocks as already 
> mirrored and ready to be read and updated by normal activity. Obviously, it's 
> much harder to apply this approach for RAIDZ.
> 
> Since slabs are allocated sequentially, it should also be possible to set a 
> high water mark for the bulk copy, so that fresh pools with little or no data 
> could also be resilvered in minutes or seconds.

That is the case today.  Try it :-)
 -- richard

> I believe such an approach would benefit all ZFS users, not just the elite.
> 
>>  -- richard
> 
> Phil
> 
> p.s. just for the record, Nexenta's Hardware Supported List (HSL) is an 
> excellent resource for those wanting to build NAS appliances that actually 
> work...
> 
>    http://www.nexenta.com/corp/supported-hardware/hardware-supported-list
> 
> ... which includes Hitachi Ultrastar A7K2000 SATA 7200rpm HDDs (enterprise 
> class drives at near consumer prices)

_______________________________________________
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
