On Dec 20, 2010, at 7:31 AM, Phil Harman <phil.har...@gmail.com> wrote:
> On 20/12/2010 13:59, Richard Elling wrote:
>> On Dec 20, 2010, at 2:42 AM, Phil Harman <phil.har...@gmail.com> wrote:
>>>
>>>> Why does resilvering take so long in raidz anyway?
>>>
>>> Because it's broken. There were some changes a while back that made it
>>> more broken.
>>
>> "broken" is the wrong term here. It functions as designed and correctly
>> resilvers devices. Disagreeing with the design is quite different than
>> proving a defect.
>
> It might be the wrong term in general, but I think it does apply in the
> budget home media server context of this thread.

If you only have a few slow drives, you don't have performance. Like trying
to win the Indianapolis 500 with a tricycle...

> I think we can agree that ZFS currently doesn't play well on cheap disks.
> I think we can also agree that the performance of ZFS resilvering is known
> to be suboptimal under certain conditions.

... and those conditions are also a strength. For example, most file systems
are nowhere near full. With ZFS you only resilver data. For those who recall
the resilver throttles in SVM or VxVM, you will appreciate not having to
resilver non-data.

> For a long time at Sun, the rule was "correctness is a constraint,
> performance is a goal". However, in the real world, performance is often
> also a constraint (just as a quick but erroneous answer is a wrong answer,
> so a slow but correct answer can also be "wrong").
>
> Then one brave soul at Sun once ventured that "if Linux is faster, it's a
> Solaris bug!" and to his surprise, the idea caught on. I later went on to
> tell people that ZFS delivered RAID "where I = inexpensive", so I'm just a
> little frustrated when that promise becomes less respected over time.
> First it was USB drives (which I agreed with), now it's SATA (and I'm not
> so sure).

"slow" doesn't begin with an "i" :-)

>>> There has been a lot of discussion, anecdotes and some data on this list.
>>
>> "slow because I use devices with poor random write(!) performance"
>> is very different than "broken."
>
> Again, context is everything. For example, if someone was building a
> business critical NAS appliance from consumer grade parts, I'd be the
> first to say "are you nuts?!"

Unfortunately, the math does not support your position...

>>> The resilver doesn't do a single pass of the drives, but uses a "smarter"
>>> temporal algorithm based on metadata.
>>
>> A design that only does a single pass does not handle the temporal
>> changes. Many RAID implementations use a mix of spatial and temporal
>> resilvering and suffer with that design decision.
>
> Actually, it's easy to see how a combined spatial and temporal approach
> could be implemented to an advantage for mirrored vdevs.
>
>>> However, the current implementation has difficulty finishing the job if
>>> there's a steady flow of updates to the pool.
>>
>> Please define current. There are many releases of ZFS, and many
>> improvements have been made over time. What has not improved is the
>> random write performance of consumer-grade HDDs.
>
> I was led to believe this was not yet fixed in Solaris 11, and that there
> are therefore doubts about what Solaris 10 update may see the fix, if any.
>
>>> As far as I'm aware, the only way to get bounded resilver times is to
>>> stop the workload until resilvering is completed.
>>
>> I know of no RAID implementation that bounds resilver times for HDDs. I
>> believe it is not possible. OTOH, whether a resilver takes 10 seconds or
>> 10 hours makes little difference in data availability. Indeed, this is
>> why we often throttle resilvering activity. See previous discussions on
>> this forum regarding the dueling RFEs.
>
> I don't share your disbelief or "little difference" analysis. If it is
> true that no current implementation succeeds, isn't that a great
> opportunity to change the rules? Wasn't resilver time vs. availability a
> major factor in Adam Leventhal's paper introducing the need for RAIDZ3?

No, it wasn't. There are two failure modes we can model given the data
provided by disk vendors:
  1. failures by time (MTBF)
  2. failures by bits read (UER)

Over time, the MTBF has improved, but the failure rate by bits read has not.
Just a few years ago, enterprise class HDDs had an MTBF of around 1 million
hours. Today, they are in the range of 1.6 million hours. Just looking at the
size of those numbers, the probability that a drive will fail in any given
hour is on the order of 1e-6.

By contrast, the failure rate by bits read has not improved much. Consumer
class HDDs are usually spec'ed at 1 error per 1e14 bits read. To put this in
perspective, a 2TB disk holds around 1.6e13 bits, so the probability of
hitting an unrecoverable read error while reading every bit on a 2TB disk is
well above 10%. Some of the better enterprise class HDDs are rated two orders
of magnitude better, but the only way to get much better is to use more bits
for ECC... hence the move towards 4KB sectors. In other words, the
probability of losing data while reading it back can be larger than the
probability of losing the drive in the next year. That is the case made for
triple-parity RAID.
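To put rough numbers on that comparison, here is a back-of-the-envelope
sketch in Python. It only plugs in the figures quoted above (a ~1e6 hour
MTBF, 1 error per 1e14 bits read, a 2TB disk) and assumes a constant failure
rate and independent bit errors, which are simplifications:

  # Back-of-the-envelope comparison of the two failure modes above.
  # Figures are the ones quoted in this thread; constant failure rate and
  # independent bit errors are simplifying assumptions.

  mtbf_hours = 1.0e6                      # enterprise MTBF of a few years ago
  p_fail_hour = 1.0 / mtbf_hours          # ~1e-6 chance of failing in an hour
  p_fail_year = 1.0 - (1.0 - p_fail_hour) ** (24 * 365)    # ~0.9% per year

  uer = 1.0 / 1.0e14                      # consumer UER, errors per bit read
  bits_2tb = 2.0e12 * 8                   # ~1.6e13 bits on a 2TB disk
  p_ure_full_read = 1.0 - (1.0 - uer) ** bits_2tb          # ~15%

  print("P(drive fails in the next year)  ~ %.3f" % p_fail_year)
  print("P(URE while reading a whole 2TB) ~ %.3f" % p_ure_full_read)

With these inputs, the chance of hitting an unrecoverable read while reading
the whole disk (~15%) dwarfs the chance of the drive dying in the next year
(~1%), which is the point being made above.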
> The appropriateness or otherwise of resilver throttling depends on the
> context. If I can tolerate further failures without data loss (e.g. RAIDZ2
> with one failed device, or RAIDZ3 with two failed devices), or if I can
> recover business critical data in a timely manner, then great. But there
> may come a point where I would rather take a short term performance hit to
> close the window on total data loss.

I agree. Back in the bad old days, we were stuck with silly throttles on SVM
(10 IOPS, IIRC). The current ZFS throttle (b142, IIRC) depends on the amount
of competing, non-scrub I/O. This works because in ZFS not all I/O is created
equal, unlike in layered RAID implementations such as SVM or RAID arrays.
ZFS schedules the regular workload at a higher priority than scrubs or
resilvers. Add the new throttles and the scheduler is even more effective.
So you get your interactive performance at the cost of longer resilver times.
This is probably a good trade-off for most folks.

>>> The problem exists for mirrors too, but is not as marked because mirror
>>> reconstruction is inherently simpler.
>>
>> Resilver time is bounded by the random write performance of the
>> resilvering device. Mirroring or raidz make no difference.
>
> This only holds in a quiesced system.

The effect will be worse for a mirror because you have direct competition for
the single, surviving HDD. For raidz*, we clearly see the read workload
spread out across the surviving disks at approximately a 1/N ratio. In other
words, if you have a 4+1 raidz, then a resilver will keep the resilvering
disk 100% busy writing, and the data disks approximately 25% busy reading
(a rough sketch of this split is appended below). Later releases of ZFS will
also prefetch the reads, and the writes can be coalesced, skewing the ratio a
little, but the general case seems to be a reasonable starting point.
 -- richard
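Appended sketch of the 1/N split described above. It merely restates the
approximation (reads spread evenly across the surviving data disks, all
writes land on the replacement disk) and ignores prefetch and write
coalescing; the function and numbers are illustrative only:

  # Approximate per-disk busy fractions during a raidz resilver, using the
  # 1/N approximation above. Ignores prefetch and write coalescing.

  def resilver_busy(n_data_disks):
      return {
          "resilvering disk (writes)": 1.0,            # ~100% busy writing
          "each surviving data disk (reads)": 1.0 / n_data_disks,
      }

  # The 4+1 raidz example: writer ~100% busy, each data disk ~25% busy.
  print(resilver_busy(4))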