On 20/12/2010 13:59, Richard Elling wrote:
On Dec 20, 2010, at 2:42 AM, Phil Harman <phil.har...@gmail.com> wrote:

Why does resilvering take so long in raidz anyway?
Because it's broken. There were some changes a while back that made it more broken.

"broken" is the wrong term here. It functions as designed and correctly
resilvers devices. Disagreeing with the design is quite different than
proving a defect.

It might be the wrong term in general, but I think it does apply in the budget home media server context of this thread. I think we can agree that ZFS currently doesn't play well on cheap disks. I think we can also agree that the performance of ZFS resilvering is known to be suboptimal under certain conditions.

For a long time at Sun, the rule was "correctness is a constraint, performance is a goal". However, in the real world, performance is often also a constraint (just as a quick but erroneous answer is wrong, a slow but correct answer can also be "wrong").

Then one brave soul at Sun ventured that "if Linux is faster, it's a Solaris bug!" and to his surprise, the idea caught on. I later went on to tell people that ZFS delivered RAID "where I = inexpensive", so I'm just a little frustrated when that promise becomes less respected over time. First it was USB drives (which I agreed with), now it's SATA (and I'm not so sure).

There has been a lot of discussion on this list, along with anecdotes and some data.

"slow because I use devices with poor random write(!) performance"
is very different than "broken."

Again, context is everything. For example, if someone were building a business-critical NAS appliance from consumer-grade parts, I'd be the first to say "are you nuts?!"

The resilver doesn't do a single pass of the drives, but uses a "smarter" temporal algorithm based on metadata.

A design that only does a single pass does not handle the temporal
changes. Many RAID implementations use a mix of spatial and temporal
resilvering and suffer with that design decision.

Actually, it's easy to see how a combined spatial and temporal approach could be implemented to an advantage for mirrored vdevs.
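
For what it's worth, here's roughly the shape I have in mind, as a toy Python sketch (the dirty-region log and all names here are invented for illustration; this is not how ZFS is implemented): copy the whole device in one sequential pass at streaming bandwidth, log the regions dirtied by concurrent writes, then replay the much smaller log.

    # Toy model of combined spatial + temporal resilvering for a
    # two-way mirror. The "disks" are byte arrays; REGION is an
    # invented dirty-log granularity. Illustration only.
    REGION = 8                      # blocks per dirty-log region
    BLOCKS = 64                     # blocks per "disk"

    src = bytearray(BLOCKS)         # surviving side of the mirror
    dst = bytearray(BLOCKS)         # replacement device
    dirty = set()                   # regions written while resilvering

    def app_write(block, value):
        # Application writes land on the live side and are logged so
        # the resilver can re-copy just those regions afterwards.
        src[block] = value
        dirty.add(block // REGION)

    def pass1():
        # Pass 1 (spatial): one sequential sweep, streaming bandwidth.
        dst[:] = src

    def pass2():
        # Pass 2 (temporal): replay the dirty log; the set shrinks as
        # long as the write rate stays below the copy rate.
        while dirty:
            r = dirty.pop()
            lo, hi = r * REGION, min((r + 1) * REGION, BLOCKS)
            dst[lo:hi] = src[lo:hi]

    pass1()
    app_write(3, 42)                # a write racing the resilver
    pass2()
    assert dst == src               # the mirror has converged

Pass 2 only converges if the update rate stays below the copy rate, which is the same failure mode the temporal-only walk exhibits, but the bulk of the work has moved to a fast sequential pass.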

However, the current implementation has difficulty finishing the job if there's a steady flow of updates to the pool.

Please define current. There are many releases of ZFS, and
many improvements have been made over time. What has not
improved is the random write performance of consumer-grade
HDDs.

I was led to believe this was not yet fixed in Solaris 11, and that there are therefore doubts about which Solaris 10 update, if any, may see the fix.

As far as I'm aware, the only way to get bounded resilver times is to stop the workload until resilvering is completed.

I know of no RAID implementation that bounds resilver times
for HDDs. I believe it is not possible. OTOH, whether a resilver
takes 10 seconds or 10 hours makes little difference in data
availability. Indeed, this is why we often throttle resilvering
activity. See previous discussions on this forum regarding the
dueling RFEs.

I don't share your disbelief or your "little difference" analysis. If it is true that no current implementation succeeds, isn't that a great opportunity to change the rules? Wasn't resilver time vs availability a major factor in Adam Leventhal's paper introducing the need for RAIDZ3?
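
The usual back-of-envelope model makes the dependence explicit: for a single-parity group, MTTDL ~ MTBF^2 / (N(N-1) * MTTR), so halving the resilver window doubles the expected time to data loss. A quick sketch with illustrative (not measured) numbers:

    # Illustrative figures only; assumes independent drive failures.
    mtbf_h   = 500_000                   # drive MTBF, hours
    n        = 8                         # drives in the group
    repair_h = 24                        # resilver window, hours

    mttdl_h = mtbf_h ** 2 / (n * (n - 1) * repair_h)
    print(mttdl_h / 8766, "years")       # ~21,000 years; ~42,000 at 12h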

The appropriateness or otherwise of resilver throttling depends on the context. If I can tolerate further failures without data loss (e.g. RAIDZ2 with one failed device, or RAIDZ3 with two failed devices), or if I can recover business critical data in a timely manner, then great. But there may come a point where I would rather take a short term performance hit to close the window on total data loss.
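
Mechanically, such a throttle is simple enough: inject a small delay between resilver I/Os whenever application I/O is waiting (loosely the idea behind the zfs_resilver_delay tunable). A toy Python sketch, with all names and numbers invented:

    import time

    DELAY_S = 0.002                       # pause per resilver I/O under load

    def throttled_resilver(ios, app_busy):
        # ios: iterable of zero-argument callables, one per resilver I/O
        # app_busy(): True when application I/O is queued (stubbed below)
        for io in ios:
            io()                          # issue one resilver I/O
            if app_busy():
                time.sleep(DELAY_S)       # yield the spindle to the app

    # Stubbed usage: five no-op "I/Os" against an always-busy workload.
    throttled_resilver([lambda: None] * 5, lambda: True)

My point is that the delay should go to zero once I'm down to my last unit of redundancy.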

The problem exists for mirrors too, but is not as marked because mirror reconstruction is inherently simpler.

Resilver time is bounded by the random write performance of
the resilvering device. Mirroring or raidz make no difference.

This only holds in a quiesced system.
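
To put illustrative numbers on that bound for a quiesced pool (figures invented for a cheap 7200rpm SATA drive, not measurements):

    used_bytes = 10 ** 12        # 1 TB of allocated data
    block      = 32 * 1024       # average block size after fragmentation
    iops       = 100             # random writes/s the replacement sustains

    hours = used_bytes / block / iops / 3600
    print(round(hours), "hours") # ~85 hours, best case

Add a steady stream of updates and even that bound goes away, which is the whole point.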

I believe Oracle is aware of the problem, but most of the core ZFS team has left. And of course, a fix for Oracle Solaris no longer means a fix for the rest of us.

Some "improvements" were made post-b134 and pre-b148.

That is, indeed, good news.

 -- richard