On 20/12/2010 13:59, Richard Elling wrote:
On Dec 20, 2010, at 2:42 AM, Phil Harman <phil.har...@gmail.com> wrote:

Why does resilvering take so long in raidz anyway?
Because it's broken. There were some changes a while back that made it more broken.

"broken" is the wrong term here. It functions as designed and correctly
resilvers devices. Disagreeing with the design is quite different than
proving a defect.

It might be the wrong term in general, but I think it does apply in the budget home media server context of this thread. I think we can agree that ZFS currently doesn't play well on cheap disks. I think we can also agree that the performance of ZFS resilvering is known to be suboptimal under certain conditions.

For a long time at Sun, the rule was "correctness is a constraint, performance is a goal". However, in the real world, performance is often also a constraint (just as a quick but erroneous answer is wrong, a slow but correct answer can also be "wrong").

Then one brave soul at Sun ventured that "if Linux is faster, it's a Solaris bug!" and to his surprise, the idea caught on. I later went on to tell people that ZFS delivered RAID "where I = inexpensive", so I'm just a little frustrated when that promise becomes less respected over time. First it was USB drives (which I agreed with), now it's SATA (and I'm not so sure).

There has been a lot of discussion on this list, along with anecdotes and some data.

"slow because I use devices with poor random write(!) performance"
is very different than "broken."

Again, context is everything. For example, if someone were building a business-critical NAS appliance from consumer-grade parts, I'd be the first to say "are you nuts?!"

The resilver doesn't do a single pass of the drives, but uses a "smarter" temporal algorithm based on metadata.

A design that only does a single pass does not handle the temporal
changes. Many RAID implementations use a mix of spatial and temporal
resilvering and suffer with that design decision.

Actually, it's easy to see how a combined spatial and temporal approach could be implemented to an advantage for mirrored vdevs.
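
For what it's worth, here's roughly the shape I have in mind, as a toy Python sketch (the dirty-region log and all names here are invented for illustration; this is not how ZFS is implemented): copy the whole device in one sequential pass at streaming bandwidth, log the regions dirtied by concurrent writes, then replay the much smaller log.

    # Toy model of combined spatial + temporal resilvering for a
    # two-way mirror. The "disks" are byte arrays; REGION is an
    # invented dirty-log granularity. Illustration only.
    REGION = 8                      # blocks per dirty-log region
    BLOCKS = 64                     # blocks per "disk"

    src = bytearray(BLOCKS)         # surviving side of the mirror
    dst = bytearray(BLOCKS)         # replacement device
    dirty = set()                   # regions written while resilvering

    def app_write(block, value):
        # Application writes land on the live side and are logged so
        # the resilver can re-copy just those regions afterwards.
        src[block] = value
        dirty.add(block // REGION)

    def pass1():
        # Pass 1 (spatial): one sequential sweep, streaming bandwidth.
        dst[:] = src

    def pass2():
        # Pass 2 (temporal): replay the dirty log; the set shrinks as
        # long as the write rate stays below the copy rate.
        while dirty:
            r = dirty.pop()
            lo, hi = r * REGION, min((r + 1) * REGION, BLOCKS)
            dst[lo:hi] = src[lo:hi]

    pass1()
    app_write(3, 42)                # a write racing the resilver
    pass2()
    assert dst == src               # the mirror has converged

Pass 2 only converges if the update rate stays below the copy rate, which is the same failure mode the temporal-only walk exhibits, but the bulk of the work has moved to a fast sequential pass.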

However, the current implementation has difficulty finishing the job if there's a steady flow of updates to the pool.

Please define current. There are many releases of ZFS, and
many improvements have been made over time. What has not
improved is the random write performance of consumer-grade
HDDs.

I was led to believe this was not yet fixed in Solaris 11, and that there are therefore doubts about which Solaris 10 update, if any, may see the fix.

As far as I'm aware, the only way to get bounded resilver times is to stop the workload until resilvering is completed.

I know of no RAID implementation that bounds resilver times
for HDDs. I believe it is not possible. OTOH, whether a resilver
takes 10 seconds or 10 hours makes little difference in data
availability. Indeed, this is why we often throttle resilvering
activity. See previous discussions on this forum regarding the
dueling RFEs.

I don't share your disbelief or your "little difference" analysis. If it is true that no current implementation succeeds, isn't that a great opportunity to change the rules? Wasn't resilver time vs availability a major factor in Adam Leventhal's paper introducing the need for RAIDZ3?
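
The usual back-of-envelope model makes the dependence explicit: for a single-parity group, MTTDL ~ MTBF^2 / (N(N-1) * MTTR), so halving the resilver window doubles the expected time to data loss. A quick sketch with illustrative (not measured) numbers:

    # Illustrative figures only; assumes independent drive failures.
    mtbf_h   = 500_000                   # drive MTBF, hours
    n        = 8                         # drives in the group
    repair_h = 24                        # resilver window, hours

    mttdl_h = mtbf_h ** 2 / (n * (n - 1) * repair_h)
    print(mttdl_h / 8766, "years")       # ~21,000 years; ~42,000 at 12h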

The appropriateness or otherwise of resilver throttling depends on the context. If I can tolerate further failures without data loss (e.g. RAIDZ2 with one failed device, or RAIDZ3 with two failed devices), or if I can recover business critical data in a timely manner, then great. But there may come a point where I would rather take a short term performance hit to close the window on total data loss.
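
Mechanically, such a throttle is simple enough: inject a small delay between resilver I/Os whenever application I/O is waiting (loosely the idea behind the zfs_resilver_delay tunable). A toy Python sketch, with all names and numbers invented:

    import time

    DELAY_S = 0.002                       # pause per resilver I/O under load

    def throttled_resilver(ios, app_busy):
        # ios: iterable of zero-argument callables, one per resilver I/O
        # app_busy(): True when application I/O is queued (stubbed below)
        for io in ios:
            io()                          # issue one resilver I/O
            if app_busy():
                time.sleep(DELAY_S)       # yield the spindle to the app

    # Stubbed usage: five no-op "I/Os" against an always-busy workload.
    throttled_resilver([lambda: None] * 5, lambda: True)

My point is that the delay should go to zero once I'm down to my last unit of redundancy.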

The problem exists for mirrors too, but is not as marked because mirror reconstruction is inherently simpler.

Resilver time is bounded by the random write performance of
the resilvering device. Mirroring or raidz make no difference.

This only holds in a quiesced system.
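
To put illustrative numbers on that bound for a quiesced pool (figures invented for a cheap 7200rpm SATA drive, not measurements):

    used_bytes = 10 ** 12        # 1 TB of allocated data
    block      = 32 * 1024       # average block size after fragmentation
    iops       = 100             # random writes/s the replacement sustains

    hours = used_bytes / block / iops / 3600
    print(round(hours), "hours") # ~85 hours, best case

Add a steady stream of updates and even that bound goes away, which is the whole point.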

I believe Oracle is aware of the problem, but most of the core ZFS team has left. And of course, a fix for Oracle Solaris no longer means a fix for the rest of us.

Some "improvements" were made post-b134 and pre-b148.

That is, indeed, good news.

 -- richard