On 12/20/2010 9:20 AM, Saxon, Will wrote:
-----Original Message-----
From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-
boun...@opensolaris.org] On Behalf Of Edward Ned Harvey
Sent: Monday, December 20, 2010 11:46 AM
To: 'Lanky Doodle'; zfs-discuss@opensolaris.org
Subject: Re: [zfs-discuss] A few questions
From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-
boun...@opensolaris.org] On Behalf Of Lanky Doodle
I believe Oracle is aware of the problem, but most of
the core ZFS team has left. And of course, a fix for
Oracle Solaris no longer means a fix for the rest of
us.
OK, that is a bit concerning then. As good as ZFS may be, I'm not sure I
want to commit to a file system that is 'broken' and may not be fully
fixed, if at all.
ZFS is not "broken." It does, however, have a weak spot: resilver is very
inefficient. For example:
On my server, which is made up of 10krpm SATA drives, 1TB each... my drives
can each sustain 1Gbit/sec sequential read/write. This means, if I needed
to resilver an entire drive (in a mirror) sequentially, it would take about
8,000 sec = 133 minutes, roughly 2 hours. In reality, I have ZFS mirrors,
my disks are around 70% full, and a resilver takes 12-14 hours.
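Just to sanity-check those numbers, a quick back-of-the-envelope sketch in
Python (the 1TB / 1Gbit/sec figures and the 12-14 hour observation come
from the paragraph above; the rest is plain arithmetic):

    # Ideal sequential rebuild time for a 1 TB mirror member at 1 Gbit/s,
    # versus the resilver time reported above.
    DISK_BYTES = 1e12              # 1 TB drive
    SEQ_RATE_BPS = 1e9 / 8         # 1 Gbit/s in bytes per second

    ideal_secs = DISK_BYTES / SEQ_RATE_BPS
    print(f"ideal sequential rebuild: {ideal_secs:.0f} s "
          f"(~{ideal_secs / 3600:.1f} h)")

    observed_secs = 13 * 3600      # midpoint of the reported 12-14 hours
    print(f"observed resilver is roughly {observed_secs / ideal_secs:.0f}x slower")

That works out to about 8,000 seconds ideal versus a resilver roughly six
times longer in practice, which is the gap being discussed here.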
So although resilver is "broken" by some standards, it is bounded, and you
can limit it to something survivable by using mirrors instead of raidz.
For most people, even a 5-disk or 7-disk raidzN will still be fine. But it
starts getting unsustainable once you get up to, say, a 21-disk raidz3.
This argument keeps coming up on the list, but I don't see where anyone has
made a good suggestion about whether this can even be 'fixed' or how it would
be done.
As I understand it, you have two basic types of array reconstruction: in a
mirror you can make a block-by-block copy and that's easy, but in a parity
array you have to perform a calculation on the existing data and/or existing
parity to reconstruct the missing piece. This is pretty easy when you can
guarantee that all your stripes are the same width, start/end on the same
sectors/boundaries/whatever and thus know a piece of them lives on all drives
in the set. I don't think this is possible with ZFS since we have variable
stripe width. A failed disk d may or may not contain data from stripe s (or
transaction t). This information has to be discovered by looking at the
transaction records. Right?
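To illustrate the "easy" fixed-width case, here is a minimal sketch in
Python (not anything resembling actual ZFS code): with a fixed stripe
width and single parity, the missing column of every stripe can be rebuilt
purely positionally, by XORing the surviving columns, without consulting
any filesystem metadata at all.

    from functools import reduce

    def rebuild_column(stripes, failed):
        # Rebuild the failed column of a fixed-width, single-parity array.
        # stripes: list of stripes, each a list of equal-length byte strings
        #          (the data columns plus one XOR parity column).
        # failed:  index of the lost column.
        rebuilt = []
        for stripe in stripes:
            survivors = [col for i, col in enumerate(stripe) if i != failed]
            missing = reduce(lambda a, b: bytes(x ^ y for x, y in zip(a, b)),
                             survivors)
            rebuilt.append(missing)
        return rebuilt

    # Example: three data columns plus parity, 4 bytes per column.
    data = [b"AAAA", b"BBBB", b"CCCC"]
    parity = bytes(a ^ b ^ c for a, b, c in zip(*data))
    stripe = data + [parity]
    print(rebuild_column([stripe], failed=1))   # -> [b'BBBB']

With ZFS's variable stripe width there is no such positional shortcut:
which disks hold pieces of a given block, and where, is recorded only in
the block pointers, so reconstruction has to be driven from the metadata.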
Can someone speculate as to how you could rebuild a variable stripe width array
without replaying all the available transactions? I am no filesystem engineer
but I can't wrap my head around how this could be handled any better than it
already is. I've read that resilvering is throttled - presumably to keep
performance degradation to a minimum during the process - maybe this could be a
tunable (e.g. priority: low, normal, high)?
Do we know if resilvers on a mirror are actually handled differently from those
on a raidz?
Sorry if this has already been explained. I think this is an issue that
everyone who uses ZFS should understand completely before jumping in, because
the behavior (while not 'wrong') is clearly NOT the same as with more
conventional arrays.
-Will
the "problem" is NOT the checksum/error correction overhead. that's
relatively trivial. The problem isn't really even variable width (i.e.
variable number of disks one crosses) slabs.
The problem boils down to this:
When ZFS does a resilver, it walks the METADATA tree to determine what
order to rebuild things in. That means it resilvers the very first
slab ever written, then the next oldest, and so on. The problem here is that
slab "age" has nothing to do with where that data physically resides on
the actual disks. If you've used the zpool as a WORM device, then, sure,
there should be a strict correlation between increasing slab age and
locality on the disk. However, in any realistic case, files get
deleted regularly. This means there is a good chance that a slab B,
written immediately after slab A, WON'T be physically near slab A.
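A toy model of why that ordering hurts (illustrative numbers only, not
measured ZFS behavior): once on-disk position is essentially uncorrelated
with birth order, walking blocks in birth order forces vastly more head
travel than walking them in on-disk (LBA) order would.

    import random

    random.seed(1)
    DISK_SECTORS = 2_000_000_000        # ~1 TB of 512-byte sectors
    LIVE_BLOCKS = 100_000               # live slabs that need resilvering

    # After years of writes and deletes, on-disk position is roughly
    # uncorrelated with birth order, so model each block's LBA as random.
    birth_order = [random.randrange(DISK_SECTORS) for _ in range(LIVE_BLOCKS)]
    lba_order = sorted(birth_order)

    def head_travel(lbas):
        return sum(abs(b - a) for a, b in zip(lbas, lbas[1:]))

    print("head travel, birth order:", head_travel(birth_order))
    print("head travel, LBA order:  ", head_travel(lba_order))
    # The birth-order walk travels orders of magnitude farther, and on a
    # spinning disk every long seek costs several milliseconds.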
In the end, the problem is that using metadata order, while reducing the
total amount of work to do in the resilver (as you only resilver live
data, not every bit on the drive), increases the physical inefficiency
for each slab. That is, seek time between cylinders begins to dominate
your slab reconstruction time. In RAIDZ, this problem is magnified by
both the much larger average vdev size compared to mirrors, and the necessity
that all drives containing a piece of a slab return their data before
the reconstructed data can be written to the resilvering drive.
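To make the RAIDZ magnification concrete, here is a rough per-slab I/O
count (an illustrative model, assuming a slab that spans the whole group,
which with variable stripe width it may not): a mirror rebuilds a slab with
a single read from the surviving side, while a raidz group has to fetch a
chunk from every surviving column of that slab before the missing piece
can be recomputed.

    def reads_per_slab(vdev_type, disks):
        # Device reads needed to reconstruct one slab onto the new disk.
        if vdev_type == "mirror":
            return 1                  # just copy the block from the other side
        if vdev_type == "raidz":
            return disks - 1          # one chunk from every surviving column
        raise ValueError(vdev_type)

    print(reads_per_slab("mirror", 2))   # 1 read per slab
    print(reads_per_slab("raidz", 9))    # 8 reads per slab for a 9-disk raidz1
    # Each of those reads is usually another seek, and the slab can't be
    # rewritten until the slowest of them completes.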
Thus, current ZFS resilvering tends to be seek-time limited, NOT
throughput limited. This is really the "fault" of the underlying media,
not ZFS. For instance, a raidz of SSDs (where seek time is negligible,
but throughput isn't) resilvers really, really fast; in fact, it
resilvers at the maximum write throughput rate. However, HDs are
severely seek-limited, and that dominates HD resilver time.
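A rough seek-limited estimate lands right where the mirror numbers quoted
earlier do (a sketch; the ~128KB average block size and ~8ms per random
I/O are assumptions, not measurements):

    LIVE_BYTES = 0.7 * 1e12        # ~70% full, 1 TB mirror member
    AVG_BLOCK = 128 * 1024         # assumed average block size
    SEEK_SECS = 0.008              # assumed seek + rotational latency per I/O

    blocks = LIVE_BYTES / AVG_BLOCK
    est_hours = blocks * SEEK_SECS / 3600
    print(f"~{blocks / 1e6:.1f}M blocks -> ~{est_hours:.0f} h "
          f"if each block costs a seek")
    # ~12 hours -- the same ballpark as the 12-14 hour resilver reported
    # above, versus ~2 hours if the data could be streamed sequentially.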
The "answer" isn't simple, as the problem is media-specific.
--
Erik Trimble
Java System Support
Mailstop: usca22-123
Phone: x17195
Santa Clara, CA
Timezone: US/Pacific (GMT-0800)
_______________________________________________
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss