On 12/20/2010 9:20 AM, Saxon, Will wrote:
-----Original Message-----
From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-
boun...@opensolaris.org] On Behalf Of Edward Ned Harvey
Sent: Monday, December 20, 2010 11:46 AM
To: 'Lanky Doodle'; zfs-discuss@opensolaris.org
Subject: Re: [zfs-discuss] A few questions
From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-
boun...@opensolaris.org] On Behalf Of Lanky Doodle
I believe Oracle is aware of the problem, but most of
the core ZFS team has left. And of course, a fix for
Oracle Solaris no longer means a fix for the rest of
us.
OK, that is a bit concerning then. As good as ZFS may be, I'm not sure I want
to commit to a file system that is 'broken' and may never be fully fixed.
ZFS is not "broken." Resilver is, however, a weak spot: it is very
inefficient. For example:
My server is built from 10krpm SATA drives, 1TB each, and each drive can
sustain about 1 Gbit/sec (roughly 125 MB/sec) of sequential read/write. So if
I needed to resilver an entire drive in a mirror sequentially, it would take
about 8,000 sec = 133 minutes, call it a bit over 2 hours. In reality, I have
ZFS mirrors, the disks are around 70% full, and a resilver takes 12-14 hours.
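
Spelled out, the arithmetic looks like this (a rough sketch in Python, using
the round numbers above rather than measurements):

# Back-of-the-envelope numbers from above (round figures, not a benchmark):
drive_bytes = 1e12            # 1 TB drive
seq_rate    = 125e6           # bytes/sec, i.e. about 1 Gbit/s
ideal_secs  = drive_bytes / seq_rate
print(ideal_secs, ideal_secs / 60)        # ~8000 s, ~133 min

# Observed on a ~70%-full mirror: 12-14 hours.  The gap is the point:
# resilver walks block pointers in roughly temporal order, so the disk
# spends its time seeking instead of streaming.
observed_secs = 13 * 3600
print(observed_secs / ideal_secs)         # roughly 6x worse than ideal
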
So although resilver is "broken" by some standards, it is bounded, and you
can keep it to something survivable by using mirrors instead of raidz. For
most people, even a 5-disk or 7-disk raidzN will still be fine. But it starts
to become unsustainable once you get up to, say, a 21-disk raidz3.
This argument keeps coming up on the list, but I don't see where anyone has
made a good suggestion about whether this can even be 'fixed' or how it would
be done.
As I understand it, you have two basic types of array reconstruction: in a
mirror you can make a block-by-block copy and that's easy, but in a parity
array you have to perform a calculation on the existing data and/or existing
parity to reconstruct the missing piece. This is pretty easy when you can
guarantee that all your stripes are the same width and start/end on the same
sectors/boundaries/whatever, and thus know that a piece of each stripe lives on
every drive in the set. I don't think this is possible with ZFS, since we have variable
stripe width. A failed disk d may or may not contain data from stripe s (or
transaction t). This information has to be discovered by looking at the
transaction records. Right?
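
To make the contrast concrete, here's a toy illustration in Python of why a
fixed-width parity rebuild can proceed purely sequentially (illustration only,
not raidz's actual parity math):

# Toy RAID-5-style rebuild: every stripe has exactly one chunk on every
# disk, so the chunk on the dead disk is just the XOR of the survivors.
def rebuild_chunk(surviving_chunks):
    out = bytearray(len(surviving_chunks[0]))
    for chunk in surviving_chunks:
        for i, b in enumerate(chunk):
            out[i] ^= b
    return bytes(out)

# With a fixed geometry you can walk the dead disk in LBA order and rebuild
# every stripe sequentially, no metadata needed.  With raidz's variable
# stripe width there is no such map: you only find out which stripes touch
# the dead disk by walking the block pointers.
print(rebuild_chunk([b"\x01\x02", b"\x03\x04"]))   # b'\x02\x06'
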
Can someone speculate as to how you could rebuild a variable stripe width array
without replaying all the available transactions? I am no filesystem engineer
but I can't wrap my head around how this could be handled any better than it
already is. I've read that resilvering is throttled - presumably to keep
performance degradation to a minimum during the process - maybe this could be a
tunable (e.g. priority: low, normal, high)?
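
For what it's worth, here is a hypothetical sketch (Python, all names made up)
of what such a priority knob could look like -- just the idea of trading
application I/O latency against resilver completion time, not how the actual
scan code is tuned:

import time

# Hypothetical priority knob: higher priority = shorter pause between
# repair I/Os.  This is NOT the actual ZFS throttle, only the concept.
PRIORITY_DELAY = {"low": 0.010, "normal": 0.002, "high": 0.0}

def repair(blk):
    pass                          # placeholder for the real repair I/O

def resilver(blocks, priority="normal"):
    delay = PRIORITY_DELAY[priority]
    for blk in blocks:
        repair(blk)               # read surviving copies, rewrite the new disk
        if delay:
            time.sleep(delay)     # yield the spindle to application I/O
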
Do we know if resilvers on a mirror are actually handled differently from those
on a raidz?
Sorry if this has already been explained. I think this is an issue that
everyone who uses ZFS should understand completely before jumping in, because
the behavior (while not 'wrong') is clearly NOT the same as with more
conventional arrays.
-Will
As far as a possible fix, here's what I can see:
[Note: I'm not a kernel or FS-level developer. I would love to be able
to fix this myself, but I have neither the aptitude nor the [extensive]
time to learn such skill]
We can either (a) change how ZFS does resilvering or (b) repack the
zpool layouts to avoid the problem in the first place.
In case (a), my vote would be to seriously increase the number of
in-flight resilver slabs, AND allow for out-of-time-order slab
resilvering. By that, I mean that ZFS would read several
disk-sequential slabs, and then mark them as "done". This would mean a
*lot* of scanning the metadata tree (since leaves all over the place
could be "done"). Frankly, I can't say how bad that would be; the
problem is that for ANY resilver, ZFS would have to scan the entire
metadata tree to see if it had work to do, rather than simply look for
the latest completed leaf, then assume everything after that needs to
be done. There'd also be the matter of determining *if* one should read
a disk sector...
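
In pseudo-Python, the bookkeeping problem I mean looks roughly like this (a
sketch under my assumptions, not a description of the actual scan code):

from collections import namedtuple

Block = namedtuple("Block", "id offset size")

done = set()                      # repaired block ids; a range tree in practice

def repair(blk):
    pass                          # placeholder: read good copies, rewrite new disk

def resilver_pass(blocks):
    # Issue repairs in disk-offset order so the reads stream...
    for blk in sorted(blocks, key=lambda b: b.offset):
        if blk.id in done:
            continue
        repair(blk)
        done.add(blk.id)          # ...but "done" is now a sparse set: after an
                                  # interruption you have to rescan the whole
                                  # metadata tree and test every block, instead
                                  # of resuming from one "last completed" mark.
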
In case (b), we need the ability to move slabs around on the physical
disk (via the mythical "Block Pointer Re-write" method). If there is
that underlying mechanism, then a "defrag" utility can be run to repack
the zpool to the point where chronological creation order matches physical
layout, which then substantially mitigates the seek-time problem.
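
Roughly what I mean by "repack", assuming the mythical primitive existed (all
names here are made up):

from collections import namedtuple

Block = namedtuple("Block", "id birth_txg size")

def defrag(blocks, allocate, bp_rewrite):
    # Lay blocks out in birth order so creation time matches physical order;
    # a later resilver can then walk them with mostly sequential I/O.
    for blk in sorted(blocks, key=lambda b: b.birth_txg):
        new_dva = allocate(blk.size)     # next free region, handed out sequentially
        bp_rewrite(blk, new_dva)         # the mythical part: move the block and
                                         # fix up everything that points at it

# e.g. defrag(all_blocks, allocator.alloc, bp_rewrite) once BP rewrite exists
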
I can't fix (a) - I don't understand the codebase well enough. Neither
can I do the BP-rewrite implementation. However, if I can get
BP-rewrite, I've got a prototype defragger that seems to work well
(under simulation). I'm sure it could use some performance improvement,
but it works reasonably well on a simulated fragmented pool.
Please, Santa, can a good little boy get a BP-rewrite code commit in his
stocking for Christmas?
--
Erik Trimble
Java System Support
Mailstop: usca22-123
Phone: x17195
Santa Clara, CA
Timezone: US/Pacific (GMT-0800)