Erik,

        just a hypothetical what-if ...

In the case of resilvering a mirrored disk, why not take a snapshot, and then
resilver by doing a pure block copy from the snapshot? It would be sequential
as long as the original data was unmodified, with random access needed only
for the blocks modified since the snapshot, right?

After the original snapshot had been replicated, a second pass would be made
to bring the copy up to 100% live data.
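Something like this, in rough Python pseudocode -- all of the snapshot and
block-level helpers here are made up for illustration, not real ZFS
interfaces:

    # Sketch of the two-pass idea.  pool, source_disk and new_disk are
    # hypothetical objects; none of these calls exist in real ZFS.
    def resilver_via_snapshot(pool, source_disk, new_disk):
        snap = pool.take_snapshot()              # freeze a consistent view

        # Pass 1: sequential block copy of everything in the snapshot,
        # walked in plain LBA order, so the disk streams instead of seeking.
        for lba in sorted(snap.allocated_blocks()):
            new_disk.write(lba, source_disk.read(lba))

        # Pass 2: random access, but only for the blocks modified since the
        # snapshot was taken, bringing the copy up to 100% live data.
        for lba in pool.blocks_dirtied_since(snap):
            new_disk.write(lba, source_disk.read(lba))

        pool.attach_mirror(new_disk)

The attraction being that pass 1 is pure streaming I/O, and only pass 2 ever
has to seek.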

Not knowing enough about the inner workings of ZFS snapshots, I don't know why
this would not be doable. (I'm biased towards mirrors for busy filesystems.)

I'm supposing that a block-level snapshot is not doable -- or is it?

Mark

On Dec 20, 2010, at 1:27 PM, Erik Trimble wrote:

> On 12/20/2010 9:20 AM, Saxon, Will wrote:
>>> -----Original Message-----
>>> From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-
>>> boun...@opensolaris.org] On Behalf Of Edward Ned Harvey
>>> Sent: Monday, December 20, 2010 11:46 AM
>>> To: 'Lanky Doodle'; zfs-discuss@opensolaris.org
>>> Subject: Re: [zfs-discuss] A few questions
>>> 
>>>> From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-
>>>> boun...@opensolaris.org] On Behalf Of Lanky Doodle
>>>> 
>>>>> I believe Oracle is aware of the problem, but most of
>>>>> the core ZFS team has left. And of course, a fix for
>>>>> Oracle Solaris no longer means a fix for the rest of
>>>>> us.
>>>> OK, that is a bit concerning then. As good as ZFS may be, I'm not sure I
>>>> want to commit to a file system that is 'broken' and may not be fully
>>>> fixed, if at all.
>>> 
>>> ZFS is not "broken."  It is, however, a weak spot, that resilver is very
>>> inefficient.  For example:
>>> 
>>> On my server, which is made up of 10krpm SATA drives, 1TB each...  My
>>> drives
>>> can each sustain 1Gbit/sec sequential read/write.  This means, if I needed
>>> to resilver the entire drive (in a mirror) sequentially, it would take ...
>>> 8,000 sec = 133 minutes.  About 2 hours.  In reality, I have ZFS mirrors,
>>> and disks are around 70% full, and resilver takes 12-14 hours.
>>> 
>>> So although resilver is "broken" by some standards, it is bounded, and you
>>> can limit it to something which is survivable, by using mirrors instead of
>>> raidz.  For most people, even using 5-disk, or 7-disk raidzN will still be
>>> fine.  But it starts becoming unsustainable if you get up to a 21-disk raidz3,
>>> for example.
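To make the arithmetic in that estimate explicit (just a sanity check, using
the numbers Ned gives above):

    # Sanity check of the sequential-resilver estimate quoted above:
    # a 1 TB drive sustaining 1 Gbit/s of sequential read/write.
    capacity_bytes = 1e12            # 1 TB
    throughput_bps = 1e9 / 8         # 1 Gbit/s = 125 MB/s

    seconds = capacity_bytes / throughput_bps
    print(seconds, seconds / 60)     # 8000.0 seconds, ~133 minutes

versus the 12-14 hours he actually sees, which is the gap being discussed.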
>> This argument keeps coming up on the list, but I don't see where anyone has 
>> made a good suggestion about whether this can even be 'fixed' or how it 
>> would be done.
>> 
>> As I understand it, you have two basic types of array reconstruction: in a 
>> mirror you can make a block-by-block copy and that's easy, but in a parity 
>> array you have to perform a calculation on the existing data and/or existing 
>> parity to reconstruct the missing piece. This is pretty easy when you can 
>> guarantee that all your stripes are the same width, start/end on the same 
>> sectors/boundaries/whatever and thus know a piece of them lives on all 
>> drives in the set. I don't think this is possible with ZFS since we have 
>> variable stripe width. A failed disk d may or may not contain data from 
>> stripe s (or transaction t). This information has to be discovered by 
>> looking at the transaction records. Right?
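For a conventional fixed-width array, the reconstruction Will describes really
is just an XOR of the surviving members at the same offset. A toy example
(single parity and fixed stripe width assumed -- not ZFS code):

    # Toy single-parity reconstruction for a fixed-width stripe.
    # With fixed geometry, the missing member at any offset is just the XOR
    # of the surviving members at that same offset -- no metadata lookup.
    def rebuild_missing(surviving_chunks):
        rebuilt = bytearray(len(surviving_chunks[0]))
        for chunk in surviving_chunks:
            for i, b in enumerate(chunk):
                rebuilt[i] ^= b
        return bytes(rebuilt)

    # Example: 3 data chunks plus 1 parity chunk; data chunk #1 is "lost".
    data = [b"AAAA", b"BBBB", b"CCCC"]
    parity = rebuild_missing(data)                       # XOR of all data
    recovered = rebuild_missing([data[0], data[2], parity])
    assert recovered == data[1]

With ZFS's variable stripe width there is no fixed geometry like this to
exploit, which is exactly the question Will raises.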
>> 
>> Can someone speculate as to how you could rebuild a variable stripe width 
>> array without replaying all the available transactions? I am no filesystem 
>> engineer but I can't wrap my head around how this could be handled any 
>> better than it already is. I've read that resilvering is throttled - 
>> presumably to keep performance degradation to a minimum during the process - 
>> maybe this could be a tunable (e.g. priority: low, normal, high)?
>> 
>> Do we know if resilvers on a mirror are actually handled differently from 
>> those on a raidz?
>> 
>> Sorry if this has already been explained. I think this is an issue that 
>> everyone who uses ZFS should understand completely before jumping in, 
>> because the behavior (while not 'wrong') is clearly NOT the same as with 
>> more conventional arrays.
>> 
>> -Will
> the "problem" is NOT the checksum/error correction overhead. that's 
> relatively trivial.  The problem isn't really even variable width (i.e. 
> variable number of disks one crosses) slabs.
> 
> The problem boils down to this:
> 
> When ZFS does a resilver, it walks the METADATA tree to determine what order 
> to rebuild things from. That means, it resilvers the very first slab ever 
> written, then the next oldest, etc.   The problem here is that slab "age" has 
> nothing to do with where that data physically resides on the actual disks. If 
> you've used the zpool as a WORM device, then, sure, there should be a strict 
> correlation between increasing slab age and locality on the disk.  However, 
> in any reasonable case, files get deleted regularly. This means that, for a 
> slab B written immediately after slab A, there is a good chance it WON'T be 
> physically near slab A.
> 
> In the end, the problem is that using metadata order, while reducing the 
> total amount of work to do in the resilver (as you only resilver live data, 
> not every bit on the drive), increases the physical inefficiency for each 
> slab.  That is, seek time between cylinders begins to dominate your slab 
> reconstruction time.  In RAIDZ, this problem is magnified by both the much 
> larger average vdev size vs mirrors, and the necessity that all drives 
> containing that slab's information return their data before the corrected 
> data can be written to the resilvering drive.
> 
> Thus, current ZFS resilvering tends to be seek-time limited, NOT throughput 
> limited.  This is really the "fault" of the underlying media, not ZFS.  For 
> instance, if you have a raidZ of SSDs (where seek time is negligible, but 
> throughput isn't),  they resilver really, really fast. In fact, they resilver 
> at the maximum write throughput rate.   However, HDs are severely 
> seek-limited, so that dominates HD resilver time.
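A small, highly simplified model of why the traversal order matters so much on
rotating media (the block addresses and the seek-cost model here are invented
for illustration):

    # Compare total "seek distance" for a metadata-order (birth-time) walk
    # versus a plain sequential LBA scan.  Purely illustrative numbers.
    import random

    random.seed(0)
    nblocks = 10_000
    # After lots of creates and deletes, birth order bears no relation to LBA.
    blocks = [(txg, random.randrange(1_000_000)) for txg in range(nblocks)]

    def total_seek(lbas):
        return sum(abs(b - a) for a, b in zip(lbas, lbas[1:]))

    metadata_order = [lba for _, lba in sorted(blocks)]   # oldest txg first
    lba_order = sorted(lba for _, lba in blocks)          # sequential scan

    print("metadata (birth) order:", total_seek(metadata_order))
    print("sequential LBA order:  ", total_seek(lba_order))

The birth-order walk travels orders of magnitude farther across the platter,
which is roughly why HDs resilver so much more slowly than SSDs here.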
> 
> 
> The "answer" isn't simple, as the problem is media-specific.
> 
> -- 
> Erik Trimble
> Java System Support
> Mailstop:  usca22-123
> Phone:  x17195
> Santa Clara, CA
> Timezone: US/Pacific (GMT-0800)
> 

_______________________________________________
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
