> > In many situations it may not feel worth it to move to a raidz2 just to
> > avoid this particular case.
>
> I can't think of any, but then again, I get paid to worry about failures
> :-)

Given that one of the touted features of ZFS is data integrity, including 
in the case of cheap drives, it follows that it is of interest to get 
maximum integrity out of any given amount of resources.

In your typical home use situation, for example, buying 4 drives of decent 
size is pretty expensive precisely because it *is* home use. Getting 4 
drives for the disk space of 3 is a lot more attractive than 5 drives for 
the disk space of 3. But given that you do get 4 drives and put them in a 
raidz, you want as much safety as possible, and often you don't care that 
much about availability.

That said, the argument scales. If you're not in a situation like the 
above, you may well be able to justify "wasting" an extra drive on raidz2. 
But raidz2 without this feature is still less safe than raidz2 with it. So, 
coming back to the idea of getting as much redundancy as possible out of a 
given set of hardware resources, you're still not getting the most out of 
your hardware.

> Please correct me if I misunderstand your reasoning, are you saying that a
> broken disk should not be replaced?

Sorry, no. However, I realize my desire actually requires an additional 
feature. The situation I envision is this:

* One disk goes down in a raidz because its controller suddenly broke 
(the platters/heads are fine).

* You replace the disk and start a re-silvering.

* You trigger a bad block. At this point you are pretty screwed, unless:

* The pool did not change after the original drive failed, AND a "broken 
drive assisted" resilver is supported. You go to whatever effort is 
required to fix the disk (say, buy another one of the same model and swap 
the controller over, or hire a company that does this sort of thing) and 
re-insert it into the machine.

* At this point you have a drive you can read data off of, but that you 
certainly don't trust in general. So you want to replace it with the new 
drive; if ZFS were then able to resilver onto the new drive using both the 
parity data on the other healthy drives in the pool and the disk being 
replaced, you'd be happy.

Or let's take a more likely scenario. A disk starts dying because of bad 
sectors (the disk has run out of remapping possibilities). You cannot fix 
this anymore by re-writing the bad sectors; trying to re-write a sector 
ends up failing with an I/O error and ZFS kicks the disk out of the pool.

Standard procedure at this point is to replace the drive and resilver. But 
once again, you might end up with a bad sector on another drive. Without 
utilizing the existing broken drive, you're screwed. If, however, you were 
able to take advantage of the sectors that *ARE* readable off the drive, 
and the drive has *NOT* gone out of date (due to additional transaction 
commits) since it was kicked out, you are once again happy.

(Once again assuming you don't have the bad luck that the sets of bad 
sectors on the two drives overlap.)
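
To make what I mean in both scenarios concrete, here is a toy sketch in 
Python of a "broken drive assisted" rebuild of one sector. This is purely 
illustrative of the proposed behavior, not how ZFS resilvering actually 
works; it uses a plain single-parity XOR stripe and all the names are my 
own invention:

def xor(blocks):
    """XOR a list of equal-sized byte strings together."""
    out = bytearray(len(blocks[0]))
    for b in blocks:
        for i, byte in enumerate(b):
            out[i] ^= byte
    return bytes(out)

def resilver_sector(old_disk_read, healthy_sectors):
    """old_disk_read is the sector as read off the failed disk, or None if
    it is unreadable; healthy_sectors are the corresponding data and parity
    sectors from the surviving drives."""
    if old_disk_read is not None:
        return old_disk_read          # best case: the old disk still has it
    return xor(healthy_sectors)       # fall back to parity reconstruction

# Toy stripe: d0 lives on the failed disk, d1/d2 on healthy disks, p is parity.
d0, d1, d2 = b"AAAA", b"BBBB", b"CCCC"
p = xor([d0, d1, d2])
assert resilver_sector(d0, [d1, d2, p]) == d0    # readable sector: copied as-is
assert resilver_sector(None, [d1, d2, p]) == d0  # bad sector: rebuilt from parity

The point is simply that the failed disk is treated as a best-effort data 
source (with ZFS's end-to-end checksums catching anything it gets wrong, 
which the toy above omits), and parity reconstruction is only the fallback, 
instead of the disk being ignored entirely.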

> If so, then that is contrary to the 
> accepted methods used in most mission critical systems.  There may be other
> methods which meet your requirements and are accepted.  For example, one
> procedure we see for those sites who are very interested in data retention
> is to power off a system when it is degraded to a point (as specified)
> where data retention is put at unacceptable risk.

This is kind of what I am after, except that I want to guarantee that not 
a single transaction gets committed once a pool is degraded. Even if an 
admin does go and turn the machine off, by that time the failed disk will 
already be out of date.

> The theory is that a 
> powered down system will stop wearing out.  When the system is serviced,
> then it can be brought back online.  Obviously, this is not the case where
> data availability is a primary requirement -- data retention has higher
> priority.

On the other hand, hardware has a nasty tendency to break in relation to 
power cycles...

> We can already set a pool (actually the file systems in a pool) to be read
> only.

Automatically and *immediately* on a drive failure?
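
Something along these lines is roughly what I have in mind, sketched in 
Python. Nothing like it exists as far as I know, "tank" is just a 
placeholder pool name, and a real implementation would subscribe to 
FMA/sysevent notifications rather than poll. Even then there is a window 
in which transactions can still be committed against the degraded pool, 
which is exactly why I would rather see it done inside ZFS itself:

import subprocess
import time

POOL = "tank"    # placeholder pool name

def pool_health(pool):
    out = subprocess.check_output(["zpool", "list", "-H", "-o", "health", pool])
    return out.decode().strip()

while True:
    if pool_health(POOL) != "ONLINE":
        # readonly is inherited, so setting it on the pool's root dataset
        # makes every file system in the pool read-only.
        subprocess.call(["zfs", "set", "readonly=on", POOL])
        break
    time.sleep(5)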

> There may be something else lurking here that we might be able to take
> advantage of, at least for some cases.  Since ZFS is COW, it doesn't have
> the same "data loss" profile as other file systems, like UFS, which can
> overwrite the data making reconstruction difficult or impossible.  But
> while this might be useful for forensics, the general case is perhaps
> largely covered by the existing snapshot features.

Heh, in an ideal world: have ZFS automatically create a snapshot when a 
pool degrades. The normal case is then continued read/write operation. But 
if you DO end up in a bad situation with a double bad sector, you could 
then resilver based on the snapshot (from whose perspective the partially 
failed drive is up to date) and at least get back an older version of the 
data, rather than no data at all.

This could serve as a compromise if you cannot afford immediate read-only 
mode for availability reasons.
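
Something like this variant of the watchdog above, for instance. Again 
only a sketch with "tank" as a placeholder; a snapshot taken from a 
polling loop only approximates "the pool state at the moment of failure", 
which is again why I would rather see ZFS do it internally on the 
degradation event:

import subprocess
import time

POOL = "tank"    # placeholder pool name

def pool_health(pool):
    out = subprocess.check_output(["zpool", "list", "-H", "-o", "health", pool])
    return out.decode().strip()

while True:
    if pool_health(POOL) != "ONLINE":
        stamp = time.strftime("%Y%m%d-%H%M%S")
        subprocess.call(["zfs", "snapshot", "-r", POOL + "@degraded-" + stamp])
        break
    time.sleep(5)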

I suppose that if a resilver can be performed relative to an arbitrary 
node treated as the root node, it might even be realistic to implement?

> N.B. I do have a lot of field data on failures and failure rates.  It is
> often difficult to grok without having a clear objective in mind.  We may
> be able to agree on a set of questions which would quantify the need for
> your ideas.  Feel free to contact me directly.

Thanks. It's not that I have any particular situation where this becomes more 
important than usual. It is just a general observation of a behavior which, 
in cases where availability is not important, is sub-optimal from a data 
safety perspective. The only reason I even brought it up was the focus on 
data integrity that we see with ZFS.

-- 
/ Peter Schuller, InfiDyne Technologies HB

PGP userID: 0xE9758B7D or 'Peter Schuller <[EMAIL PROTECTED]>'
Key retrieval: Send an E-Mail to [EMAIL PROTECTED]
E-Mail: [EMAIL PROTECTED] Web: http://www.scode.org
