Peter Schuller wrote:
In many situations it may not feel worth it to move to a raidz2 just to
avoid this particular case.
I can't think of any, but then again, I get paid to worry about failures
:-)

Given that one of the touted features of ZFS is data integrity, even with cheap drives, it follows that getting maximum integrity out of any given amount of resources is of interest.

In your typical home use situation for example, buying 4 drives of decent size is pretty expensive considering that it *is* home use. Getting 4 drives for the diskspace of 3 is a lot more attractive than 5 drives for the diskspace of 3. But given that you do get 4 drives and put them in a raidz, you want as much safety as possible, and often you don't care that much about availability.

That said, the argument scales. If you're not in a situation like the above, you may easily warrant "wasting" an extra drive on raidz2. But raidz2 without this feature is still less safe than raidz2 with the feature. So moving back to the idea of getting as much redundancy as possible given a certain set of hardware resources, you're still not optimal given your hardware.

Please correct me if I misunderstand your reasoning, are you saying that a
broken disk should not be replaced?

Sorry, no. However, I realize my desire actually requires an additional feature. The situation I envision is this:

* One disk goes down in a raidz, because the controller suddenly broke (platters/heads are fine).

* You replace the disk and start a re-silvering.

* You trigger a bad block. At this point, you are now pretty screwed, unless:

* The pool did not change after the original drive failed, AND a "broken drive assisted" resilvering is supported. You go to whatever effort is required to fix the disk (say, buy another drive of the same model and swap the controller over, or hire a company that does this sort of recovery), and re-insert it into the machine.

* At this point you have a drive you can read data off of, but that you certainly don't trust in general. So you want to start replacing it with the new drive; if ZFS were then able to resilver to the new drive using both the parity data on the other healthy drives in the pool and the readable sectors of the disk being replaced, you're happy.
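To make the "broken drive assisted" resilver idea concrete, here is a toy Python sketch using plain XOR parity. It is illustrative only (real RAID-Z uses variable stripe widths, block checksums, and so on), and all the names are made up; it just shows why a partly readable old disk turns an unrecoverable stripe back into a recoverable one:

```python
BAD = object()  # sentinel marking an unreadable sector

def xor_blocks(blocks):
    """XOR a list of equal-length byte strings (single parity)."""
    out = bytearray(len(blocks[0]))
    for blk in blocks:
        for i, b in enumerate(blk):
            out[i] ^= b
    return bytes(out)

def resilver_sector(survivors, parity, old_sector=None):
    """Rebuild the replaced disk's sector for one stripe.

    survivors:  sectors read from the remaining data disks (BAD if unreadable)
    parity:     the stripe's parity sector
    old_sector: the same sector read from the pulled-but-partly-readable
                old disk, if available
    """
    readable = [s for s in survivors if s is not BAD]
    if len(readable) == len(survivors):
        # Normal case: XOR of all survivors plus parity yields the missing sector.
        return xor_blocks(readable + [parity])
    # A survivor also has a bad sector: single parity can no longer help,
    # but a readable copy from the old disk still can.
    if old_sector is not None and old_sector is not BAD:
        return old_sector
    return BAD  # data loss for this stripe

# Two surviving data disks, plus the sector on the disk that failed:
d0, d1 = b"\x01\x02", b"\x10\x20"
old = b"\xaa\xbb"
par = xor_blocks([d0, d1, old])  # parity written before the failure

assert resilver_sector([d0, d1], par) == old            # clean resilver
assert resilver_sector([d0, BAD], par) is BAD           # bad sector on d1: lost
assert resilver_sector([d0, BAD], par, old_sector=old) == old  # old disk saves it
```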

It is my understanding that zpool replace already does this.  Just don't
remove the failing disk...

Or let's take a more likely scenario. A disk starts dying because of bad sectors (the disk has run out of remapping possibilities). You cannot fix this anymore by re-writing the bad sectors; trying to re-write a sector fails with an I/O error and ZFS kicks the disk out of the pool.

Standard procedure at this point is to replace the drive and resilver. But once again, you might hit a bad sector on another drive. Without utilizing the existing broken drive, you're screwed. If however you were able to take advantage of the sectors that *ARE* readable off the drive, and the drive has *NOT* gone out of date due to additional transaction commits since it was kicked out, you are once again happy.

(Once again assuming you don't happen to have bad luck and the set of bad sectors on the two drives overlap.)

...
I think I was off base previously.  It seems to me that you are really after
the policy for failing/failed disks.  Currently, the only way a drive gets
"kicked out" is if ZFS cannot open it.  Obviously, if ZFS cannot open the
drive, then you won't be able to read anything from it.

Looking forward, I think that there are several policies which may be desired...

If so, then that is contrary to the accepted methods used in most mission critical systems. There may be other
methods which meet your requirements and are accepted.  For example, one
procedure we see for those sites who are very interested in data retention
is to power off a system when it is degraded to a point (as specified)
where data retention is put at unacceptable risk.

This is kind of what I am after, except that I want to guarantee that not a single transaction gets committed once a pool is degraded. Even if an admin goes and turns the machine off, the disk will be out of date.

... such as a policy that says "if a disk is going bad, go read-only."  I'm
quite sure that most applications won't respond well to such a policy, though.

The theory is that a powered down system will stop wearing out. When the system is serviced,
then it can be brought back online.  Obviously, this is not the case where
data availability is a primary requirement -- data retention has higher
priority.

On the other hand, hardware has a nasty tendency to break in relation to power cycles...

We can already set a pool (actually the file systems in a pool) to be read
only.

Automatically and *immediately* on a drive failure?

You can listen to sysevents and implement policies.
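As a rough sketch of what such a policy listener might look like, here is a Python fragment that polls `zpool status -x` instead of consuming true sysevents, assuming that command's conventional "all pools are healthy" output; the helper names are mine, not part of any ZFS API:

```python
import subprocess

def pool_is_degraded(status_text):
    """Decide from `zpool status -x` output whether any pool needs attention.
    The exact output string is an assumption based on common zpool behavior."""
    return "all pools are healthy" not in status_text

def enforce_readonly(pool="tank"):
    """Hypothetical policy step: if the pool is degraded, flip its
    filesystems read-only so no further transactions are committed."""
    out = subprocess.run(["zpool", "status", "-x"],
                         capture_output=True, text=True).stdout
    if pool_is_degraded(out):
        subprocess.run(["zfs", "set", "readonly=on", pool])
```

The decision logic is separated from the command execution so it can be tested with canned status output.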

There may be something else lurking here that we might be able to take
advantage of, at least for some cases.  Since ZFS is COW, it doesn't have
the same "data loss" profile as other file systems, like UFS, which can
overwrite the data making reconstruction difficult or impossible.  But
while this might be useful for forensics, the general case is perhaps
largely covered by the existing snapshot features.

Heh, in an ideal world - have ZFS automatically create a snapshot when a pool degrades. Normal case is then continued read/write operation. But if you DO end up with a bad situation and a double bad sector, you could then resilver based on the snapshot (from the perspective of which, the partially failed drive is up to date) and at least get back an older version of the data, rather than no data at all.

That would be a useful compromise if you cannot afford immediate read-only mode for availability reasons.
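A minimal sketch of what that automatic reaction on degrade might look like as an ordered command sequence; the pool name, snapshot naming scheme, and the optional read-only step are all illustrative, not any existing ZFS+FMA mechanism:

```python
import time

def degrade_actions(pool, now=None):
    """Commands a hypothetical degrade policy might run, in order:
    snapshot first (freeze a resilver-consistent point in time),
    then flip the pool's filesystems read-only."""
    stamp = time.strftime("%Y%m%d-%H%M%S", time.gmtime(now))
    return [
        ["zfs", "snapshot", "-r", f"{pool}@degraded-{stamp}"],
        ["zfs", "set", "readonly=on", pool],
    ]
```

Each returned list could be handed straight to `subprocess.run`; snapshotting before going read-only matters, so the snapshot captures the last state the failed drive was consistent with.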

I suppose that if a resilvering can be performed relative to any arbitrary node considered the root node, it might even be realistic to implement?

If I understand correctly, resilvering occurs at the zpool, not the file
system level.

I think a better policy may be to initiate a scrub when a failed read
occurs (along with a SERD policy).  The scrub will remap bad blocks that
it can recover.
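For readers unfamiliar with SERD (soft error rate discriminator) engines: the idea is to trip a policy only when more than N errors occur within a sliding window of T seconds, rather than reacting to every single error. A minimal illustration, with made-up names (this is not the actual FMA engine):

```python
class Serd:
    """Toy SERD engine: trips when >= n errors fall within t seconds."""

    def __init__(self, n, t):
        self.n, self.t = n, t
        self.events = []

    def record(self, timestamp):
        """Record one error; return True if the engine has tripped."""
        self.events.append(timestamp)
        # Keep only events inside the sliding window ending at `timestamp`.
        self.events = [e for e in self.events if timestamp - e <= self.t]
        return len(self.events) >= self.n
```

A tripped engine would then trigger the scrub (e.g. `zpool scrub tank`) instead of immediately faulting the drive on the first failed read.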

N.B. I do have a lot of field data on failures and failure rates.  It is
often difficult to grok without having a clear objective in mind.  We may
be able to agree on a set of questions which would quantify the need for
your ideas.  Feel free to contact me directly.

Thanks. It's not that I have any particular situation where this becomes more important than usual. It is just a general observation of a behavior which, in cases where availability is not important, is sub-optimal from a data safety perspective. The only reason I even brought it up was the focus on data integrity that we see with ZFS.

In any case, this is a job for ZFS+FMA integration.
 -- richard
_______________________________________________
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss