Comments inline below...
Peter Schuller wrote:
In many situations it may not feel worth it to move to a raidz2 just to
avoid this particular case.
I can't think of any, but then again, I get paid to worry about failures
:-)
Given that one of the touted features of ZFS is data integrity, even with
cheap drives, it follows that getting maximum integrity out of any given
amount of resources is of interest.
In your typical home-use situation, for example, buying 4 drives of decent size
is pretty expensive considering that it *is* home use. Getting 4 drives for
the disk space of 3 is a lot more attractive than 5 drives for the disk space
of 3. But given that you do buy 4 drives and put them in a raidz, you want as
much safety as possible, and often you don't care that much about
availability.
That said, the argument scales. If you're not in a situation like the above,
you may easily warrant "wasting" an extra drive on raidz2. But raidz2 without
this feature is still less safe than raidz2 with it. So, going back to the
idea of getting as much redundancy as possible out of a given set of hardware
resources, you're still not optimal.
Please correct me if I misunderstand your reasoning, are you saying that a
broken disk should not be replaced?
Sorry, no. However, I realize my desire actually requires an additional
feature. The situation I envision is this:
* One disk goes down in a raidz, because the controller suddenly broke
(platters/heads are fine).
* You replace the disk and start a re-silvering.
* You trigger a bad block. At this point you are pretty screwed, unless:
* The pool did not change after the original drive failed, AND a "broken drive
assisted" resilvering is supported. You go to whatever effort required to fix
the disk (say, buy another one of the same model and replace the controller,
or hire some company that does this stuff), re-insert it into the machine.
* At this point you have a drive you can read data off of, but that you
certainly don't trust in general. So you want to start replacing it with the
new drive; if ZFS were then able to resilver to the new drive using both the
parity data on the other healthy drives in the pool and the disk being
replaced, you're happy.
It is my understanding that zpool replace already does this. Just don't
remove the failing disk...
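The mechanism being described -- rebuild from parity where possible, and fall
back to a still-readable sector on the old drive when a second sector in the
same stripe turns out to be bad -- can be sketched as a toy model (this is
illustrative Python, not ZFS code; disks are modeled as dicts with missing
keys standing for unreadable sectors):

```python
# Hypothetical sketch of resilvering with single parity plus a
# partially-readable old drive. Blocks are ints for brevity.

def xor_reconstruct(blocks):
    """XOR all blocks together (how single parity recovers one loss)."""
    acc = 0
    for b in blocks:
        acc ^= b
    return acc

def resilver_block(i, healthy, parity, old_disk):
    """Rebuild stripe i of the replaced drive.

    healthy  -- dicts for the surviving data disks
    parity   -- dict for the parity disk
    old_disk -- dict for the failed-but-partially-readable drive
    Returns the reconstructed value, or None if truly unrecoverable.
    """
    sources = healthy + [parity]
    readable = [d[i] for d in sources if i in d]
    if len(readable) == len(sources):
        # Normal case: every other member of the stripe is readable.
        return xor_reconstruct(readable)
    if i in old_disk:
        # Second bad sector in the stripe -- fall back to the old drive.
        return old_disk[i]
    return None
```

The point of the sketch is the last branch: without the old drive, a second
unreadable sector in the stripe is fatal with single parity.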
Or let's take a more likely scenario. A disk starts dying because of bad
sectors (the disk has run out of remapping possibilities). You can no longer
fix this by re-writing the bad sectors; trying to re-write a sector fails
with an I/O error and ZFS kicks the disk out of the pool.
Standard procedure at this point is to replace the drive and resilver. But
once again - you might end up with a bad sector on another drive. Without
utilizing the existing broken drive, you're screwed. If, however, you were
able to take advantage of the sectors that *ARE* readable off the drive, and
the drive has *NOT* gone out of date (due to additional transaction commits)
since it was kicked out, you are once again happy.
(Once again assuming you don't happen to have bad luck and the set of bad
sectors on the two drives overlap.)
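For a sense of how unlikely that overlap is: assuming (my assumption, not a
claim from field data) that each drive's bad sectors land independently and
uniformly among N sectors, the exact probability follows from a hypergeometric
count:

```python
# Back-of-the-envelope estimate with assumed numbers: probability that
# two independently-placed sets of bad sectors share at least one index.
from math import comb

def p_overlap(n_sectors, bad_a, bad_b):
    """P(at least one sector index is bad on both drives)."""
    return 1 - comb(n_sectors - bad_a, bad_b) / comb(n_sectors, bad_b)

# e.g. ~10^9 sectors per drive, 10 bad sectors on each:
print(p_overlap(10**9, 10, 10))  # roughly 1e-7
```

So if the old drive's readable sectors can be used at all, the chance that the
very sectors you need are bad on both drives is vanishingly small.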
...
I think I was off base previously. It seems to me that you are really after
the policy for failing/failed disks. Currently, the only way a drive gets
"kicked out" is if ZFS cannot open it. Obviously, if ZFS cannot open the
drive, then you won't be able to read anything from it.
Looking forward, I think that there are several policies which may be desired...
If so, then that is contrary to the
accepted methods used in most mission critical systems. There may be other
methods which meet your requirements and are accepted. For example, one
procedure we see for those sites who are very interested in data retention
is to power off a system when it is degraded to a point (as specified)
where data retention is put at unacceptable risk.
This is kind of what I am after, except that I want to guarantee that not a
single transaction gets committed once a pool is degraded. Even if an admin
goes and turns the machine off, the disk will be out of date.
... such as a policy that says "if a disk is going bad, go read-only." I'm
quite sure that most applications won't respond well to such a policy, though.
The theory is that a
powered down system will stop wearing out. When the system is serviced,
then it can be brought back online. Obviously, this is not the case where
data availability is a primary requirement -- data retention has higher
priority.
On the other hand, hardware has a nasty tendency to break in relation to power
cycles...
We can already set a pool (actually the file systems in a pool) to be read
only.
Automatically and *immediately* on a drive failure?
You can listen to sysevents and implement policies.
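As a crude illustration of such a policy (a sketch only: the `zpool status -x`
and `zfs set readonly=on` commands are real, but the polling loop and the
`react_to_degrade` wiring are my assumptions -- a real implementation would
register a sysevent/FMA handler rather than poll):

```python
# Hypothetical policy sketch: flip a pool's filesystems to read-only
# the moment `zpool status -x` stops reporting it healthy.
import subprocess

def pool_unhealthy(status_text):
    """True unless `zpool status -x` reported all pools healthy."""
    return "all pools are healthy" not in status_text

def react_to_degrade(pool="tank"):
    out = subprocess.run(["zpool", "status", "-x"],
                         capture_output=True, text=True).stdout
    if pool_unhealthy(out):
        # Stop committing transactions as soon as possible.
        subprocess.run(["zfs", "set", "readonly=on", pool])
```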
There may be something else lurking here that we might be able to take
advantage of, at least for some cases. Since ZFS is COW, it doesn't have
the same "data loss" profile as other file systems, like UFS, which can
overwrite the data making reconstruction difficult or impossible. But
while this might be useful for forensics, the general case is perhaps
largely covered by the existing snapshot features.
Heh, in an ideal world - have ZFS automatically create a snapshot when a pool
degrades. Normal case is then continued read/write operation. But if you DO
end up with a bad situation and a double bad sector, you could then resilver
based on the snapshot (from the perspective of which, the partially failed
drive is up to date) and at least get back an older version of the data,
rather than no data at all.
That would be a compromise for when you cannot afford immediate read-only mode
for availability reasons.
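The snapshot-on-degrade idea is simple to express; here is a minimal sketch
(the `zfs snapshot -r` command is real, while the snapshot naming scheme is
my invention):

```python
# Sketch: build the command that recursively snapshots a pool the
# moment it degrades, tagging the snapshot with a UTC timestamp.
import time

def degrade_snapshot_cmd(pool, now=None):
    """Command line snapshotting every fs in the pool recursively."""
    ts = time.strftime("%Y%m%d-%H%M%S", time.gmtime(now))
    return ["zfs", "snapshot", "-r", f"{pool}@degraded-{ts}"]
```

A later resilver could then treat that snapshot as the reference point from
whose perspective the partially failed drive is still up to date.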
I suppose that if a resilvering can be performed relative to any arbitrary
node considered the root node, it might even be realistic to implement?
If I understand correctly, resilvering occurs at the zpool, not the file
system level.
I think a better policy may be to initiate a scrub when a failed read
occurs (along with a SERD policy). The scrub will remap bad blocks that
it can recover.
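The SERD idea -- trip only if N soft errors occur within a T-second window,
then e.g. kick off a scrub -- looks roughly like this (a toy counter, not the
actual FMA SERD engine):

```python
# Minimal SERD-style engine: trip when n errors land within t seconds.
from collections import deque

class Serd:
    def __init__(self, n, t):
        self.n, self.t = n, t
        self.events = deque()

    def record(self, timestamp):
        """Record one error; return True if the engine trips."""
        self.events.append(timestamp)
        # Drop errors that have aged out of the window.
        while self.events and timestamp - self.events[0] > self.t:
            self.events.popleft()
        return len(self.events) >= self.n
```

Isolated errors age out and never trip it; a burst of errors does, which is
the point of rate discrimination rather than reacting to every single fault.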
N.B. I do have a lot of field data on failures and failure rates. It is
often difficult to grok without having a clear objective in mind. We may
be able to agree on a set of questions which would quantify the need for
your ideas. Feel free to contact me directly.
Thanks. It's not that I have any particular situation where this becomes more
important than usual. It is just a general observation of a behavior which,
in cases where availability is not important, is sub-optimal from a data
safety perspective. The only reason I even brought it up was the focus on
data integrity that we see with ZFS.
In any case, this is a job for ZFS+FMA integration.
-- richard
_______________________________________________
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss