On Fri, Feb 13, 2009 at 7:41 PM, Bob Friesenhahn
<bfrie...@simple.dallas.tx.us> wrote:
> On Fri, 13 Feb 2009, Ross wrote:
>>
>> Something like that will have people praising ZFS' ability to safeguard
>> their data, and the way it recovers even after system crashes or when
>> hardware has gone wrong. You could even have a "common causes of this
>> are..." message, or a link to an online help article if you wanted people to
>> be really impressed.
>
> I see a career in politics for you. Barring an operating system
> implementation bug, the type of problem you are talking about is due to
> improperly working hardware. Irreversibly reverting to a previous
> checkpoint may or may not obtain the correct data. Perhaps it will produce
> a bunch of checksum errors.

Yes, the root cause is improperly working hardware (or an OS bug like 6424510), but with ZFS being a copy-on-write system, when errors occur with a recent write, the vast majority of pools out there still hold huge amounts of data that are perfectly valid and should be accessible. Unless I'm misunderstanding something, reverting to a previous checkpoint gets you back to a state where ZFS knows it's good (or at least where ZFS can verify whether it's good or not).

You have to consider that even with improperly working hardware, ZFS has been checksumming data, so if that hardware has been working for any length of time, you *know* that the data on it is good.

Yes, if you have databases or files there that were mid-write, they will almost certainly be corrupted. But at least your filesystem is back, and it's in as good a state as it's going to be, given that for your pool to be in this position your hardware went wrong mid-write.

And as an added bonus, if you're using ZFS snapshots, now that your pool is accessible you have a bunch of backups available, so you can probably roll corrupted files back to working versions.

For me, that is about as good as you can get in terms of handling a sudden hardware failure. Everything that is known to be saved to disk is there, you can verify (with absolute certainty) whether data is ok or not, and you have backup copies of damaged files. In the old days you'd need to revert to tape backups for both of these, with potentially hours of downtime before you even know where you are. Achieving that in a few seconds (or minutes) is a massive step forwards.

> There are already people praising ZFS' ability to safeguard their data, and
> the way it recovers even after system crashes or when hardware has gone
> wrong.

Yes there are, but the majority of these are praising the ability of ZFS checksums to detect bad data, and to repair it when you have redundancy in your pool.
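
To make the rollback argument above a bit more concrete, here is a toy sketch of what I mean. It is only a model of the idea (walk back from the newest checkpoint until you find one whose data still matches its recorded checksums), not how ZFS actually does it. The names `Checkpoint`, `verifies` and `newest_good_checkpoint` are made up for illustration, and SHA-256 is just a stand-in checksum.

```python
import hashlib
from dataclasses import dataclass
from typing import Dict, List, Optional


def checksum(data: bytes) -> str:
    # Stand-in checksum for the toy model; real pools use their own algorithms.
    return hashlib.sha256(data).hexdigest()


@dataclass
class Checkpoint:
    # Hypothetical model of one consistent on-disk state.
    txg: int                   # transaction number, higher means newer
    blocks: Dict[str, bytes]   # block name -> data as read back from disk
    expected: Dict[str, str]   # block name -> checksum recorded at write time


def verifies(cp: Checkpoint) -> bool:
    # A checkpoint is good only if every block it references is present
    # and still matches the checksum recorded for it.
    return all(
        name in cp.blocks and checksum(cp.blocks[name]) == digest
        for name, digest in cp.expected.items()
    )


def newest_good_checkpoint(checkpoints: List[Checkpoint]) -> Optional[Checkpoint]:
    # Try the newest checkpoint first and fall back to older ones.
    for cp in sorted(checkpoints, key=lambda c: c.txg, reverse=True):
        if verifies(cp):
            return cp
    return None  # nothing verifies, so there is no state to roll back to


# Checkpoint 1 was written cleanly; checkpoint 2 was torn mid-write.
cp1 = Checkpoint(1, {"a": b"old data"}, {"a": checksum(b"old data")})
cp2 = Checkpoint(
    2,
    {"a": b"old data", "b": b"garbage"},
    {"a": checksum(b"old data"), "b": checksum(b"new data")},
)
best = newest_good_checkpoint([cp1, cp2])
```

In this model the torn checkpoint 2 is rejected and `best` ends up as checkpoint 1, so everything that was safely on disk before the bad write is reachable again. Whether a real pool can be rewound this way is exactly the point under discussion.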
I've not seen that many cases of people praising ZFS' recovery ability. Uberblock problems seem to have a nasty habit of leaving you with tons of good, checksummed data on a pool that you can't get to, and while many hardware problems are dealt with, others can hang your entire pool.

> Bob
> ======================================
> Bob Friesenhahn
> bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
> GraphicsMagick Maintainer, http://www.GraphicsMagick.org/

_______________________________________________
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss