Cromar Scott wrote:
> Chris Siebenmann <[EMAIL PROTECTED]>
>
>  I'm not Anton Rang, but:
> | How would you describe the difference between the data recovery
> | utility and ZFS's normal data recovery process?
>
> cks> The data recovery utility should not panic 
> cks> my entire system if it runs into some situation 
> cks> that it utterly cannot handle. Solaris 10 U5 
> cks> kernel ZFS code does not have this property; 
> cks> it is possible to wind up with ZFS pools that 
> cks> will panic your system when you try to touch them.
> ...
>
> I'll go you one worse.  Imagine a Sun Cluster with several resource
> groups and several zpools.  You blow a proc on one of the servers.  As a
> result, the metadata on one of the pools becomes corrupted.
>   

This failure mode affects all shared-storage clusters.  I don't see why
ZFS should be treated any differently from raw devices, UFS, et al.
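
On releases new enough to have the pool failmode property (it is not in
S10 U5), the worst of the panic-on-touch behavior can at least be dialed
back.  A rough sketch, with "tank" standing in for your pool:

   # tell ZFS to block I/O (wait) or return EIO (continue) on a
   # catastrophic pool failure instead of panicking the box
   zpool set failmode=continue tank
   zpool get failmode tank

As I understand it, that only covers the fatal I/O error path; it will
not stop an assertion failure on corrupted metadata, and it will not
make a damaged pool importable.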

> http://mail.opensolaris.org/pipermail/zfs-discuss/2008-April/046951.html
>
> Now, each of the servers in your cluster attempts to import the
> zpool--and panics.
>
> As a result of a single part failure on a single server, your entire
> cluster (and all the services on it) are sitting in a smoking heap on
> your machine room floor.
>   

Yes, but your data is corrupted.  If you were my bank, then I would
greatly appreciate you correcting the data before bringing my account
back online.  If you study highly available clusters and services, you
will see that human intervention is often preferred to automation in
just such situations.  You will also find that a combination of
shared-storage and non-shared-storage cluster technology is used for
truly important data.  For example, we would use Solaris Cluster for
the local shared-storage framework and Solaris Cluster Geographic
Edition for a remote site (no hardware components shared with the
local cluster).
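
To make the "human in the loop" part concrete, here is a rough sketch of
the sort of manual inspection I would want before handing a suspect pool
back to the cluster.  "tank" is a placeholder name, and the read-only
import option only exists on newer ZFS releases:

   # list pools visible on the shared storage without importing them
   zpool import

   # where supported, import read-only so ZFS cannot write to the
   # damaged pool while we look around
   zpool import -o readonly=on tank
   zpool status -v tank

   # once the pool is imported read-write again, verify every block
   zpool scrub tank
   zpool status -v tank

Only after that would I let the resource group bring the service back
online.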

> | Nobody thinks that an answer of "sorry, we lost all of your data" is
> | acceptable.  However, there are failures which will result in loss of
> | data no matter how clever the file system is.
>
> cks> The problem is that there are currently ways to 
> cks> make ZFS lose all your data when there are no 
> cks> hardware faults or failures, merely people or
> cks> software mis-handling pools. This is especially 
> cks> frustrating when the only thing that is likely 
> cks> to be corrupted is ZFS metadata and the vast
> cks> majority (or all) of the data in the pool is intact, 
> cks> readable, and so on.
>
> I'm just glad that our pool corruption experience happened during
> testing, and not after the system had gone into production.  Not exactly
> a resume-enhancing experience.
>   

I'm glad you found this in testing.  BTW, what was the root cause?
 -- richard
