Claus Guttesen wrote:
>> | How would you describe the difference between the data recovery
>> | utility and ZFS's normal data recovery process?
>>
>>  The data recovery utility should not panic my entire system if it runs
>> into some situation that it utterly cannot handle. Solaris 10 U5 kernel
>> ZFS code does not have this property; it is possible to wind up with ZFS
>> pools that will panic your system when you try to touch them.
>>     
>
> I do agree. For the last three weeks I have been testing an
> ARC-1680 SAS controller with an external cabinet holding 16 SAS disks
> of 1 TB each. The server is an E5405 quad-core with 8 GB RAM. Setting
> the card to JBOD mode gave me a somewhat unstable setup where the
> disks would stop responding. After I had put all the disks on the
> controller in passthrough mode the setup stabilized, and I was able to
> copy 3.8 of 4 TB of small files when some of the disks bailed out. I
> brought the disks back online and restarted the server. A zpool status
> that showed the disks online also told me:
>
> errors: Permanent errors have been detected in the following files:
>         ef1/image/z018_16:<0x0>
>
> The zpool consisted of three separate raidz vdevs of five disks each
> in one zpool, plus one spare.
>
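Just to illustrate the layout described above: a pool like that would be
created with something along these lines.  The c1t*d0 device names are
placeholders, not the actual devices in this setup.

  # illustrative only -- substitute the real device names
  zpool create ef1 \
      raidz c1t0d0 c1t1d0 c1t2d0 c1t3d0 c1t4d0 \
      raidz c1t5d0 c1t6d0 c1t7d0 c1t8d0 c1t9d0 \
      raidz c1t10d0 c1t11d0 c1t12d0 c1t13d0 c1t14d0 \
      spare c1t15d0

"zpool status -v ef1" is what prints the per-file list of permanent
errors shown above.
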
> The only time I've experienced that the server could not bring the
> zpool back online was when the disks themselves failed for some
> reason. I find it completely valid that the server panics rather than
> writes inconsistent data.
>
> Every time our internal file server suffered an unplanned restart
> (power failure) it recovered (Solaris 10/08 and ZFS ver. 4). But this
> Sunday, Aug. the 10th, the same file server was brought down by a
> faulty UPS. When power was restored the zpool had become inconsistent.
> This time the storage was also affected by the power outage.
>
> Is it a valid point that ZFS is able to recover more gracefully when
> the server itself goes down than when some of the disks/LUNs bail
> out? I ask because that is the only time I've personally seen ZFS
> unable to recover.
>   

Later versions of ZFS, not yet in Solaris 10, are much more tolerant
of disappearing storage.  Solaris 10 update 6 should contain these
features later this year.  OpenSolaris 2008.05 and SXCE b72 or later
already have these features.

There is a failure mode that we worry about: ZFS depends on the disk
actually writing (flushing) data to nonvolatile storage when ZFS issues
a flush request.  If that does not actually occur, then you may see the
problems you describe.  While ZFS trusts storage less than most file
systems do, it must still trust that a flush request is honored.
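
As an aside, this is also why the cache-flush tunable has to be treated
with care.  The sketch below shows the /etc/system setting some people
use to stop ZFS from issuing flush requests at all; it is only safe when
the write cache really is nonvolatile (e.g. battery-backed array cache),
and on plain disks or a JBOD like the one above it should be left at the
default.

  # /etc/system -- only when the storage guarantees a nonvolatile write cache
  set zfs:zfs_nocacheflush = 1

With the default setting, ZFS issues a flush request when it commits a
transaction group or a ZIL write and, as above, has to trust that the
device honors it.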

>>  The data recovery utility can ask me questions about what I want it
>> to do in an ambiguous situation, or give me only partial results.
>>     
>
> Our NFS server was also on this faulty UPS. It is running Solaris 9
> on SPARC with VxFS and is managing 109 TB of storage on an HDS. When I
> switched the server on, I saw it replay the journal, mark the
> partitions as clean and come online. I know that there is no guarantee
> that the data is consistent, but at least VxFS has had many years to
> mature.
>
> I had initially planned to migrate some of the older partitions to ZFS
> and thereby test it. But I've changed that plan: I will run the setup
> with the ARC-1680 controller and SAS disks as an internal file server
> for a while, and add additional storage to our Solaris 9 server and
> VxFS instead.
>
> ZFS has changed the way I look at filesystems and I'm very glad that
> Sun gives it so much exposure. But at the moment I'd give VxFS the
> edge. :-)
>
>   

I've had excellent experiences with Sun-branded HDS storage:
rock solid.

For flaky hardware that seems to lose data during a power outage,
I'd prefer a file system that can detect that my data is corrupted.
 -- richard

_______________________________________________
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
