Re: [zfs-discuss] Split responsibility for data with ZFS

2008-12-13 Thread Joseph Zhou
Richard, I have been glancing through the posts and saw more hardware RAID vs 
ZFS discussion, some of it very useful.
However, as you advised me the other day, we should think about the overall 
solution architecture, not just the feature itself.

I believe the spirit of ZFS snapshots is more significant than what has been 
discussed so far: the rapid (though I don't know if it is stateful today) 
application migration capabilities that enhance overall business continuity, 
hopefully fulfilling enterprise availability requirements.  I really don't 
think any hardware RAID with an embedded snapshot feature can do that, and 
that is not just my humble opinion.

One example:
ZFS is used both to capture the guest as a snapshot and to move the 
compressed snapshot between servers.  This is not limited to the Sun xVM 
hypervisor; the same approach could be used for Solaris Zones or Sun Logical 
Domains.
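
As a rough, hypothetical illustration of the workflow I mean (dataset, 
snapshot, and host names are made up, and this assumes the guest's storage 
lives in its own ZFS dataset):

# zfs snapshot tank/guests/web01@migrate
    (take a point-in-time capture of the guest's dataset)
# zfs send tank/guests/web01@migrate | gzip | ssh host2 "gunzip | zfs receive tank/guests/web01"
    (stream the compressed snapshot to the target server, where the guest can 
    then be registered with whatever hypervisor or zone framework is in use)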

Best,
z



Re: [zfs-discuss] Split responsibility for data with ZFS

2008-12-13 Thread Tim
On Fri, Dec 12, 2008 at 8:16 PM, Jeff Bonwick  wrote:

> > I'm going to pitch in here as devil's advocate and say this is hardly
> > revolution.  99% of what zfs is attempting to do is something NetApp and
> > WAFL have been doing for 15 years+.  Regardless of the merits of their
> > patents and prior art, etc., this is not something revolutionarily new.
> > It may be "revolution" in the sense that it's the first time it's come to
> > open source software and been given away, but it's hardly "revolutionary"
> > in file systems as a whole.
>
> "99% of what ZFS is attempting to do?"  Hmm, OK -- let's make a list:
>
>end-to-end checksums
>unlimited snapshots and clones
>O(1) snapshot creation
>O(delta) snapshot deletion
>O(delta) incremental generation
>transactionally safe RAID without NVRAM
>variable blocksize
>block-level compression
>dynamic striping
>intelligent prefetch with automatic length and stride detection
>ditto blocks to increase metadata replication
>delegated administration
>scalability to many cores
>scalability to huge datasets
>hybrid storage pools (flash/disk mix) that optimize
> price/performance
>
> How many of those does NetApp have?  I believe the correct answer is 0%.
>
> Jeff


Seriously?  Do you know anything about the NetApp platform?  I'm hoping this
is a genuine question...

Off the top of my head nearly all of them.  Some of them have artificial
limitations because they learned the hard way that if you give customers
enough rope they'll hang themselves.  For instance "unlimited snapshots".
Do I even need to begin to tell you what a horrible, HORRIBLE idea that is?
"Why can't I get my space back?"  Oh, just do a snapshot list and figure out
which one is still holding the data.  What?  Your console locks up for 8
hours when you try to list out the snapshots?  Huh... that's weird.
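
(For what it's worth, the hunt being described usually comes down to a single 
command; the pool name here is hypothetical:

# zfs list -r -t snapshot -o name,used,referenced -s used tank
    lists every snapshot in the pool sorted by the space each one uniquely 
    holds, so the snapshots pinning the most data end up at the bottom.)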

It's sort of like that whole "unlimited filesystems" thing.  Just don't ever
reboot your server, right?  Or "you can have 40pb in one pool!!!".  How do
you back it up?  Oh, just mirror it to another system?  And when you hit a
bug that toasts both of them you can just start restoring from tape for the
next 8 years, right?  Or if by some luck we get a zfsiron, you can walk the
metadata for the next 5 years.

NVRAM has been replaced by flash drives in a ZFS world to get any kind of
performance... so you're trading one high priced storage for another.  Your
snapshot creation and deletion are identical.  Your incremental generation
is identical.  End-to-end checksums?  Yup.

Let's see... they don't have block-level compression; they chose dedup
instead, which nets better results.  "Hybrid storage pool" is achieved
through PAM modules.  Outside of that... I don't see ANYTHING in your list
they didn't do first.


--Tim


Re: [zfs-discuss] Split responsibility for data with ZFS

2008-12-13 Thread Jeff Bonwick
> Off the top of my head nearly all of them.  Some of them have artificial
> limitations because they learned the hard way that if you give customers
> enough rope they'll hang themselves.  For instance "unlimited snapshots".

Oh, that's precious!  It's not an arbitrary limit, it's a safety feature!

> Outside of that... I don't see ANYTHING in your list they didn't do first.

Then you don't know ANYTHING about either platform.  Constant-time
snapshots, for example.  ZFS has them;  NetApp's are O(N), where N is
the total number of blocks, because that's how big their bitmaps are.
If you think O(1) is not a revolutionary improvement over O(N),
then not only do you not know much about either snapshot algorithm,
you don't know much about computing.

Sorry, everyone else, for feeding the troll.  Chum the water all you like,
I'm done with this thread.

Jeff


Re: [zfs-discuss] help please - The pool metadata is corrupted

2008-12-13 Thread Brett
Well, after a couple of weeks of beating my head against this, I finally got my 
data back, so I thought I would post the process that recovered it.

I ran the Samsung ESTOOL utility, ran auto-scan, and for each disk that was 
showing the wrong physical size I:
   chose "set max address"
   chose "recover native size"

After that, when I booted back into Solaris, format showed the disks at the 
correct size again and I was able to zpool import:

AVAILABLE DISK SELECTIONS:
   0. c3d0 
  /p...@0,0/pci8086,2...@1c,4/pci-...@0/i...@0/c...@0,0
   1. c3d1 
  /p...@0,0/pci8086,2...@1c,4/pci-...@0/i...@0/c...@1,0
   2. c4d1 
  /p...@0,0/pci-...@1f,2/i...@0/c...@1,0
   3. c5d0 
  /p...@0,0/pci-...@1f,2/i...@1/c...@0,0
   4. c5d1 
  /p...@0,0/pci-...@1f,2/i...@1/c...@1,0
   5. c6d0 
  /p...@0,0/pci-...@1f,5/i...@0/c...@0,0
   6. c7d0 
  /p...@0,0/pci-...@1f,5/i...@1/c...@0,0
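
For anyone who hits the same thing, the import step itself was nothing special 
(pool name hypothetical):

# zpool import
    (scan the now-correctly-sized disks and list any importable pools)
# zpool import tank
    (import the pool by name; add -f only if it complains that the pool was 
    last in use on another system)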

I will just say, though, that there is something in ZFS which caused this in 
the first place: when I first replaced the faulty SATA controller, only 1 of 
the 4 disks showed the incorrect size in format, but then as I messed around 
trying to zpool export/import I eventually wound up in the state that all 4 
disks showed the wrong size.

Anyhow, I'm happy I got it all back working again, and I hope this solution 
assists others.

Regards Rep


Re: [zfs-discuss] Split responsibility for data with ZFS

2008-12-13 Thread Bryan Cantrill

> Seriously?  Do you know anything about the NetApp platform?  I'm hoping this
> is a genuine question...
> 
> Off the top of my head nearly all of them.  Some of them have artificial
> limitations because they learned the hard way that if you give customers
> enough rope they'll hang themselves.  For instance "unlimited snapshots".
> Do I even need to begin to tell you what a horrible, HORRIBLE idea that is?
> "Why can't I get my space back?"  Oh, just do a snapshot list and figure out
> which one is still holding the data.  What?  Your console locks up for 8
> hours when you try to list out the snapshots?  Huh... that's weird.
> 
> It's sort of like that whole "unlimited filesystems" thing.  Just don't ever
> reboot your server, right?  Or "you can have 40pb in one pool!!!".  How do
> you back it up?  Oh, just mirror it to another system?  And when you hit a
> bug that toasts both of them you can just start restoring from tape for the
> next 8 years, right?  Or if by some luck we get a zfsiron, you can walk the
> metadata for the next 5 years.
> 
> NVRAM has been replaced by flash drives in a ZFS world to get any kind of
> performance... so you're trading one high priced storage for another.  Your
> snapshot creation and deletion is identical.  Your incremental generations
> is identical.  End-to-end checksums?  Yup.
> 
> Let's see... they don't have block-level compression, they chose dedup
> instead which nets better results.  "Hybrid storage pool" is achieved
> through PAM modules.  Outside of that... I don't see ANYTHING in your list
> they didn't do first.

Wow -- I've spoken to many NetApp partisans over the years, but you might
just take the cake.  Of course, most of the people I talk to are actually
_using_ NetApp's technology, a practice that tends to leave even the most
stalwart proponents realistic about the (many) limitations of NetApp's
technology...

For example, take the PAM.  Do you actually have one of these, or are you
basing your thoughts on reading whitepapers?  I ask because (1) they are
horrifically expensive (2) they don't perform that well (especially
considering that they're DRAM!) (3) they're grossly undersized (a 6000
series can still only max out at a paltry 96G -- and that's with virtually
no slots left for I/O) and (4) they're not selling well.  So if you
actually bought a PAM, that already puts you in a razor-thin minority of
NetApp customers (most of whom see through the PAM and recognize it for
the kludge that it is); if you bought a PAM and think that it's somehow a
replacement for the ZFS hybrid storage pool (which has an order of magnitude
more cache), then I'm sure NetApp loves you:  you must be the dumbest,
richest customer that ever fell in their lap!

- Bryan

--
Bryan Cantrill, Sun Microsystems Fishworks.   http://blogs.sun.com/bmc


Re: [zfs-discuss] Split responsibility for data with ZFS

2008-12-13 Thread Bob Friesenhahn
On Sat, 13 Dec 2008, Tim wrote:
>
> Seriously?  Do you know anything about the NetApp platform?  I'm hoping this
> is a genuine question...

I believe that esteemed Sun engineers like Jeff are quite familiar 
with the NetApp platform.  Besides NetApp being one of the primary 
storage competitors, it is a virtual minefield out there and one must 
take great care not to step on other companies' patents.

> Off the top of my head nearly all of them.  Some of them have artificial
> limitations because they learned the hard way that if you give customers
> enough rope they'll hang themselves.  For instance "unlimited snapshots".
> Do I even need to begin to tell you what a horrible, HORRIBLE idea that is?
> "Why can't I get my space back?"  Oh, just do a snapshot list and figure out
> which one is still holding the data.  What?  Your console locks up for 8
> hours when you try to list out the snapshots?  Huh... that's weird.

I suggest that you retire to the safety of the rubber room while the 
rest of us enjoy these zfs features. By the same measures, you would 
advocate that people should never be allowed to go outside due to the 
wide open spaces.  Perhaps people will wander outside their homes and 
forget how to make it back.  Or perhaps there will be gravity failure 
and some of the people outside will be lost in space.

There is some activity off the starboard bow, perhaps you should check 
it out ...

Bob
==
Bob Friesenhahn
bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,http://www.GraphicsMagick.org/



Re: [zfs-discuss] help please - The pool metadata is corrupted

2008-12-13 Thread Bob Friesenhahn
On Sat, 13 Dec 2008, Brett wrote:
>
> I will just say though that there is something in zfs which caused 
> this in the first place as when i first replaced teh faulty sata 
> controller, only 1 of the 4 disks showed the incorrect size in 
> format but then as i messed around trying to zpool export/import i 
> eventually wound up in the sate that all 4 disks showed the wrong 
> size.

ZFS has absolutely nothing to do with the disk sizes reported by 
'format'.  The problem is elsewhere.  Perhaps it is a firmware or 
driver issue.

Bob
==
Bob Friesenhahn
bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,http://www.GraphicsMagick.org/



[zfs-discuss] [Fwd: Re: [indiana-discuss] build 100 image-update: cannot boot to previous BEs]

2008-12-13 Thread Sebastien Roy
zfs folks,

I sent the following to indiana-disc...@opensolaris.org, but perhaps
someone here can get to the bottom of this.  Why must zfs trash my
system so often with this hostid nonsense?  How do I recover from this
situation?  (I have no OpenSolaris boot CD with me at the moment, so
zpool import while booted off of the CD isn't an option)

-Seb

 Forwarded Message 
From: Sebastien Roy 
To: david.co...@sun.com
Cc: Indiana Discuss 
Subject: Re: [indiana-discuss] build 100 image-update: cannot boot to
previous BEs
Date: Sat, 13 Dec 2008 10:54:34 -0500

David,

On Thu, 2008-10-30 at 19:06 -0700, david.co...@sun.com wrote:
> > After an image-update to build 100, I can no longer boot to my previous
> > boot environments.  The system successfully boots into build 100, but my
> > build <= 99 boot environments all crash when mounting zfs root like this
> > (pardon the lack of a more detailed stack, I scribbled this on a piece
> > of paper):
> 
> Seb, can you reboot your build 100 BE one additional time?  After you
> do this, the hostid of the system should be restored to what it was
> originally and your build 99 BE should then boot.

While this seemed to work for an update from 99 to 100, I'm having this
same problem again, and this time, it's not resolvable with subsequent
reboots.

The issue is that I had a 2008.11 BE, and created another BE for
testing.  I rebooted over to this "test" BE and bfu'ed it with test
archives.  I can boot this "test" BE just fine, and I'm now done with my
testing.  I now can't boot _any_ of my other BEs that were created
prior to the "test" BE, including 2008.11.  They all panic as I
initially described:


mutex_owner_running()
lookuppnat()
vn_removeat()
vn_remove()
zfs'spa_config_write()
zfs'spa_config_sync()
zfs'spa_open_common()
zfs'spa_open()
zfs'dsl_dlobj_to_dsname()
zfs'zfs_parse_bootfs()
zfs'zfs_mountroot()
rootconf()
vfs_mountroot()
main()
_locore_start()

Is there another way to get my 2008.11 BE back?  Is there a bug filed
for this issue, either with ZFS boot, with bfu, or whatever it is that
decides to trash my system?  The issue was originally described as a
"hostid" issue.  Is panicking the best way to handle whatever problem
this is?

Thanks,
-Seb




[zfs-discuss] ZFS as a Gateway for a storage network

2008-12-13 Thread Dak
Hi all,
Currently I am planning a storage network for making backups of several 
servers.  At the moment there are several dedicated backup servers for this: 4 
nodes, each providing 2.5 TB of disk space and exporting it with CIFS over 
1 Gbit Ethernet.  Unfortunately this is not a very flexible way of providing 
disk space for backups.  The problem: the size of the file servers varies, and 
therefore the backup space is not used very well, in both an economic and a 
technical sense.
I want to redesign the current architecture to make it more flexible.  I have 
the following idea:
1. The 4 nodes become a storage backend; they provide their disk space as 
iSCSI devices.
2. A new server takes the role of a gateway to the storage network.  It 
aggregates the nodes by importing the iSCSI devices and building a ZFS storage 
pool over them (see the sketch after this list).  This way I get one big pool 
of storage, whose space can be exported with CIFS to the file servers for 
backups.
3. To get good performance I could establish a dedicated Gbit Ethernet network 
between the backup nodes and the gateway.  In addition, the gateway gets an 
iSCSI HBA.  The gateway would then be connected to the local network with 
several Gbit uplinks.
4. To get high availability I could build a fail-over cluster for the ZFS 
gateway.
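
A minimal sketch of step 2, assuming the backend nodes already export their 
disks as iSCSI targets, and using made-up addresses, device names, and a 
made-up pool name:

# iscsiadm add discovery-address 192.168.10.11 192.168.10.12 192.168.10.13 192.168.10.14
# iscsiadm modify discovery --sendtargets enable
    (let the gateway's iSCSI initiator discover the LUNs offered by the nodes)
# zpool create backup mirror c2t0d0 c3t0d0 mirror c4t0d0 c5t0d0
    (build one pool out of the remote LUNs, mirrored across nodes so a whole 
    backend node can fail without losing data)
# zfs create backup/fileserver1
# zfs set sharesmb=on backup/fileserver1
    (carve out a filesystem per file server and export it over CIFS)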

What do you think about this architecture? Could the gateway be a bottleneck? 
Do you have any other ideas or recommendations?

Regards,
Dak


Re: [zfs-discuss] ZFS as a Gateway for a storage network

2008-12-13 Thread Dave
Dak wrote:
> Hi together,
> Currently I am planning a storage network for making backups of several 
> servers. At the moment there are several dedicated backup server for it: 4 
> nodes; each node is providing 2.5 TB disk space and exporting it with CIFS 
> over Ethernet/1 GBIT. Unfortunately this is not a very flexible way of 
> providing disk space for backup purpose. The problem: the size of the file 
> server is varying and therefore the backup-space is not used very well - both 
> in an economic and technical view.
> I want to redesign the current architecture and I try to make it more 
> flexible. I have the following idea:
> 1. The 4 Nodes become a storage backend; they provide disk space as an ISCSI 
> device.
> 2. A new Server takes the role of a Gateway to the storage network. It will 
> aggregate the several nodes by including the iscsi devices and building a ZFS 
> storage pool over them. In this way I reached a big pool of storage. The 
> space of this pool could be export with CIFS to the file-servers for making 
> backups.
> 3. To reach good performance I could establish a dedicated GBIT-Ethernet 
> network between the backup nodes and the gateway. In addition the Gateway get 
> ISCSI HBA. The gateway should than be connected with the local network with 
> several GBIT uplinks.
> 4. To reach high availability I could build a fail-over cluster of the ZFS 
> gateway.
> 
> What do you think about this architecture? Could the gateway be a bottleneck? 
> Do you have any other ideas or recommendations?
> 

I have a setup similar to this. The most important thing I can recommend 
is to create a mirrored zpool from the iscsi disks.

-Dave


Re: [zfs-discuss] ZFS as a Gateway for a storage network

2008-12-13 Thread Bob Friesenhahn
On Sat, 13 Dec 2008, Dak wrote:

> What do you think about this architecture? Could the gateway be a 
> bottleneck? Do you have any other ideas or recommendations?

You will need to have redundancy somewhere to avoid possible data 
loss.  If redundancy is in the backend, then you should be protected 
from individual disk failure, but it is still possible to lose the 
entire pool if something goes wrong with the frontend pool. Unless you 
export individual backend server disks (or several volumes from a 
larger pool) using iSCSI the problem you may face is the resilver time 
if something goes wrong.  If the size of the backend storage volume is 
too big, then the resilver time will be excessively long.  You don't 
want to have to resilver up to 2.5TB since that might take days.  The 
ideal solution will figure out how to dice up the storage in order to 
minimize the amount of resilvering which must take place if something 
fails.

For performance you want to maximize the number of vdevs.  Simple 
mirroring is likely safest and most performant for your headend server 
with raidz or raidz2 on the backend servers.  Unfortunately, simple 
mirroring will waste half the space.  You could use raidz on the 
headend server to minimize storage space loss but performance will be 
considerably reduced since writes will then be ordered and all of the 
backend servers will need to accept the write before the next write 
can proceed.  Raidz will also reduce resilver performance since data 
has to be requested from all of the backend servers (over slow iSCSI) 
in order to re-construct the data.

If you are able to afford it, you could get rid of the servers you 
were planning to use as backend storage and replace them with cheap 
JBOD storage arrays which are managed directly with ZFS.  This is 
really the ideal solution in order to maximize performance, maximize 
reliability, and minimize resilver time.

Bob
==
Bob Friesenhahn
bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,http://www.GraphicsMagick.org/



Re: [zfs-discuss] Split responsibility for data with ZFS

2008-12-13 Thread Joseph Zhou
Hi Bob, Tim, Jeff, you are all my friends, and you all know what you are 
talking about.
As a friend, and trusting your personal integrity, I ask you, please, don't 
get mad, enjoy the open discussion.

(OK, OK, O(1) is revolutionary in tech thinking, just not revolutionary in 
end-customer value.  And safety features are important in risk management 
for enterprises.)

I have friends at NetApp, and there are people there that I don't give a 
damn about.

I am an enterprise architect; I don't care about the small environments that 
can be served most effectively by applications on any single operating 
environment.  They are not enterprises, and that business model is risky in 
economic downturns.

In that spirit, and looking at the NetApp virtual server support 
architecture, I would say --
as much as the ONTAP/WAFL approach (even with GX integration) is elegant, it 
would make more sense to use the file system's capabilities with kernel 
integration into hypervisors for virtual server deployments, instead of 
promoting a storage-device-based file system and data management solution 
(more proprietary at the solution level).

So, in my position, NetApp PiT is not as good as ZFS PiT, because it is too 
far from the hypervisor.
You can support me or attack me with more technical details (if you know 
whether NetApp is developing an API for all server hypervisors, I don't).
And don't worry, I have the biggest eagle, but so far, no one has been able 
to hurt that.   ;-)

Best,
z

- Original Message - 
From: "Bob Friesenhahn" 
To: "Tim" 
Cc: 
Sent: Saturday, December 13, 2008 11:03 AM
Subject: Re: [zfs-discuss] Split responsibility for data with ZFS


> On Sat, 13 Dec 2008, Tim wrote:
>>
>> Seriously?  Do you know anything about the NetApp platform?  I'm hoping 
>> this
>> is a genuine question...
>
> I believe that esteemed Sun engineers like Jeff are quite familiar
> with the NetApp platform.  Besides NetApp being one of the primary
> storage competitors, it is a virtual minefield out there and one must
> take great care not to step on other company's patents.
>
>> Off the top of my head nearly all of them.  Some of them have artificial
>> limitations because they learned the hard way that if you give customers
>> enough rope they'll hang themselves.  For instance "unlimited snapshots".
>> Do I even need to begin to tell you what a horrible, HORRIBLE idea that 
>> is?
>> "Why can't I get my space back?"  Oh, just do a snapshot list and figure 
>> out
>> which one is still holding the data.  What?  Your console locks up for 8
>> hours when you try to list out the snapshots?  Huh... that's weird.
>
> I suggest that you retire to the safety of the rubber room while the
> rest of us enjoy these zfs features. By the same measures, you would
> advocate that people should never be allowed to go outside due to the
> wide open spaces.  Perhaps people will wander outside their homes and
> forget how to make it back.  Or perhaps there will be gravity failure
> and some of the people outside will be lost in space.
>
> There is some activity off the starboard bow, perhaps you should check
> it out ...
>
> Bob
> ==
> Bob Friesenhahn
> bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
> GraphicsMagick Maintainer,http://www.GraphicsMagick.org/
>
> ___
> zfs-discuss mailing list
> zfs-discuss@opensolaris.org
> http://mail.opensolaris.org/mailman/listinfo/zfs-discuss 



[zfs-discuss] zpool mirror creation after non-mirrored zpool is set up

2008-12-13 Thread Mark Dornfeld
I have installed Solaris 10 on a ZFS filesystem that is not mirrored. Since I 
have an identical disk in the machine, I'd like to add that disk to the 
existing pool as a mirror. Can this be done, and if so, how do I do it?

Thanks


Re: [zfs-discuss] zpool mirror creation after non-mirrored zpool is set up

2008-12-13 Thread Jeff Bonwick
On Sat, Dec 13, 2008 at 04:44:10PM -0800, Mark Dornfeld wrote:
> I have installed Solaris 10 on a ZFS filesystem that is not mirrored. Since I 
> have an identical disk in the machine, I'd like to add that disk to the 
> existing pool as a mirror. Can this be done, and if so, how do I do it?

Yes:

# zpool attach <pool> <existing-device> <new-device>
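
For instance, if the root pool is called rpool and currently sits on c1t0d0s0, 
attaching c1t1d0s0 as its mirror (device names hypothetical) looks like:

# zpool attach rpool c1t0d0s0 c1t1d0s0
# zpool status rpool
    (shows the resilver in progress; the mirror only protects you once the 
    resilver completes)

For a ZFS root pool the new disk also needs boot blocks (installgrub on x86, 
installboot on SPARC) before the system can boot from it.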

Jeff


Re: [zfs-discuss] Split responsibility for data with ZFS

2008-12-13 Thread Bob Friesenhahn
On Sat, 13 Dec 2008, Joseph Zhou wrote:
>
> In that spirit, and looking at the NetApp virtual server support 
> architecture, I would say --
> as much as the ONTAP/WAFL thing (even with GX integration) is elegant, it 
> would make more sense to utilize the file system capabilities with kernal 
> integration to hypervisors, in virtual server deployments, instead of 
> promoting a storage-device-based file system and data management solution 
> (more proprietary at the solution level).

I am not an enterprise architect, but I do agree that when multiple 
client OSs are involved it is still useful if storage looks like a 
legacy disk drive.  Luckily, Solaris already offers iSCSI in Solaris 10, 
and OpenSolaris is now able to offer high-performance Fibre Channel 
target and Fibre Channel over Ethernet layers on top of reliable ZFS. 
The full benefit of ZFS is not provided, but the storage is 
successfully divorced from the client with a higher degree of data 
reliability and performance than is available from current 
firmware-based RAID arrays.
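
As a minimal sketch of that "legacy disk drive" idea, an OpenSolaris box of 
this era can carve a block device out of a pool and export it over iSCSI 
roughly like this (pool and volume names hypothetical; COMSTAR-based FC/FCoE 
targets are configured through separate stmf tooling rather than this 
property):

# zfs create -V 500g tank/backupvol
    (create a 500 GB zvol backed by the pool)
# zfs set shareiscsi=on tank/backupvol
    (export it as an iSCSI target via the legacy iscsitgt daemon)

Any initiator that logs in then sees an ordinary LUN, while the blocks 
underneath still get ZFS checksums and redundancy.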

Bob
==
Bob Friesenhahn
bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,http://www.GraphicsMagick.org/



Re: [zfs-discuss] Split responsibility for data with ZFS

2008-12-13 Thread Anton B. Rang
I wasn't joking, though as is well known, the plural of anecdote is not data.

Both UFS and ZFS, in common with all file systems, have design flaws and bugs.

To lose an entire UFS file system (barring the loss of the entire underlying 
storage) requires a great deal of corruption; there are multiple copies of the 
superblock, cylinder headers and their inodes are stored in a regular pattern 
and easily found by recovery tools, and the UFS file system check utility, 
while not perfect, can repair almost any corruption. There are third party 
tools which can perform much more analysis and recovery in a worst-case 
scenario. A single bad block rarely costs more than a small part of the file 
system.

To lose an entire ZFS pool requires that the most recent uberblock, or one of 
the top-level blocks to which it points, be damaged.  There are currently no 
recovery tools (at least, none of which I am aware).

I find it naïve to imagine that Sun customers "expect" their UFS (or other) 
file systems to be unrecoverable. Any case where fsck failed quickly became an 
escalation to the sustaining engineering organization. Restoring from backup is 
almost never a satisfactory answer for a commercial enterprise.

As usual, the disclaimer; I now work for another storage company, and while 
I've been on the teams developing and maintaining a number of commercial file 
systems (including two of Sun's), ZFS has not been one of them.


Re: [zfs-discuss] Split responsibility for data with ZFS

2008-12-13 Thread Anton B. Rang
Some RAID systems compare checksums on reads, though this is usually only for 
RAID-4 configurations (e.g. DataDirect) because of the performance hit 
otherwise.

End-to-end checksums are not yet common. The SCSI committee recently ratified 
T10 DIF, which allows either an operating system or application to supply 
checksums and have them stored and retrieved with data. Oracle has been working 
to add support for this to Linux, and several array and drive vendors have 
committed to implementing it. So one could say that ZFS is ahead of the curve 
here.
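
By way of contrast, exercising the checksum machinery on the ZFS side is a 
one-liner today (pool name hypothetical):

# zpool scrub tank
    (re-read every allocated block and verify it against its checksum)
# zpool status -v tank
    (report any checksum errors found and which files they affect)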

ZFS is not particularly revolutionary: software RAID has been around since the 
invention of the term; end-to-end checksums to disk have been used since the 
1960s (though more often in databases, tape, and optical media); WAFL-like file 
structures may pre-date NetApp. It does put these together for the first time 
in a widely available system, though, which is certainly innovative and useful. 
It will be more useful when it has a more complete disaster recovery model than 
'restore from backup.'


Re: [zfs-discuss] Split responsibility for data with ZFS

2008-12-13 Thread Richard Elling
Anton B. Rang wrote:
> I find it naïve to imagine that Sun customers "expect" their UFS (or other) 
> file systems to be unrecoverable. 

OK, I'll bite.  If we believe the disk vendors who rate their disks as having
an unrecoverable error rate of 1 bit per 10^14 bits read, and knowing that
UFS has absolutely no protection of its data, why would you think that it is
naive to think that a disk system with UFS cannot lose data?  Rather, I would
say it has a distinctly calculable probability.  Similarly, for ZFS, the
checksum is not perfect, so there is a calculable probability that the ZFS
checksum will not detect an unrecoverable (read) error.  The difference is
that the probability that ZFS will not detect an error is considerably smaller
than that of UFS (or FAT, or HSFS, or ...)
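
A back-of-the-envelope illustration of that calculable probability, assuming 
the vendor's 1-in-10^14 figure and a full read of a 500 GB disk:

    500 GB is roughly 4 x 10^12 bits read
    expected unrecoverable errors per full pass = 4 x 10^12 / 10^14 = 0.04

i.e. roughly a 4% chance of hitting at least one unrecoverable read on every 
complete scan of the disk, before any file-system-level detection enters the 
picture.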
> Any case where fsck failed quickly became an escalation to the sustaining 
> engineering organization. Restoring from backup is almost never a 
> satisfactory answer for a commercial enterprise.
>   

I agree.  However, I've personally experienced well over 100 fsck failures
over the years, and while I was always unsatisfied, I didn't always lose
data[1].  When I did lose data, perhaps it was data I could live without, but
that was my call.  Would you rather that ZFS simply say, "hey, you lost some
data, but we won't tell you where ..."?

[1] Once upon a time, I used a [vendor-name-elided] disk for a 2,300-user
e-mail message store.  I upgraded the OS, which implemented some new SCSI
options.  The disk's firmware didn't handle those options properly and would
wait about 7 hours before corrupting the UFS file system containing the
message store, requiring a full restore.  So, how many shifts do you think it
took to fail, recover, and ultimately resolve the disk firmware issue?  Hint:
the firmware rev arrived via UPS.

Personally, I'm very glad that a file system has come along that verifies
data ... and that feature seems to be catching on, as other file systems are
starting to do the same.  Hopefully, in a few years silent data corruption
will be a footnote in the lore of computing.
 -- richard



Re: [zfs-discuss] Split responsibility for data with ZFS

2008-12-13 Thread Richard Elling
Anton B. Rang wrote:
> Some RAID systems compare checksums on reads, though this is usually only for 
> RAID-4 configurations (e.g. DataDirect) because of the performance hit 
> otherwise.
>   

For the record, Solaris had a (mirrored) RAID system which would compare
data from both sides of the mirror upon read.  It never achieved significant
market penetration and was subsequently scrapped.  Many of the reasons that
the market did not accept it are solved by the method used by ZFS, which is
far superior.

> End-to-end checksums are not yet common. The SCSI committee recently ratified 
> T10 DIF, which allows either an operating system or application to supply 
> checksums and have them stored and retrieved with data. Oracle has been 
> working to add support for this to Linux, and several array and drive vendors 
> have committed to implementing it. So one could say that ZFS is ahead of the 
> curve here.
>   

Oracle also has data checksumming enabled by default for later releases.
I look forward to any field data analysis they may publish :-)

> ZFS is not particularly revolutionary: software RAID has been around since 
> the invention of the term; end-to-end checksums to disk have been used since 
> the 1960s (though more often in databases, tape, and optical media); 
> WAFL-like file structures may pre-date NetApp. It does put these together for 
> the first time in a widely available system, though, which is certainly 
> innovative and useful. It will be more useful when it has a more complete 
> disaster recovery model than 'restore from backup.'
>   

If you wish to implement a disaster recovery model, then you should look far
beyond what ZFS (or any file system) can provide.  Effective disaster
recovery requires significant attention to process.
 -- richard
