Re: [zfs-discuss] Split responsibility for data with ZFS
Richard, I have been glancing through the posts and saw more hardware RAID vs. ZFS discussion; some of it is very useful. However, as you advised me the other day, we should think about the overall solution architecture, not just the feature itself.

I believe the spirit of ZFS snapshots is more significant than what has been discussed: the rapid application migration capability (though I don't know whether migration is stateful today) enhances overall business continuity and hopefully fulfills enterprise availability requirements. I really don't think any hardware RAID with embedded snapshots can do that, and that is not merely my humble opinion.

One example: ZFS is used both to capture the guest from a snapshot and to move the compressed snapshot between servers, and this is not limited to the Sun xVM hypervisor; the same approach could be used for hosting Solaris Zones or Sun Logical Domains.

Best,
z
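To make the snapshot-migration example concrete, here is a minimal sketch of the send/receive workflow; the pool, dataset, and host names are made up, the guest image is assumed to live in a single dataset, and the parent dataset is assumed to already exist on the target host:

  host1# zfs snapshot tank/guests/webvm@migrate1
  host1# zfs send tank/guests/webvm@migrate1 | gzip | \
         ssh host2 "gunzip | zfs receive tank/guests/webvm"

A later incremental send (zfs send -i @migrate1 tank/guests/webvm@migrate2) only transfers the blocks changed since the first snapshot, which keeps the final cut-over window small.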
Re: [zfs-discuss] Split responsibility for data with ZFS
On Fri, Dec 12, 2008 at 8:16 PM, Jeff Bonwick wrote:
> > I'm going to pitch in here as devil's advocate and say this is hardly
> > revolution. 99% of what zfs is attempting to do is something NetApp and
> > WAFL have been doing for 15 years+. Regardless of the merits of their
> > patents and prior art, etc., this is not something revolutionarily new.
> > It may be "revolution" in the sense that it's the first time it's come
> > to open source software and been given away, but it's hardly
> > "revolutionary" in file systems as a whole.
>
> "99% of what ZFS is attempting to do?" Hmm, OK -- let's make a list:
>
>   end-to-end checksums
>   unlimited snapshots and clones
>   O(1) snapshot creation
>   O(delta) snapshot deletion
>   O(delta) incremental generation
>   transactionally safe RAID without NVRAM
>   variable blocksize
>   block-level compression
>   dynamic striping
>   intelligent prefetch with automatic length and stride detection
>   ditto blocks to increase metadata replication
>   delegated administration
>   scalability to many cores
>   scalability to huge datasets
>   hybrid storage pools (flash/disk mix) that optimize price/performance
>
> How many of those does NetApp have? I believe the correct answer is 0%.
>
> Jeff

Seriously? Do you know anything about the NetApp platform? I'm hoping this is a genuine question...

Off the top of my head, nearly all of them. Some of them have artificial limitations because they learned the hard way that if you give customers enough rope they'll hang themselves. For instance, "unlimited snapshots". Do I even need to begin to tell you what a horrible, HORRIBLE idea that is? "Why can't I get my space back?" Oh, just do a snapshot list and figure out which one is still holding the data. What? Your console locks up for 8 hours when you try to list out the snapshots? Huh... that's weird.

It's sort of like that whole "unlimited filesystems" thing. Just don't ever reboot your server, right? Or "you can have 40 PB in one pool!!!". How do you back it up? Oh, just mirror it to another system? And when you hit a bug that toasts both of them you can just start restoring from tape for the next 8 years, right? Or if by some luck we get a zfsiron, you can walk the metadata for the next 5 years.

NVRAM has been replaced by flash drives in a ZFS world to get any kind of performance... so you're trading one high-priced storage for another. Your snapshot creation and deletion is identical. Your incremental generation is identical. End-to-end checksums? Yup.

Let's see... they don't have block-level compression; they chose dedup instead, which nets better results. "Hybrid storage pool" is achieved through PAM modules. Outside of that... I don't see ANYTHING in your list they didn't do first.

--Tim
Re: [zfs-discuss] Split responsibility for data with ZFS
> Off the top of my head, nearly all of them. Some of them have artificial
> limitations because they learned the hard way that if you give customers
> enough rope they'll hang themselves. For instance, "unlimited snapshots".

Oh, that's precious! It's not an arbitrary limit, it's a safety feature!

> Outside of that... I don't see ANYTHING in your list they didn't do first.

Then you don't know ANYTHING about either platform. Constant-time snapshots, for example. ZFS has them; NetApp's are O(N), where N is the total number of blocks, because that's how big their bitmaps are. If you think O(1) is not a revolutionary improvement over O(N), then not only do you not know much about either snapshot algorithm, you don't know much about computing.

Sorry, everyone else, for feeding the troll. Chum the water all you like, I'm done with this thread.

Jeff
Re: [zfs-discuss] help please - The pool metadata is corrupted
Well, after a couple of weeks of beating my head, I finally got my data back, so I thought I would post the process that recovered it.

I ran the Samsung ESTool utility, ran auto-scan, and for each disk that was showing the wrong physical size I:

 - chose "set max address"
 - chose "recover native size"

After that, when I booted back into Solaris, format showed the disks at the correct size again and I was able to zpool import:

AVAILABLE DISK SELECTIONS:
  0. c3d0 /p...@0,0/pci8086,2...@1c,4/pci-...@0/i...@0/c...@0,0
  1. c3d1 /p...@0,0/pci8086,2...@1c,4/pci-...@0/i...@0/c...@1,0
  2. c4d1 /p...@0,0/pci-...@1f,2/i...@0/c...@1,0
  3. c5d0 /p...@0,0/pci-...@1f,2/i...@1/c...@0,0
  4. c5d1 /p...@0,0/pci-...@1f,2/i...@1/c...@1,0
  5. c6d0 /p...@0,0/pci-...@1f,5/i...@0/c...@0,0
  6. c7d0 /p...@0,0/pci-...@1f,5/i...@1/c...@0,0

I will just say, though, that there is something in ZFS which caused this in the first place: when I first replaced the faulty SATA controller, only 1 of the 4 disks showed the incorrect size in format, but as I messed around trying to zpool export/import I eventually wound up in the state where all 4 disks showed the wrong size.

Anyhow, I'm happy I got it all back working again, and hope this solution assists others.

Regards
Rep
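For reference, the import step at the end can be driven entirely from the command line; a minimal sketch, with "tank" standing in for whatever the pool is actually called:

  # zpool import          (lists pools that are available for import)
  # zpool import tank     (imports the pool by name)
  # zpool status tank     (confirms all devices are ONLINE)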
Re: [zfs-discuss] Split responsibility for data with ZFS
> Seriously? Do you know anything about the NetApp platform? I'm hoping
> this is a genuine question...
>
> Off the top of my head, nearly all of them. Some of them have artificial
> limitations because they learned the hard way that if you give customers
> enough rope they'll hang themselves. For instance, "unlimited snapshots".
> Do I even need to begin to tell you what a horrible, HORRIBLE idea that
> is? "Why can't I get my space back?" Oh, just do a snapshot list and
> figure out which one is still holding the data. What? Your console locks
> up for 8 hours when you try to list out the snapshots? Huh... that's
> weird.
>
> It's sort of like that whole "unlimited filesystems" thing. Just don't
> ever reboot your server, right? Or "you can have 40 PB in one pool!!!".
> How do you back it up? Oh, just mirror it to another system? And when
> you hit a bug that toasts both of them you can just start restoring from
> tape for the next 8 years, right? Or if by some luck we get a zfsiron,
> you can walk the metadata for the next 5 years.
>
> NVRAM has been replaced by flash drives in a ZFS world to get any kind
> of performance... so you're trading one high-priced storage for another.
> Your snapshot creation and deletion is identical. Your incremental
> generation is identical. End-to-end checksums? Yup.
>
> Let's see... they don't have block-level compression; they chose dedup
> instead, which nets better results. "Hybrid storage pool" is achieved
> through PAM modules. Outside of that... I don't see ANYTHING in your
> list they didn't do first.

Wow -- I've spoken to many NetApp partisans over the years, but you might just take the cake. Of course, most of the people I talk to are actually _using_ NetApp's technology, a practice that tends to leave even the most stalwart proponents realistic about the (many) limitations of NetApp's technology...

For example, take the PAM. Do you actually have one of these, or are you basing your thoughts on reading whitepapers? I ask because (1) they are horrifically expensive, (2) they don't perform that well (especially considering that they're DRAM!), (3) they're grossly undersized (a 6000 series can still only max out at a paltry 96G -- and that's with virtually no slots left for I/O), and (4) they're not selling well. So if you actually bought a PAM, that already puts you in a razor-thin minority of NetApp customers (most of whom see through the PAM and recognize it for the kludge that it is); if you bought a PAM and think that it's somehow a replacement for the ZFS hybrid storage pool (which has an order of magnitude more cache), then I'm sure NetApp loves you: you must be the dumbest, richest customer that ever fell in their lap!

- Bryan

--
Bryan Cantrill, Sun Microsystems Fishworks.       http://blogs.sun.com/bmc
Re: [zfs-discuss] Split responsibility for data with ZFS
On Sat, 13 Dec 2008, Tim wrote:
>
> Seriously? Do you know anything about the NetApp platform? I'm hoping
> this is a genuine question...

I believe that esteemed Sun engineers like Jeff are quite familiar with the NetApp platform. Besides NetApp being one of the primary storage competitors, it is a virtual minefield out there and one must take great care not to step on other companies' patents.

> Off the top of my head, nearly all of them. Some of them have artificial
> limitations because they learned the hard way that if you give customers
> enough rope they'll hang themselves. For instance, "unlimited snapshots".
> Do I even need to begin to tell you what a horrible, HORRIBLE idea that
> is? "Why can't I get my space back?" Oh, just do a snapshot list and
> figure out which one is still holding the data. What? Your console locks
> up for 8 hours when you try to list out the snapshots? Huh... that's
> weird.

I suggest that you retire to the safety of the rubber room while the rest of us enjoy these zfs features. By the same measure, you would advocate that people should never be allowed to go outside due to the wide open spaces. Perhaps people will wander outside their homes and forget how to make it back. Or perhaps there will be gravity failure and some of the people outside will be lost in space.

There is some activity off the starboard bow, perhaps you should check it out ...

Bob
==
Bob Friesenhahn
bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,    http://www.GraphicsMagick.org/
Re: [zfs-discuss] help please - The pool metadata is corrupted
On Sat, 13 Dec 2008, Brett wrote:
>
> I will just say, though, that there is something in ZFS which caused
> this in the first place: when I first replaced the faulty SATA
> controller, only 1 of the 4 disks showed the incorrect size in format,
> but as I messed around trying to zpool export/import I eventually wound
> up in the state where all 4 disks showed the wrong size.

ZFS has absolutely nothing to do with the disk sizes reported by 'format'. The problem is elsewhere. Perhaps it is a firmware or driver issue.

Bob
==
Bob Friesenhahn
bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,    http://www.GraphicsMagick.org/
[zfs-discuss] [Fwd: Re: [indiana-discuss] build 100 image-update: cannot boot to previous BEs]
zfs folks,

I sent the following to indiana-disc...@opensolaris.org, but perhaps someone here can get to the bottom of this. Why must zfs trash my system so often with this hostid nonsense? How do I recover from this situation? (I have no OpenSolaris boot CD with me at the moment, so zpool import while booted off of the CD isn't an option.)

-Seb

Forwarded Message
From: Sebastien Roy
To: david.co...@sun.com
Cc: Indiana Discuss
Subject: Re: [indiana-discuss] build 100 image-update: cannot boot to previous BEs
Date: Sat, 13 Dec 2008 10:54:34 -0500

David,

On Thu, 2008-10-30 at 19:06 -0700, david.co...@sun.com wrote:
> > After an image-update to build 100, I can no longer boot to my previous
> > boot environments. The system successfully boots into build 100, but my
> > build <= 99 boot environments all crash when mounting zfs root like this
> > (pardon the lack of a more detailed stack, I scribbled this on a piece
> > of paper):
>
> Seb, can you reboot your build 100 BE one additional time? After you
> do this, the hostid of the system should be restored to what it was
> originally and your build 99 BE should then boot.

While this seemed to work for an update from 99 to 100, I'm having this same problem again, and this time it's not resolvable with subsequent reboots. The issue is that I had a 2008.11 BE and created another BE for testing. I rebooted over to this "test" BE and bfu'ed it with test archives. I can boot this "test" BE just fine, and I'm now done with my testing. I now can't boot _any_ of my other BEs that were created prior to the "test" BE, including 2008.11. They all panic as I initially described:

  mutex_owner_running()
  lookuppnat()
  vn_removeat()
  vn_remove()
  zfs`spa_config_write()
  zfs`spa_config_sync()
  zfs`spa_open_common()
  zfs`spa_open()
  zfs`dsl_dlobj_to_dsname()
  zfs`zfs_parse_bootfs()
  zfs`zfs_mountroot()
  rootconf()
  vfs_mountroot()
  main()
  _locore_start()

Is there another way to get my 2008.11 BE back? Is there a bug filed for this issue, either with ZFS boot, with bfu, or whatever it is that decides to trash my system? The issue was originally described as a "hostid" issue. Is panicking the best way to handle whatever problem this is?

Thanks,
-Seb
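For context, the workaround usually suggested on these lists for hostid-related root-pool problems was to boot from install media or the failsafe archive and force-import the root pool so that its labels record the current hostid; whether that applies to this exact panic is unclear, and the pool name below is assumed to be the default rpool:

  # zpool import -f -R /a rpool
  # zpool export rpool
  # reboot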
[zfs-discuss] ZFS as a Gateway for a storage network
Hi all,

Currently I am planning a storage network for making backups of several servers. At the moment there are several dedicated backup servers for this: 4 nodes, each providing 2.5 TB of disk space and exporting it with CIFS over 1 Gbit Ethernet. Unfortunately this is not a very flexible way of providing disk space for backup purposes. The problem: the size of the file servers varies, and therefore the backup space is not used very well -- both in an economic and a technical view.

I want to redesign the current architecture and make it more flexible. I have the following idea:

1. The 4 nodes become a storage backend; they provide their disk space as iSCSI devices.
2. A new server takes the role of a gateway to the storage network. It aggregates the several nodes by attaching their iSCSI devices and building a ZFS storage pool over them. In this way I get one big pool of storage. The space of this pool could be exported with CIFS to the file servers for making backups.
3. To reach good performance I could establish a dedicated Gbit Ethernet network between the backup nodes and the gateway. In addition, the gateway gets an iSCSI HBA. The gateway would then be connected to the local network with several Gbit uplinks.
4. To reach high availability I could build a fail-over cluster of the ZFS gateway.

What do you think about this architecture? Could the gateway be a bottleneck? Do you have any other ideas or recommendations?

Regards,
Dak
Re: [zfs-discuss] ZFS as a Gateway for a storage network
Dak wrote:
> Hi all,
> Currently I am planning a storage network for making backups of several
> servers. At the moment there are several dedicated backup servers for
> this: 4 nodes, each providing 2.5 TB of disk space and exporting it with
> CIFS over 1 Gbit Ethernet. Unfortunately this is not a very flexible way
> of providing disk space for backup purposes. The problem: the size of
> the file servers varies, and therefore the backup space is not used very
> well -- both in an economic and a technical view.
> I want to redesign the current architecture and make it more flexible.
> I have the following idea:
> 1. The 4 nodes become a storage backend; they provide their disk space
> as iSCSI devices.
> 2. A new server takes the role of a gateway to the storage network. It
> aggregates the several nodes by attaching their iSCSI devices and
> building a ZFS storage pool over them. In this way I get one big pool of
> storage. The space of this pool could be exported with CIFS to the file
> servers for making backups.
> 3. To reach good performance I could establish a dedicated Gbit Ethernet
> network between the backup nodes and the gateway. In addition, the
> gateway gets an iSCSI HBA. The gateway would then be connected to the
> local network with several Gbit uplinks.
> 4. To reach high availability I could build a fail-over cluster of the
> ZFS gateway.
>
> What do you think about this architecture? Could the gateway be a
> bottleneck? Do you have any other ideas or recommendations?

I have a setup similar to this. The most important thing I can recommend is to create a mirrored zpool from the iSCSI disks.

-Dave
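As a concrete illustration of that suggestion, here is a minimal sketch of what the gateway side might look like once the backend targets are discovered; all addresses and device names are made up, and each mirror pair is assumed to combine LUNs from two different backend nodes:

  # iscsiadm add discovery-address 192.168.10.11    (repeat for each backend node)
  # iscsiadm modify discovery --sendtargets enable
  # devfsadm -i iscsi
  # zpool create backup mirror c2t1d0 c3t1d0 mirror c2t2d0 c3t2d0
  # zpool status backup

With a CIFS share (e.g. the sharesmb property, or Samba) on a dataset in that pool, the file servers see one large backup target, while the failure of any single backend node leaves every mirror with one healthy side.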
Re: [zfs-discuss] ZFS as a Gateway for a storage network
On Sat, 13 Dec 2008, Dak wrote:

> What do you think about this architecture? Could the gateway be a
> bottleneck? Do you have any other ideas or recommendations?

You will need to have redundancy somewhere to avoid possible data loss. If redundancy is in the backend, then you should be protected from individual disk failure, but it is still possible to lose the entire pool if something goes wrong with the frontend pool.

Unless you export individual backend server disks (or several volumes from a larger pool) using iSCSI, the problem you may face is the resilver time if something goes wrong. If the backend storage volumes are too big, then the resilver time will be excessively long. You don't want to have to resilver up to 2.5 TB, since that might take days. The ideal solution will figure out how to dice up the storage in order to minimize the amount of resilvering which must take place if something fails.

For performance you want to maximize the number of vdevs. Simple mirroring is likely the safest and most performant for your headend server, with raidz or raidz2 on the backend servers. Unfortunately, simple mirroring will waste half the space. You could use raidz on the headend server to minimize storage space loss, but performance will be considerably reduced since writes will then be ordered and all of the backend servers will need to accept a write before the next write can proceed. Raidz will also reduce resilver performance, since data has to be requested from all of the backend servers (over slow iSCSI) in order to reconstruct the data.

If you are able to afford it, you could get rid of the servers you were planning to use as backend storage and replace them with cheap JBOD storage arrays which are managed directly with ZFS. This is really the ideal solution in order to maximize performance, maximize reliability, and minimize resilver time.

Bob
==
Bob Friesenhahn
bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,    http://www.GraphicsMagick.org/
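To make the "dice up the storage" idea concrete, each backend node could expose several smaller zvols instead of one 2.5 TB LUN, so a resilver on the gateway only ever has to rebuild one slice. A minimal sketch using the legacy shareiscsi property available at the time (pool name, volume names, and sizes are examples):

  backend# zfs create -V 500g back/lun0
  backend# zfs create -V 500g back/lun1      (and so on for lun2, lun3, ...)
  backend# zfs set shareiscsi=on back/lun0
  backend# zfs set shareiscsi=on back/lun1

The gateway then sees a handful of modest LUNs per node and can mirror matching slices across nodes, so the loss of one disk or one node only triggers resilvers of the affected slices.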
Re: [zfs-discuss] Split responsibility for data with ZFS
Hi Bob, Tim, Jeff,

You are all my friends, and you all know what you are talking about. As a friend, and trusting your personal integrity, I ask you: please, don't get mad, enjoy the open discussion. (OK, OK, O(1) vs. O(N) is revolutionary in technical thinking, just not revolutionary in end-customer value. And safety features are important in risk management for enterprises.)

I have friends at NetApp, and there are people there that I don't give a damn about. I am an enterprise architect; I don't care about the little environments that can be served most effectively by any single operating environment's applications. They are not enterprises, and that business model is risky in economic downturns.

In that spirit, and looking at the NetApp virtual server support architecture, I would say: as much as the ONTAP/WAFL thing (even with GX integration) is elegant, it would make more sense to utilize the file system capabilities with kernel integration to hypervisors in virtual server deployments, instead of promoting a storage-device-based file system and data management solution (more proprietary at the solution level). So, in my position, NetApp PiT is not as good as ZFS PiT, because it is too far from the hypervisor. You can support me or attack me with more technical details (if you know whether NetApp is developing an API for all server hypervisors, I don't).

And don't worry, I have the biggest eagle, but so far no one has been able to hurt it. ;-)

Best,
z
[zfs-discuss] zpool mirror creation after non-mirrored zpool is set up
I have installed Solaris 10 on a ZFS filesystem that is not mirrored. Since I have an identical disk in the machine, I'd like to add that disk to the existing pool as a mirror. Can this be done, and if so, how do I do it?

Thanks
Re: [zfs-discuss] zpool mirror creation after non-mirrored zpool is set up
On Sat, Dec 13, 2008 at 04:44:10PM -0800, Mark Dornfeld wrote:
> I have installed Solaris 10 on a ZFS filesystem that is not mirrored.
> Since I have an identical disk in the machine, I'd like to add that disk
> to the existing pool as a mirror. Can this be done, and if so, how do I
> do it?

Yes:

  # zpool attach

Jeff
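To flesh that answer out a little: assuming the root pool has the default name rpool and lives on slice 0 of each disk (the device names below are hypothetical), the attach plus the boot-block installation on x86 would look roughly like this:

  # zpool attach rpool c0t0d0s0 c0t1d0s0
  # zpool status rpool       (wait for the resilver to complete)
  # installgrub /boot/grub/stage1 /boot/grub/stage2 /dev/rdsk/c0t1d0s0

On SPARC the last step uses installboot with the ZFS bootblk instead of installgrub. Without the boot block, the second disk cannot be booted from if the first one fails.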
Re: [zfs-discuss] Split responsibility for data with ZFS
On Sat, 13 Dec 2008, Joseph Zhou wrote:
>
> In that spirit, and looking at the NetApp virtual server support
> architecture, I would say: as much as the ONTAP/WAFL thing (even with GX
> integration) is elegant, it would make more sense to utilize the file
> system capabilities with kernel integration to hypervisors in virtual
> server deployments, instead of promoting a storage-device-based file
> system and data management solution (more proprietary at the solution
> level).

I am not an enterprise architect, but I do agree that when multiple client OSes are involved it is still useful if storage looks like a legacy disk drive. Luckily, Solaris already offers iSCSI in Solaris 10, and OpenSolaris is now able to offer high-performance Fibre Channel target and Fibre Channel over Ethernet layers on top of reliable ZFS. The full benefit of ZFS is not provided, but the storage is successfully divorced from the client with a higher degree of data reliability and performance than is available from current firmware-based RAID arrays.

Bob
==
Bob Friesenhahn
bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,    http://www.GraphicsMagick.org/
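As a rough sketch of what "storage that looks like a legacy disk drive" means in practice, a ZFS volume can be carved out of a pool and exported as a block-level LUN; on OpenSolaris with COMSTAR the steps look roughly like this (the pool and volume names are invented, and an iSCSI or FC target port still has to be configured separately):

  # zfs create -V 200g tank/lun0
  # sbdadm create-lu /dev/zvol/rdsk/tank/lun0
  # stmfadm list-lu -v
  # stmfadm add-view <GUID printed by the previous command>

The client then sees an ordinary SCSI disk, while ZFS provides checksumming, snapshots, and redundancy underneath it.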
Re: [zfs-discuss] Split responsibility for data with ZFS
I wasn't joking, though as is well known, the plural of anecdote is not data. Both UFS and ZFS, in common with all file systems, have design flaws and bugs.

To lose an entire UFS file system (barring the loss of the entire underlying storage) requires a great deal of corruption; there are multiple copies of the superblock, cylinder headers and their inodes are stored in a regular pattern and easily found by recovery tools, and the UFS file system check utility, while not perfect, can repair almost any corruption. There are third-party tools which can perform much more analysis and recovery in a worst-case scenario. A single bad block rarely costs more than a file or two.

To lose an entire ZFS pool requires that the most recent uberblock, or one of the top-level blocks to which it points, be damaged. There are currently no recovery tools (at least, none of which I am aware).

I find it naïve to imagine that Sun customers "expect" their UFS (or other) file systems to be unrecoverable. Any case where fsck failed quickly became an escalation to the sustaining engineering organization. Restoring from backup is almost never a satisfactory answer for a commercial enterprise.

As usual, the disclaimer: I now work for another storage company, and while I've been on the teams developing and maintaining a number of commercial file systems (including two of Sun's), ZFS has not been one of them.
Re: [zfs-discuss] Split responsibility for data with ZFS
Some RAID systems compare checksums on reads, though this is usually only for RAID-4 configurations (e.g. DataDirect) because of the performance hit otherwise.

End-to-end checksums are not yet common. The SCSI committee recently ratified T10 DIF, which allows either an operating system or application to supply checksums and have them stored and retrieved with data. Oracle has been working to add support for this to Linux, and several array and drive vendors have committed to implementing it. So one could say that ZFS is ahead of the curve here.

ZFS is not particularly revolutionary: software RAID has been around since the invention of the term; end-to-end checksums to disk have been used since the 1960s (though more often in databases, tape, and optical media); WAFL-like file structures may pre-date NetApp. It does put these together for the first time in a widely available system, though, which is certainly innovative and useful. It will be more useful when it has a more complete disaster recovery model than 'restore from backup.'
Re: [zfs-discuss] Split responsibility for data with ZFS
Anton B. Rang wrote:
> I find it naïve to imagine that Sun customers "expect" their UFS (or
> other) file systems to be unrecoverable.

OK, I'll bite. If we believe the disk vendors who rate their disks as having an unrecoverable error rate of 1 bit per 10^14 bits read (roughly one error per 12 TB read), and knowing that UFS has absolutely no protection of its data, why would you think it is naive to expect that a disk system with UFS can lose data? Rather, I would say it has a distinctly calculable probability.

Similarly, for ZFS, the checksum is not perfect, so there is a calculable probability that the ZFS checksum will not detect an unrecoverable (read) error. The difference is that the probability that ZFS will not detect an error is considerably smaller than that of UFS (or FAT, or HSFS, or ...).

> Any case where fsck failed quickly became an escalation to the
> sustaining engineering organization. Restoring from backup is almost
> never a satisfactory answer for a commercial enterprise.

I agree. However, I've personally experienced well over 100 fsck failures over the years, and while I was always unsatisfied, I didn't always lose data [1]. When I did lose data, perhaps it was data I could live without, but that was my call. Would you rather that ZFS simply say, "hey, you lost some data, but we won't tell you where..."?

[1] Once upon a time, I used a [vendor-name-elided] disk for a 2,300-user e-mail message store. I upgraded the OS, which implemented some new SCSI options. The disk's firmware didn't handle those options properly and would wait about 7 hours before corrupting the UFS file system containing the message store, requiring a full restore. So, how many shifts do you think it took to fail, recover, and ultimately resolve the disk firmware issue? Hint: the firmware rev arrived via UPS.

Personally, I'm very glad that a file system has come along that verifies data... and that feature seems to be catching on, as other file systems seem to be doing the same. Hopefully, in a few years silent data corruption will be a footnote in the lore of computing.

-- richard
Re: [zfs-discuss] Split responsibility for data with ZFS
Anton B. Rang wrote:
> Some RAID systems compare checksums on reads, though this is usually
> only for RAID-4 configurations (e.g. DataDirect) because of the
> performance hit otherwise.

For the record, Solaris had a (mirrored) RAID system which would compare data from both sides of the mirror upon read. It never achieved significant market penetration and was subsequently scrapped. Many of the reasons that the market did not accept it are solved by the method used by ZFS, which is far superior.

> End-to-end checksums are not yet common. The SCSI committee recently
> ratified T10 DIF, which allows either an operating system or application
> to supply checksums and have them stored and retrieved with data. Oracle
> has been working to add support for this to Linux, and several array and
> drive vendors have committed to implementing it. So one could say that
> ZFS is ahead of the curve here.

Oracle also has data checksumming enabled by default for later releases. I look forward to any field data analysis they may publish :-)

> ZFS is not particularly revolutionary: software RAID has been around
> since the invention of the term; end-to-end checksums to disk have been
> used since the 1960s (though more often in databases, tape, and optical
> media); WAFL-like file structures may pre-date NetApp. It does put these
> together for the first time in a widely available system, though, which
> is certainly innovative and useful. It will be more useful when it has a
> more complete disaster recovery model than 'restore from backup.'

If you wish to implement a disaster recovery model, then you should look far beyond what ZFS (or any file system) can provide. Effective disaster recovery requires significant attention to process.

-- richard