On Thu, Jul 31, 2008 at 16:25, Ross <[EMAIL PROTECTED]> wrote:

> The problems with zpool status hanging concern me, knowing that I can't
> hot plug drives is an issue, and the long resilver times bug is also a
> potential problem. I suspect I can work around the hot plug drive bug
> with a big warning label on the server, but knowing the pool can hang
> so easily makes me worry about how well ZFS will handle other faults.

Other hardware-failure type things can cause what appear to be big
problems, too. We have a scsi->sata enclosure here with some embedded
firmware, connected to a scsi controller on an x4150. I swapped some
disks in the enclosure, updated the controller configuration, and
rebooted the controller... and the host box died, because ZFS decided
that too many disks were unavailable to continue, so it panicked the
box. At first I thought this behavior was terrible (my server is
down!), but on reflection it makes sense: it's better to halt than to
scribble garbage across a whole pool because of a controller failure or
something of that sort.
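For what it's worth, newer Nevada builds expose this as a per-pool
"failmode" property, so the panic isn't the only option anymore. A
rough sketch, assuming your build has the property and using a made-up
pool name "tank":

```shell
# Show the current failure-mode setting for the pool.
zpool get failmode tank

# "wait" blocks I/O until the devices come back, "continue" returns
# EIO to new writes, and "panic" crashes the box as described above.
zpool set failmode=continue tank
```

Whether "continue" or "wait" is safer than a panic depends on what else
is running on the box; the point is just that the behavior is tunable.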
In any case, I thought you'd be interested in this property of zpools.
It's not likely to happen in general (especially with DAS and a dumb
controller, like you have), and it's better than the alternative of
potentially scribbling on a pool, but other services running on the
same box could suffer if you were incautious.

> On my drive home tonight I was wondering whether I'm going to have to
> swallow my pride and order a hardware raid controller for this server,
> letting that deal with the drive issues, and just using ZFS as a very
> basic filesystem.

If you're going to use ZFS at all, letting it handle at least one layer
of redundancy is always recommended. Otherwise it can get into a
situation where it finds checksum errors but can't do anything to
repair them.

> The question is whether I can make a server I can be confident in. I'm
> now planning a very basic OpenSolaris server just using ZFS as an NFS
> server; is there anybody out there who can re-assure me that such a
> server can work well and handle real life drive failures?

We haven't had any "real life" drive failures at work, but at home I
took some old flaky IDE drives and put them in a Pentium III box
running Nevada. Several of them were known to cause errors under
Linux, so I mirrored them in pairs of roughly the same size and set up
weekly scrubs. Two of the six drives failed entirely, and were cleanly
retired, before I gave up on the idea and bought new disks. I didn't
lose any data with this scheme, and ZFS told me every once in a while
that it had recovered from a checksum error. Good drives are always
recommended, of course, but I saw nothing but good behavior from ZFS
on old, broken hardware while I was using it.

Finally, at work we're switching everything over to ZFS because it's
so convenient... but we keep tape backups nonetheless. I strongly
recommend having up-to-date backups in any situation, but even more so
with ZFS.
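In case it helps, the mirrored-pairs-plus-weekly-scrubs setup I
described looks roughly like this. The pool name and device names are
invented for illustration; substitute your own:

```shell
# Mirror the flaky drives in similar-sized pairs. With mirrors, ZFS
# can repair a block that fails its checksum by reading the other
# side of the mirror and rewriting the bad copy.
zpool create tank mirror c1t0d0 c1t1d0 mirror c2t0d0 c2t1d0

# Weekly scrub from root's crontab: read and verify every block in
# the pool, fixing anything with a bad checksum (Sundays at 03:00).
#   0 3 * * 0 /usr/sbin/zpool scrub tank

# After a scrub, check what was found and repaired:
zpool status -v tank
```

The scrub is what surfaces latent errors on drives you aren't reading
often; without it, a bad block can sit undetected until the other half
of the mirror also fails.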
It's been very reliable for me personally and at work, but I've seen
horror stories of corrupt pools from which all data was lost. I'd
rather be sitting around the campfire quaking in my boots at story time
than have a flashlight pointed at my face doing the telling, if you
catch my drift.

Will
_______________________________________________
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss