Scott L. Burson wrote:
> Hi,
>
> This is in build 74, on x64, on a Tyan S2882-D with dual Opteron 275 and
> 24GB of ECC DRAM.

Not an answer, but zfs-discuss is probably the best place to ask, so I've
taken the liberty of CCing that list.
> I seem to have lost the entire contents of a ZFS raidz pool. The pool is in a state where, if ZFS looks at it, I get a kernel panic. To make it possible to boot the machine, I had to boot into safe mode and rename `/etc/zfs/zpool.cache' (fortunately, this was my only pool on the machine).
>
> Okay, from the beginning. I bought the drives in October: three 500GB Western Digital WD5000ABYS SATA drives, installed them in the box in place of three 250GB Seagates I had been using, and created the raidz pool. For the first couple of months everything was hunky dory. Then, a couple of weeks ago, I moved the machine to a different location in the building, which wouldn't even be worth mentioning except that that's when I started to have problems. The first time I powered it up, one of the SATA drives didn't show up; I reseated the drive connectors and tried again, and it seemed fine. I thought that was odd, since I hadn't had one of those connectors come loose on me before, but I scrubbed the pool, cleared the errors on the drive, and thought that was the end of it.
>
> It wasn't. `zpool status' continued to report errors, only now they were write and read errors, and spread across all three drives. I started to copy the most critical parts of the filesystem contents onto other machines (very fortunately, as it turned out). After a while, the drive that had previously not shown up was marked faulted, and the other two were marked degraded. Then, yesterday, there was a much larger number of errors -- over 3000 read errors -- on a different drive, and that drive was marked faulted and the other two (i.e. including the one that had previously been faulted) were marked degraded. Also, `zpool status' told me I had lost some "files"; these turned out to be all, or mostly, directories, some containing substantial trees.
>
> By this point I had already concluded I was going to have to replace a drive, and had picked up a replacement. I installed it in place of the drive that was now marked faulted, and powered up. I was met with repeated panics and reboots. I managed to copy down part of the backtrace:
>
>   unix:die+c8
>   unix:trap+1351
>   unix:cmntrap+e9
>   unix:mutex_enter+b
>   zfs:metaslab_free+97
>   zfs:zio_dva_free+29
>   zfs:zio_next_stage+b3
>   zfs:zio_gang_pipeline+??
>
> (This may contain typos, and I didn't get the offset on that last frame.)
>
> At this point I tried replacing the drive I had just removed (removing the new, blank drive), but that didn't help. So, as mentioned above, I tried booting into safe mode and renaming `/etc/zfs/zpool.cache' -- just on a hunch, but I figured there had to be some such way to make ZFS forget about the pool -- and that allowed me to boot.
>
> I used good old `format' to run read tests on the drives overnight -- no bad blocks were detected.
>
> So, there are a couple lines of discussion here. On the one hand, it seems I have a hardware problem, but I haven't yet diagnosed it. More on this below. On the other, even in the face of hardware problems, I have to report some disappointment with ZFS. I had really been enjoying the warm fuzzy feeling ZFS gave me (and I was talking it up to my colleagues; I'm the only one here using it). Now I'm in a worse state than I would probably be with UFS on RAID, where `fsck' would probably have managed to salvage a lot of the filesystem (I would certainly be able to mount it!
> -- unless the drives were all failing catastrophically, which doesn't seem to be happening).
>
> One could say, there are two aspects to filesystem robustness: integrity checking and recovery. ZFS, with its block checksums, gets an A in integrity checking, but now appears to do very poorly in recovering in the face of substantial but not total hardware degradation, when that degradation is sufficiently severe that the redundancy of the pool can't correct for it.
>
> Perhaps this is a vanishingly rare case and I am just very unlucky. Nonetheless I would like to make some suggestions. (1) It would still be nice to have a salvager. (2) I think it would make sense, at least as an option, to add even more redundancy to ZFS's on-disk layout; for instance, it could keep copies of all directories.
>
> Okay, back to my hardware problems. I know you're going to tell me I probably have a bad power supply, and I can't rule that out, but it's an expensive PSU and generously sized for the box; and the box had been rock stable for a good 18 months before this happened. I'm naturally more inclined to suspect the new components, which are the SATA drives. (I also have three SCSI drives in the box for /, swap, etc., and they don't seem to be having any trouble, though I'm not running ZFS on them so maybe I wouldn't know.) It's definitely not DRAM; it's all ECC and `fmstat' is not reporting any errors. On the other hand, it's implausible (though not totally so) that three new drives would all suffer infant mortality at the same time. Suggestions invited (I haven't been able to get SunVTS to work, alas).
>
> And, if anyone can tell me how to make this pool mountable again, by manually fiddling with the superblock or whatever, that would be great (though I'm not holding my breath). I haven't overwritten the drive contents yet, so this might conceivably be possible.
>
> Thanks for your time.
>
> -- Scott
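For anyone who hits the same panic and finds this in the archives: the workaround Scott describes -- moving the cache file aside from safe mode so the pool isn't opened automatically at boot -- looks roughly like the sketch below. This is only a sketch, not a recipe: I'm assuming the failsafe boot has mounted the normal root under /a as it usually does, and `tank' is just a placeholder pool name.

    # From the failsafe shell, with the regular root mounted at /a, move
    # the cache file aside; without it, no pools are opened automatically
    # on the next boot.
    mv /a/etc/zfs/zpool.cache /a/etc/zfs/zpool.cache.bad
    reboot

    # After a normal boot, see whether ZFS can still find the pool on the
    # attached devices.  This only scans and reports; it does not import
    # anything yet.
    zpool import

    # Attempting the actual import may well hit the same damaged metadata
    # and panic again, so save off anything recoverable first.  -f is only
    # needed if the pool still looks in use.
    zpool import -f tank

None of this repairs anything, of course; it just controls when ZFS tries to open the pool.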
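On the hardware question, two stock tools that sometimes help separate cabling or controller trouble from genuinely failing disks, alongside the `format' read test (nothing below is specific to this box):

    # Per-device error counters from the disk drivers; a high transport
    # error count tends to point at cables, backplane, or controller
    # rather than the media itself.
    iostat -En

    # The raw FMA ereports behind the counters that zpool status and
    # fmstat summarize, with timestamps and device paths.
    fmdump -eV | less

If the transport error counts are climbing on all three SATA drives at once, that would point more at the shared path -- controller, cabling, or power -- than at three simultaneous drive failures.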