Hi List,

First of all: S10u4 120011-14
So I have a weird situation. Earlier this week, I finally mirrored two iSCSI-based pools. I had been wanting to do this for some time, because the availability of the data in these pools is important. One pool mirrored just fine, but the other pool is another story.

First lesson (I think): you should scrub your pools, at least those backed by a SAN, before mirroring them. The problem pool was scrubbed about two weeks before I mirrored it, and it was clean. I assumed, wrongly, that no checksum errors had appeared in the time that elapsed. Well, guess again. When I mirrored this guy, the source side of the mirror had two checksum errors. Interestingly, the target inherited these errors, so now both sides of the mirror showed two checksum errors in the counters. I don't know if this was real, or if the zpool attach operation just incremented the counters on the second half of the mirror.

My next mistake was to assume the counters were in error on the second half of the mirror, and so I zeroed out the counters with zpool clear.

OK, so now I scrub the pool, and no checksum errors are found on either side of the mirror. Huh?!? What about those two checksum errors on the first side? OK, so I run zdb on the pool, and it finds scads of errors:

    Traversing all blocks to verify checksums and verify nothing leaked ...
    zdb_blkptr_cb: Got error 50 reading <33, 727252, 0, 4a> -- skipping--
    ...

and then tons of:

    Error counts:
        errno  count
           50    123
    leaked space: vdev 0, offset 0x4deaed800, size 2048
    ...

OK, this is odd, so I scrub the pool again, and this time it finds 4 checksum errors on the initial side of the mirror, but none on the other side. That makes some sense (though I don't know what changed), so I break the mirror, taking off the original side that has the checksum errors. I then scrub the pool: no errors found. That's good, but just to be sure, I run zdb on it, and it finds tons of the same errors as it found on the original side of the mirror. Argh!
In the meantime, I ran 4 passes of format -> analyze -> compare on the initial half of the mirror that had the checksum errors, and it's totally clean hardware-wise.

So my questions are these:

1) Does zdb leaked space mean trouble with the pool?

2) Is it possible that the errors got injected into the new half of the mirror when I attached it? For now, I'm going to assume that the new half of the mirror is OK hardware-wise.

3) I'm running a scrub and zdb on the other pool that lives on these SAN boxes, because I want to see if they come up with the same problems. If not, what would be going on with this crazy pool?

4) Can I recover from this without copying the whole pool to new storage? If not, it will be painful for us. We will have to reboot 350 servers and workstations on stale file handles, interrupting hundreds of production processes. My user base is losing faith in my team.

Oh sage ones, please advise.

Thanks in advance.

Jon
_______________________________________________
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss