[zfs-discuss] Inconsistencies with scrub and zdb

Jonathan Loran Sun, 04 May 2008 00:01:56 -0700

Hi List,

First of all:  S10u4 120011-14


So I have a weird situation.  Earlier this week, I finally mirrored up
two iSCSI-based pools.  I had been wanting to do this for some time,
because the availability of the data in these pools is important.  One
pool mirrored just fine, but the other pool is another story.

First lesson (I think) is that you should scrub your pools, at least
those backed by a SAN, before mirroring them.  The problem pool was
scrubbed about two weeks before I mirrored it, and it was clean.  I
assumed, wrongly, that no checksum errors had appeared in the time that
elapsed.  Well, guess again.  When I mirrored this guy, the source side
of the mirror had two checksum errors.  Interestingly, the target
inherited these errors, and so now both sides of the mirror showed two
checksum errors in the counters.  I don't know if this was real, or if
the zpool attach operation just incremented the counters on the second
half of the mirror.
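
To compare the CKSUM counters on the two halves after an attach,
something like the following Python sketch could pull them out of saved
`zpool status` output.  It is only an illustration: the sample text is
invented (the pool name `tank` and the device names are hypothetical, not
my real pool), and the parser assumes the usual NAME/STATE/READ/WRITE/CKSUM
column layout.

```python
def cksum_counts(status_text):
    """Return {name: cksum_count} for each row of the config table
    in `zpool status` output (assumed NAME STATE READ WRITE CKSUM columns)."""
    counts = {}
    in_table = False
    for line in status_text.splitlines():
        fields = line.split()
        if fields[:5] == ["NAME", "STATE", "READ", "WRITE", "CKSUM"]:
            in_table = True
            continue
        if not in_table:
            continue
        if not fields:
            # A blank line ends the config table.
            in_table = False
            continue
        if len(fields) >= 5 and fields[4].isdigit():
            counts[fields[0]] = int(fields[4])
    return counts


# Hypothetical sample, made up for illustration only.
SAMPLE = """\
  pool: tank
 state: ONLINE
config:

        NAME        STATE     READ WRITE CKSUM
        tank        ONLINE       0     0     0
          mirror    ONLINE       0     0     0
            c2t1d0  ONLINE       0     0     2
            c3t1d0  ONLINE       0     0     2

errors: No known data errors
"""

if __name__ == "__main__":
    nonzero = {name: n for name, n in cksum_counts(SAMPLE).items() if n}
    print(nonzero)
```

Saving the status output before and after the attach and comparing the
two dictionaries would at least show whether the counters on the new
half appeared at attach time.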

My next mistake was to assume the counters were in error on the second
half of the mirror, and so I zeroed out the counters with zpool clear.
OK, so now I scrub the pool, and no checksum errors were found on either
side of the mirror.  Huh?!?  What about those two checksum errors on the
first half?  OK, so I run zdb on the pool, and it finds scads of errors:

Traversing all blocks to verify checksums and verify nothing leaked ...

zdb_blkptr_cb: Got error 50 reading <33, 727252, 0, 4a> -- skipping--
...

and then tons of:

Error counts:
errno count
50 123
leaked space: vdev 0, offset 0x4deaed800, size 2048
...
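
Since the zdb output runs to many lines, a small Python sketch like this
could tally it.  It assumes every line of interest matches one of the two
formats excerpted above ("Got error N reading <...>" and "leaked space:
vdev V, offset 0x..., size S"); the function name `summarize_zdb` is made
up for illustration.

```python
import re
from collections import Counter

# Patterns modeled on the two kinds of lines quoted above.
READ_ERR = re.compile(r"Got error (\d+) reading <([^>]*)>")
LEAK = re.compile(r"leaked space: vdev (\d+), offset (0x[0-9a-fA-F]+), size (\d+)")


def summarize_zdb(text):
    """Count read errors per errno and total up the leaked space lines."""
    errors = Counter()
    leaked_ranges = 0
    leaked_bytes = 0
    for line in text.splitlines():
        m = READ_ERR.search(line)
        if m:
            errors[int(m.group(1))] += 1
            continue
        m = LEAK.search(line)
        if m:
            leaked_ranges += 1
            leaked_bytes += int(m.group(3))
    return {
        "read_errors": dict(errors),
        "leaked_ranges": leaked_ranges,
        "leaked_bytes": leaked_bytes,
    }


if __name__ == "__main__":
    sample = (
        "zdb_blkptr_cb: Got error 50 reading <33, 727252, 0, 4a> -- skipping--\n"
        "leaked space: vdev 0, offset 0x4deaed800, size 2048\n"
    )
    print(summarize_zdb(sample))
```

Running the same tally on the zdb output from each pool would make it
easier to see whether both report the same kind and number of errors.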


OK, this is odd, so I scrub the pool again, and this time it found 4
checksum errors on the initial half of the mirror, but none on the other
half.  That makes some sense (though I don't know what changed), so I
break the mirror, taking off the original side that has the checksum
errors.  I then scrub the pool, and no errors are found.  That's good,
but just to be sure, I run zdb on it, and it finds tons of the same
errors as it found on the original side of the mirror.  Argh!

In the meantime, I ran 4 passes of format -> analyze -> compare on the
initial half of the mirror that had the checksum errors, and it's totally
clean hardware-wise.

So my questions are these:

1) Does zdb leaked space mean trouble with the pool?
2) Is it possible that the errors got injected into the new half of the
mirror when I attached it? For now, I'm going to assume that the new
half of the mirror is OK, hardware-wise.
3) I'm running a scrub and zdb on the other pool that lives on these SAN
boxes, because I want to see if they come up with the same problems. If
not, what would be going on with this crazy pool?
4) Can I recover from this without copying the whole pool to new
storage? If not, it will be painful for us. We will have to reboot 350
servers and workstations on stale file handles, interrupting hundreds of
production processes. My user base is losing faith in my team.

Oh sage ones, please advise. Thanks in advance.

Jon

