On 19.01.13 18:17, Bob Friesenhahn wrote:
On Sat, 19 Jan 2013, Stephan Budach wrote:
Now, this zpool is made of 3-way mirrors and currently 13 out of 15
vdevs are resilvering (as they already did yesterday), and I never got
any error while resilvering. I have been all over the setup to find any
glitch or bad part, but I couldn't come up with anything significant.
Doesn't this sound improbable? Wouldn't one expect to encounter other
chksum errors while resilvering is running?
I can't attest to chksum errors since I have yet to see one on my
machines (have seen several complete disk failures, or disks faulted
by the system though). Checksum errors are bad and not seeing them
should be the normal case.
I know, and it's really bugging me that I seem to have these chksum
errors on all of my machines, be it Sun gear or Dell.
Resilver may in fact be just verifying that the pool disks are
coherent via metadata. This might happen if the fiber channel is
flapping.
Regarding the dire fiber channel issue, are you using fiber channel
switches or direct connections to the storage array(s)? If you are
using switches, are they stable or are they doing something terrible
like resetting? Do you have duplex connectivity? Have you verified
that your FC HBA's firmware is correct?
Looking at my FC switches, I am noticing errors like these:
[656][Thu Dec 06 03:33:04.795 UTC 2012][I][8600.001E][Port][Port: 2][PortID 0x30200 PortWWN 10:00:00:06:2b:12:d3:55 logged out of nameserver.]
[657][Thu Dec 06 03:33:05.829 UTC 2012][I][8600.0020][Port][Port: 2][SYNC_LOSS]
[658][Thu Dec 06 03:37:08.077 UTC 2012][I][8600.001F][Port][Port: 2][SYNC_ACQ]
[659][Thu Dec 06 03:37:10.582 UTC 2012][I][8600.001D][Port][Port: 2][PortID 0x30200 PortWWN 10:00:00:06:2b:12:d3:55 logged into nameserver.]
[660][Sun Dec 09 04:18:32.324 UTC 2012][I][8600.001E][Port][Port: 10][PortID 0x30a00 PortWWN 21:01:00:1b:32:22:30:53 logged out of nameserver.]
[661][Sun Dec 09 04:18:32.326 UTC 2012][I][8600.0020][Port][Port: 10][SYNC_LOSS]
[662][Sun Dec 09 04:18:32.913 UTC 2012][I][8600.001F][Port][Port: 10][SYNC_ACQ]
[663][Sun Dec 09 04:18:33.024 UTC 2012][I][8600.001D][Port][Port: 10][PortID 0x30a00 PortWWN 21:01:00:1b:32:22:30:53 logged into nameserver.]
Just ignore the timestamps, as it seems that the time is not set
correctly, but the dates match my two issues from today and Thursday,
which accounts for three days. I didn't catch that before, but it seems
to clearly indicate a problem with the FC connection…
But what do I make of this information?
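One way to make sense of these entries is to pair each SYNC_LOSS with the
next SYNC_ACQ on the same port and see how long the link was down. The
sketch below does that in Python. The regular expression is inferred only
from the entries quoted above, not from any switch documentation, and the
names parse_switch_log, sync_outages and SAMPLE are made up for
illustration.

```python
import re
from datetime import datetime

# Entry layout inferred from the quoted log lines (an assumption, not a
# vendor specification): [seq][timestamp][severity][code][Port][Port: N][message]
ENTRY_RE = re.compile(
    r"\[(?P<seq>\d+)\]"
    r"\[(?P<ts>[^\]]+)\]"
    r"\[(?P<sev>[^\]]+)\]"
    r"\[(?P<code>[^\]]+)\]"
    r"\[Port\]\[Port:\s*(?P<port>\d+)\]"
    r"\[(?P<msg>[^\]]+)\]"
)

# Two entries copied from the mail, including the mail client's line wraps.
SAMPLE = """\
[657][Thu Dec 06 03:33:05.829 UTC 2012][I][8600.0020][Port][Port:
2][SYNC_LOSS]
[658][Thu Dec 06 03:37:08.077 UTC 2012][I][8600.001F][Port][Port:
2][SYNC_ACQ]
"""


def parse_switch_log(text):
    """Return a list of (port, timestamp, message) tuples in log order."""
    # Collapse all whitespace so entries wrapped across lines still match.
    flat = " ".join(text.split())
    events = []
    for m in ENTRY_RE.finditer(flat):
        ts = datetime.strptime(m.group("ts"), "%a %b %d %H:%M:%S.%f UTC %Y")
        events.append((int(m.group("port")), ts, m.group("msg")))
    return events


def sync_outages(events):
    """Pair each SYNC_LOSS with the next SYNC_ACQ on the same port."""
    lost = {}      # port -> time at which sync was lost
    outages = []   # (port, time of loss, duration as timedelta)
    for port, ts, msg in events:
        if msg == "SYNC_LOSS":
            lost[port] = ts
        elif msg == "SYNC_ACQ" and port in lost:
            start = lost.pop(port)
            outages.append((port, start, ts - start))
    return outages


if __name__ == "__main__":
    for port, start, duration in sync_outages(parse_switch_log(SAMPLE)):
        print(f"port {port}: link down at {start}, "
              f"for {duration.total_seconds():.3f} s")
```

For the Port 2 entries this gives an outage of roughly four minutes
(03:33:05 to 03:37:08), while Port 10 regained sync in under a second.
Since the switch clock looks wrong, only the durations and the spacing
between events are meaningful here, not the absolute times.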
Did you check for messages in /var/adm/messages which might indicate
when and how FC connectivity has been lost?
Well, this is the scariest part to me. Neither fmdump nor dmesg
showed anything that would indicate a connectivity issue, at least not
the last time.
Bob
Thanks,
Stephan