On 20.01.13 16:51, Edward Ned Harvey
(opensolarisisdeadlongliveopensolaris) wrote:
From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-
boun...@opensolaris.org] On Behalf Of Stephan Budach
I am always experiencing chksum errors while scrubbing my zpool(s), but
I have never experienced chksum errors while resilvering. Does anybody know
why that would be?
When you resilver, you're not reading all the data on all the drives, only
just enough to resilver, which doesn't include the data that was already
in sync (maybe a little of it, but mostly not). Even if you have a completely
failed drive, replaced with a completely new, empty drive, in a 3-way mirror
you only need to read one good copy of the data in order to write the
resilvered data onto the new drive. So you could still be failing to detect
cksum errors on the *other* side of the mirror, which wasn't read during the
resilver.
What's more, when you resilver, the system just writes to the target disk; it
does not go back and verify every written block on the target disk.
So think of a scrub as a "complete, thorough resilver", whereas a resilver is
just a lightweight version, doing only the parts that are known to be
out-of-sync, and without subsequent read verification.
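
As a rough illustration of that difference (the pool and device names below
are placeholders only):

    # a resilver only reads enough to rebuild the replaced device
    zpool replace tank c1t5d0 c1t6d0
    # a scrub reads and verifies every allocated block on every device
    zpool scrub tank
    # cksum errors, if any, show up in the CKSUM column of the status output
    zpool status -v tank
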
Well, I have always issued a scrub after a resilver, but since we
completely "re-designed" our server room, things started to act up and
every scrub would turn up at least some chksum errors. On the Sun Fire 4170 I
only noticed these chksum errors, while on the Dell the whole thing
sometimes broke down and ZFS would mark numerous disks as faulted.
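
Something like the following might help narrow down whether those cksum
errors line up with transport faults ("tank" again is just a placeholder for
the pool name):

    # which vdevs accumulated READ/WRITE/CKSUM errors, and which files are affected
    zpool status -v tank
    # what the fault manager has diagnosed so far (disk vs. transport faults)
    fmadm faulty
    # per-device soft/hard/transport error counters
    iostat -En
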
This happens on all of my servers, the Sun Fire 4170M2 and the
Dell PE 650, and on any FC storage that I have.
While you apparently have been able to keep the system in production for a
while, consider yourself lucky. You have a real problem, and solving it
probably won't be easy. Your problem is either hardware, firmware, or drivers.
If you have a support contract on the Sun, I would recommend starting there,
because the Dell is definitely a configuration that you won't find official
support for - just a lot of community contributors, who will likely not provide
a super awesome answer for you super soon.
I know, I have dedicated quite some time to keeping this setup up and
running. I do have support coverage for my two Sun Solaris servers, but
as you may have experienced as well, you're sometimes better off asking
here first… ;)
I have gone over our SAN setup/topology and maybe I have found at least
one issue worth looking at: we have five QLogic SANbox 5600 switches, and one
of them basically operates as a core switch where all the other ISLs are
hooked up. That is, this switch has 4 ISLs and 12 storage array
connections, while the Dell sits on another SANbox, so all of its traffic is
routed through that core switch.
I don't know, but maybe this is a bit too much for this setup; the Dell
hosts around 240 drives, which are mostly located on a neighbouring switch.
I will try to tweak this setup so that the Dell gets a direct connection
to that core SANbox, which should vastly reduce the inter-switch traffic.
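
Before and after re-cabling, a few sanity checks like these should show
whether the host still sees all paths to the remote arrays (the WWN below is
only a placeholder):

    # local HBA ports, their link state and speed
    fcinfo hba-port
    # remote ports visible through one local port (substitute your port WWN)
    fcinfo remote-port -p 2100001b32xxxxxx
    # multipathed LUs and how many operational paths each one currently has
    mpathadm list lu
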
I am also seeing these warnings in /var/adm/messages on both the Dell
and my new Sun Server X2:
Jan 20 18:22:10 solaris11b scsi: [ID 243001 kern.warning] WARNING:
/pci@0,0/pci8086,3c08@3/pci1077,171@0,1/fp@0,0 (fcp0):
Jan 20 18:22:10 solaris11b SCSI command to d_id=0x10601 lun=0x0
failed, Bad FCP response values: rsvd1=0, rsvd2=0, sts-rsvd1=0,
sts-rsvd2=0, rsplen=0, senselen=0
Jan 20 18:22:10 solaris11b scsi: [ID 243001 kern.warning] WARNING:
/pci@0,0/pci8086,3c08@3/pci1077,171@0,1/fp@0,0 (fcp0):
Jan 20 18:22:10 solaris11b SCSI command to d_id=0x30e01 lun=0x1
failed, Bad FCP response values: rsvd1=0, rsvd2=0, sts-rsvd1=0,
sts-rsvd2=0, rsplen=0, senselen=0
These are always targeted at LUNs on remote SANboxes…
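
A quick way to check whether those warnings cluster on particular remote
ports (assuming they all carry a d_id= field like the ones above):

    # count the FCP warnings per destination d_id
    grep 'd_id=' /var/adm/messages | \
        sed 's/.*d_id=\(0x[0-9a-fA-F]*\).*/\1/' | sort | uniq -c | sort -rn
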
_______________________________________________
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss