my point exactly, more below...

On Jun 15, 2011, at 8:20 PM, Fred Liu wrote:

>> This is only true if the pool is not protected. Please protect your
>> pool with mirroring or raidz*.
>> -- richard
>> 
> 
> Yes. We use a raidz2 without any spares. In theory, with one disk broken,
> there should be no problem. But in reality, we saw NFS service interrupted:
> 
> Jun  9 23:28:59 cn03 scsi_vhci: [ID 734749 kern.warning] WARNING: 
> vhci_scsi_reset 0x1
> Jun  9 23:28:59 cn03 scsi: [ID 365881 kern.info] 
> /pci@0,0/pci8086,3410@9/pci1000,72@0 (mpt_sas0):
> Jun  9 23:28:59 cn03    Log info 0x31140000 received for target 11.
> Jun  9 23:28:59 cn03    scsi_status=0x0, ioc_status=0x8048, scsi_state=0xc

This message is from the disk saying that it aborted a command. These are
usually preceded by a reset, as shown here. What caused the reset condition?
Was it actually target 11 or did target 11 get caught up in the reset storm?
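
One rough way to answer that is to count the "Log info" messages per target
across the whole log. The sketch below is only an illustration: the regular
expression is modeled on the single mpt_sas line quoted above, the function
name is made up, and the log path in the usage note is an assumption about
where syslog writes on your system.

```python
import re
from collections import Counter

# Pattern modeled on the quoted line:
#   "Jun  9 23:28:59 cn03    Log info 0x31140000 received for target 11."
# Other driver versions may word this differently (assumption).
LOG_INFO = re.compile(r"Log info (0x[0-9a-fA-F]+) received for target (\d+)")

def count_log_info_by_target(lines):
    """Return a Counter mapping target number to count of 'Log info' messages."""
    counts = Counter()
    for line in lines:
        match = LOG_INFO.search(line)
        if match:
            counts[int(match.group(2))] += 1
    return counts

# Sample taken from the log excerpt quoted in this thread.
sample = [
    "Jun  9 23:28:59 cn03    Log info 0x31140000 received for target 11.",
    "Jun  9 23:28:59 cn03    scsi_status=0x0, ioc_status=0x8048, scsi_state=0xc",
]

for target, n in count_log_info_by_target(sample).most_common():
    print(f"target {target}: {n} message(s)")
```

To run it against a real log, pass an open file instead of the sample list,
for example open("/var/adm/messages") if that is where your messages land.
If nearly every message names target 11, that disk is the likely culprit; if
many targets show up in the same time window, suspect the expander or cabling
rather than a single disk.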

> 
> ....
> ....
> Truncating similar scsi error
> ....
> ....
> 
> 
> Jun 10 08:04:38 cn03 svc.startd[9]: [ID 122153 daemon.warning] 
> svc:/network/nfs/server:default: Method or service exit timed out.  Killing 
> contract 71840.
> Jun 10 08:04:38 cn03 svc.startd[9]: [ID 636263 daemon.warning] 
> svc:/network/nfs/server:default: Method "/lib/svc/method/nfs-server stop 105" 
> failed due to signal KILL.
> 
> ....
> ....
> Truncating scsi similar error
> ....
> ....
> 
> Jun 10 09:04:38 cn03 svc.startd[9]: [ID 122153 daemon.warning] 
> svc:/network/nfs/server:default: Method or service exit timed out.  Killing 
> contract 71855.
> Jun 10 09:04:38 cn03 svc.startd[9]: [ID 636263 daemon.warning] 
> svc:/network/nfs/server:default: Method "/lib/svc/method/nfs-server stop 105" 
> failed due to signal KILL.
> 
> This is out of my original assumption when I designed this file box.
> But this NFS interruption may **NOT** be due to the degraded zpool although 
> one broken disk is almost the only **obvious** event in the night.
> I will add a hot spare and enable autoreplace to see if it will happen again.

A hot spare will not help you here. The problem is not constrained to one disk.
In fact, a hot spare may be the worst thing here because it can kick in for a
disk that is only complaining about a clogged expander or spurious resets. That
starts a resilver that reads from the actual broken disk, which causes more
resets, which kicks out another disk, which causes another resilver, and so on.
 -- richard

