My point exactly, more below...

On Jun 15, 2011, at 8:20 PM, Fred Liu wrote:
>> This is only true if the pool is not protected. Please protect your
>> pool with mirroring or raidz*.
>>  -- richard
>
> Yes. We use a raidz2 without any spares. In theory, with one disk broken,
> there should be no problem. But in reality, we saw the NFS service interrupted:
>
> Jun  9 23:28:59 cn03 scsi_vhci: [ID 734749 kern.warning] WARNING:
> vhci_scsi_reset 0x1
> Jun  9 23:28:59 cn03 scsi: [ID 365881 kern.info]
> /pci@0,0/pci8086,3410@9/pci1000,72@0 (mpt_sas0):
> Jun  9 23:28:59 cn03 Log info 0x31140000 received for target 11.
> Jun  9 23:28:59 cn03 scsi_status=0x0, ioc_status=0x8048, scsi_state=0xc

This message is from the disk saying that it aborted a command. These are
usually preceded by a reset, as shown here. What caused the reset condition?
Was it actually target 11, or did target 11 just get caught up in the reset
storm?

> ....
> [Truncating similar SCSI errors]
> ....
>
> Jun 10 08:04:38 cn03 svc.startd[9]: [ID 122153 daemon.warning]
> svc:/network/nfs/server:default: Method or service exit timed out. Killing
> contract 71840.
> Jun 10 08:04:38 cn03 svc.startd[9]: [ID 636263 daemon.warning]
> svc:/network/nfs/server:default: Method "/lib/svc/method/nfs-server stop 105"
> failed due to signal KILL.
>
> ....
> [Truncating similar SCSI errors]
> ....
>
> Jun 10 09:04:38 cn03 svc.startd[9]: [ID 122153 daemon.warning]
> svc:/network/nfs/server:default: Method or service exit timed out. Killing
> contract 71855.
> Jun 10 09:04:38 cn03 svc.startd[9]: [ID 636263 daemon.warning]
> svc:/network/nfs/server:default: Method "/lib/svc/method/nfs-server stop 105"
> failed due to signal KILL.
>
> This is outside my original assumptions when I designed this file box.
> But this NFS interruption may **NOT** be due to the degraded zpool, although
> the one broken disk is almost the only **obvious** event that night.
> I will add a hot spare and enable autoreplace to see if it happens again.

A hot spare will not help you here.
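To tell one noisy disk apart from a bus-wide reset storm, a rough sketch is to tally the mpt_sas "Log info ... received for target N" events per target. This assumes only the message format shown in the excerpt above; on a real box you would feed it /var/adm/messages (or a saved copy):

```shell
#!/bin/sh
# Sketch: count mpt_sas "Log info ... received for target N" events per
# target. One dominant target suggests a single bad disk; similar counts
# across many targets point at a shared path (expander, cabling, HBA).
count_resets() {
  # stdin: syslog lines; stdout: "count target" pairs, busiest first
  sed -n 's/.*received for target \([0-9][0-9]*\)\..*/\1/p' \
    | sort | uniq -c | sort -rn
}
```

Then something like `count_resets < /var/adm/messages` around the time of the incident would show whether target 11 was the only complainer.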
The problem is not constrained to one disk. In fact, a hot spare may be the
worst thing here, because it can kick in for a disk that is merely complaining
about a clogged expander or spurious resets. That triggers a resilver that
reads from the actually broken disk, which causes more resets, which kicks out
another disk, which causes another resilver, and so on.
 -- richard

_______________________________________________
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss