more below...

On Jun 16, 2011, at 2:27 AM, Fred Liu wrote:

> Fixing a typo in my last thread...
> 
>> -----Original Message-----
>> From: Fred Liu
>> Sent: Thursday, June 16, 2011 17:22
>> To: 'Richard Elling'
>> Cc: Jim Klimov; zfs-discuss@opensolaris.org
>> Subject: RE: [zfs-discuss] zfs global hot spares?
>> 
>>> This message is from the disk saying that it aborted a command. These are
>>> usually preceded by a reset, as shown here. What caused the reset
>>> condition? Was it actually target 11 or did target 11 get caught up in
>>> the reset storm?
>>> 
>> 
> It happened in the middle of the night and nobody touched the file box.
> I assume this is the transitional state before the disk is *thoroughly*
> damaged:
> 
> Jun 10 09:34:11 cn03 fmd: [ID 377184 daemon.error] SUNW-MSG-ID: ZFS-8000-FD, TYPE: Fault, VER: 1, SEVERITY: Major
> Jun 10 09:34:11 cn03 EVENT-TIME: Fri Jun 10 09:34:11 CST 2011
> Jun 10 09:34:11 cn03 PLATFORM: X8DTH-i-6-iF-6F, CSN: 1234567890, HOSTNAME: cn03
> Jun 10 09:34:11 cn03 SOURCE: zfs-diagnosis, REV: 1.0
> Jun 10 09:34:11 cn03 EVENT-ID: 4f4bfc2c-f653-ed20-ab13-eef72224af5e
> Jun 10 09:34:11 cn03 DESC: The number of I/O errors associated with a ZFS device exceeded
> Jun 10 09:34:11 cn03         acceptable levels.  Refer to http://sun.com/msg/ZFS-8000-FD for more information.
> Jun 10 09:34:11 cn03 AUTO-RESPONSE: The device has been offlined and marked as faulted.  An attempt
> Jun 10 09:34:11 cn03         will be made to activate a hot spare if available.
> Jun 10 09:34:11 cn03 IMPACT: Fault tolerance of the pool may be compromised.
> Jun 10 09:34:11 cn03 REC-ACTION: Run 'zpool status -x' and replace the bad device.

The 'zpool status -x' output would be useful here, since these error reports
do not include a pointer to the faulty device. fmadm can also give more
information.
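
Something along these lines would help pin it down (from memory, so
double-check the man pages on your build):

  # zpool status -x     # shows only pools with problems, including the faulted vdev
  # fmadm faulty        # lists active faults with the FMRI of the suspect resource
  # fmdump -eV          # dumps the underlying ereports that fed the diagnosis

fmdump -eV is the most verbose of the three and should show the individual
I/O errors behind the ZFS-8000-FD diagnosis.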

> 
> After I rebooted it, I got:
> Jun 10 11:38:49 cn03 genunix: [ID 540533 kern.notice] ^MSunOS Release 5.11 Version snv_134 64-bit
> Jun 10 11:38:49 cn03 genunix: [ID 683174 kern.notice] Copyright 1983-2010 Sun Microsystems, Inc.  All rights reserved.
> Jun 10 11:38:49 cn03 Use is subject to license terms.
> Jun 10 11:38:49 cn03 unix: [ID 126719 kern.info] features: 7f7fffff<sse4_2,sse4_1,ssse3,cpuid,mwait,tscp,cmp,cx16,sse3,nx,asysc,htt,sse2,sse,sep,pat,cx8,pae,mca,mmx,cmov,de,pge,mtrr,msr,tsc,lgpg>
> 
> Jun 10 11:39:06 cn03 scsi: [ID 365881 kern.info] /pci@0,0/pci8086,3410@9/pci1000,72@0 (mpt_sas0):
> Jun 10 11:39:06 cn03    mptsas0 unrecognized capability 0x3
> 
> Jun 10 11:39:42 cn03 scsi: [ID 107833 kern.warning] WARNING: /scsi_vhci/disk@g5000c50009723937 (sd3):
> Jun 10 11:39:42 cn03    drive offline
> Jun 10 11:39:47 cn03 scsi: [ID 107833 kern.warning] WARNING: /scsi_vhci/disk@g5000c50009723937 (sd3):
> Jun 10 11:39:47 cn03    drive offline
> Jun 10 11:39:52 cn03 scsi: [ID 107833 kern.warning] WARNING: /scsi_vhci/disk@g5000c50009723937 (sd3):
> Jun 10 11:39:52 cn03    drive offline
> Jun 10 11:39:57 cn03 scsi: [ID 107833 kern.warning] WARNING: /scsi_vhci/disk@g5000c50009723937 (sd3):
> Jun 10 11:39:57 cn03    drive offline

mpathadm can be used to determine the device paths for this disk.
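
For example (the controller number in the device name below is a guess; the
actual c#t#d# name will show up in 'format' or 'zpool status'):

  # mpathadm list lu                                      # enumerate the multipathed LUs
  # mpathadm show lu /dev/rdsk/c0t5000C50009723937d0s2    # paths and target ports for this one

That maps the scsi_vhci name in the messages back to the physical paths and
HBA ports, which helps tell a sick disk from a sick expander or cable.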

Notice how the disk goes offline multiple times. There is some sort of
recovery going on here that keeps failing. I call these "wounded soldiers"
because they take a lot more care than a dead soldier; you would be better
off if the drive died completely.
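
If you can confirm which drive is the wounded one, it is usually better to
take it out of the pool yourself than to let it limp along. A rough sketch,
with placeholder pool and device names:

  # zpool offline tank c0t5000C50009723937d0                # stop sending I/O to the suspect
  # zpool replace tank c0t5000C50009723937d0 c0tNEWDISKd0   # resilver onto a known-good drive

Just be sure you have the right victim first (fmadm faulty, iostat -En, and
the messages file should all point at the same disk).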

>> 
>> 
>>> 
>>> Hot spare will not help you here. The problem is not constrained to one
>>> disk. In fact, a hot spare may be the worst thing here because it can
>>> kick in for the disk complaining about a clogged expander or spurious
>>> resets.  This causes a resilver that reads from the actual broken disk,
>>> that causes more resets, that kicks out another disk that causes a
>>> resilver, and so on.
>>> -- richard
>>> 
>> 
> So warm spares could be a "better" choice in this situation?
> BTW, under what conditions will a SCSI reset storm happen?

In my experience they start randomly and in some cases are not reproducible.
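
On the warm-spare question: with a disk that is installed but not configured
as a spare, nothing kicks in automatically, so you avoid the cascade above at
the cost of doing the replacement by hand. Roughly (placeholder names again):

  # zpool status -x                                  # confirm which disk is actually faulted
  # zpool replace tank FAULTED-DISK WARM-SPARE-DISK
  # zpool status tank                                # watch the resilver

The trade-off is a longer window of reduced redundancy, because nothing
happens until an operator acts.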

> How can we be immune to this, so that the file service is NOT
> interrupted?

Are you asking for fault tolerance?  If so, then you need a fault-tolerant
system like a Tandem. If you are asking for a way to build a cost-effective
solution using commercial, off-the-shelf (COTS) components, then that is far
beyond what can be easily said in a forum posting.
 -- richard


_______________________________________________
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
