Fixing a typo in my last thread...
> -----Original Message-----
> From: Fred Liu
> Sent: Thursday, June 16, 2011 17:22
> To: 'Richard Elling'
> Cc: Jim Klimov; zfs-discuss@opensolaris.org
> Subject: RE: [zfs-discuss] zfs global hot spares?

> > This message is from the disk saying that it aborted a command. These are
> > usually preceded by a reset, as shown here. What caused the reset
> > condition? Was it actually target 11 or did target 11 get caught up in
> > the reset storm?

> It happened in the middle of the night and nobody touched the file box. I
> assume this is the transitional state before the disk is *thoroughly*
> damaged:

Jun 10 09:34:11 cn03 fmd: [ID 377184 daemon.error] SUNW-MSG-ID: ZFS-8000-FD, TYPE: Fault, VER: 1, SEVERITY: Major
Jun 10 09:34:11 cn03 EVENT-TIME: Fri Jun 10 09:34:11 CST 2011
Jun 10 09:34:11 cn03 PLATFORM: X8DTH-i-6-iF-6F, CSN: 1234567890, HOSTNAME: cn03
Jun 10 09:34:11 cn03 SOURCE: zfs-diagnosis, REV: 1.0
Jun 10 09:34:11 cn03 EVENT-ID: 4f4bfc2c-f653-ed20-ab13-eef72224af5e
Jun 10 09:34:11 cn03 DESC: The number of I/O errors associated with a ZFS device exceeded
Jun 10 09:34:11 cn03     acceptable levels. Refer to http://sun.com/msg/ZFS-8000-FD for more information.
Jun 10 09:34:11 cn03 AUTO-RESPONSE: The device has been offlined and marked as faulted. An attempt
Jun 10 09:34:11 cn03     will be made to activate a hot spare if available.
Jun 10 09:34:11 cn03 IMPACT: Fault tolerance of the pool may be compromised.
Jun 10 09:34:11 cn03 REC-ACTION: Run 'zpool status -x' and replace the bad device.

> After I rebooted it, I got:

Jun 10 11:38:49 cn03 genunix: [ID 540533 kern.notice] SunOS Release 5.11 Version snv_134 64-bit
Jun 10 11:38:49 cn03 genunix: [ID 683174 kern.notice] Copyright 1983-2010 Sun Microsystems, Inc. All rights reserved.
Jun 10 11:38:49 cn03 Use is subject to license terms.
Jun 10 11:38:49 cn03 unix: [ID 126719 kern.info] features: 7f7fffff<sse4_2,sse4_1,ssse3,cpuid,mwait,tscp,cmp,cx16,sse3,nx,asysc,htt,sse2,sse,sep,pat,cx8,pae,mca,mmx,cmov,de,pge,mtrr,msr,tsc,lgpg>
Jun 10 11:39:06 cn03 scsi: [ID 365881 kern.info] /pci@0,0/pci8086,3410@9/pci1000,72@0 (mpt_sas0):
Jun 10 11:39:06 cn03     mptsas0 unrecognized capability 0x3
Jun 10 11:39:42 cn03 scsi: [ID 107833 kern.warning] WARNING: /scsi_vhci/disk@g5000c50009723937 (sd3):
Jun 10 11:39:42 cn03     drive offline
Jun 10 11:39:47 cn03 scsi: [ID 107833 kern.warning] WARNING: /scsi_vhci/disk@g5000c50009723937 (sd3):
Jun 10 11:39:47 cn03     drive offline
Jun 10 11:39:52 cn03 scsi: [ID 107833 kern.warning] WARNING: /scsi_vhci/disk@g5000c50009723937 (sd3):
Jun 10 11:39:52 cn03     drive offline
Jun 10 11:39:57 cn03 scsi: [ID 107833 kern.warning] WARNING: /scsi_vhci/disk@g5000c50009723937 (sd3):
Jun 10 11:39:57 cn03     drive offline

> > Hot spare will not help you here. The problem is not constrained to one
> > disk. In fact, a hot spare may be the worst thing here because it can
> > kick in for the disk complaining about a clogged expander or spurious
> > resets. This causes a resilver that reads from the actual broken disk,
> > that causes more resets, that kicks out another disk that causes a
> > resilver, and so on.
> >  -- richard

> So would warm spares be the "better" choice in this situation? BTW, under
> what conditions does a SCSI reset storm happen? How can we be immune to it
> so that the file service is NOT interrupted?
>
> Thanks.
> Fred

_______________________________________________
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
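[Editor's note: for readers following along, the manual "warm spare" workflow being discussed — an operator running 'zpool replace' by hand instead of letting a configured hot spare kick in automatically — can be sketched as a dry run. The pool name and the spare's device name below are placeholders, not taken from this thread; only the faulted disk's WWN comes from the sd3 log lines above. The commands are echoed rather than executed so the sketch is safe to paste.]

```shell
# Hypothetical manual (warm-spare) replacement, shown as a dry run.
# POOL and SPARE are made-up names -- substitute real ones from your own
# 'zpool status -x' output, then drop the 'echo' to run for real.
POOL=tank                       # assumption: pool name
BAD=c0t5000C50009723937d0       # faulted disk (WWN from the sd3 warnings above)
SPARE=c0t5000C500DEADBEEFd0     # assumption: powered-on, unconfigured warm spare

echo zpool status -x                             # identify the faulted vdev
echo zpool replace "$POOL" "$BAD" "$SPARE"       # resilver onto SPARE; BAD is
                                                 # detached when resilver completes
```

Because the spare is not configured as a pool spare vdev, nothing fires automatically during a reset storm; the operator decides when the pool is quiet enough to start the one resilver.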