> -----Original Message-----
> From: Fred Liu
> Sent: Thursday, June 16, 2011 17:28
> To: Fred Liu; 'Richard Elling'
> Cc: 'Jim Klimov'; 'zfs-discuss@opensolaris.org'
> Subject: RE: [zfs-discuss] zfs global hot spares?
>
> Fixing a typo in my last message...
>
> > -----Original Message-----
> > From: Fred Liu
> > Sent: Thursday, June 16, 2011 17:22
> > To: 'Richard Elling'
> > Cc: Jim Klimov; zfs-discuss@opensolaris.org
> > Subject: RE: [zfs-discuss] zfs global hot spares?
> >
> > > This message is from the disk saying that it aborted a command.
> > > These are usually preceded by a reset, as shown here. What caused
> > > the reset condition? Was it actually target 11 or did target 11
> > > get caught up in the reset storm?
> > >
> >
> It happened in the middle of the night, and nobody touched the file
> box. I assume this is the transitional state before the disk is
> *thoroughly* damaged:
>
> Jun 10 09:34:11 cn03 fmd: [ID 377184 daemon.error] SUNW-MSG-ID:
> ZFS-8000-FD, TYPE: Fault, VER: 1, SEVERITY: Major
> Jun 10 09:34:11 cn03 EVENT-TIME: Fri Jun 10 09:34:11 CST 2011
> Jun 10 09:34:11 cn03 PLATFORM: X8DTH-i-6-iF-6F, CSN: 1234567890,
> HOSTNAME: cn03
> Jun 10 09:34:11 cn03 SOURCE: zfs-diagnosis, REV: 1.0
> Jun 10 09:34:11 cn03 EVENT-ID: 4f4bfc2c-f653-ed20-ab13-eef72224af5e
> Jun 10 09:34:11 cn03 DESC: The number of I/O errors associated with a
> ZFS device exceeded
> Jun 10 09:34:11 cn03 acceptable levels. Refer to
> http://sun.com/msg/ZFS-8000-FD for more information.
> Jun 10 09:34:11 cn03 AUTO-RESPONSE: The device has been offlined and
> marked as faulted. An attempt
> Jun 10 09:34:11 cn03 will be made to activate a hot spare if
> available.
> Jun 10 09:34:11 cn03 IMPACT: Fault tolerance of the pool may be
> compromised.
> Jun 10 09:34:11 cn03 REC-ACTION: Run 'zpool status -x' and replace the
> bad device.
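>
> For reference, recovery along the lines the message suggests would be
> roughly this (pool name "tank" and device c0t11d0 are placeholders;
> use whatever 'zpool status -x' actually reports on our box):
>
>   # fmdump -v -u 4f4bfc2c-f653-ed20-ab13-eef72224af5e  # details of the fault event
>   # zpool status -x              # identify the faulted device
>   # zpool replace tank c0t11d0   # swap in a replacement disk
>   # zpool status tank            # watch the resilver progress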
>
> After I rebooted it, I got:
> Jun 10 11:38:49 cn03 genunix: [ID 540533 kern.notice] ^MSunOS Release
> 5.11 Version snv_134 64-bit
> Jun 10 11:38:49 cn03 genunix: [ID 683174 kern.notice] Copyright
> 1983-2010 Sun Microsystems, Inc. All rights reserved.
> Jun 10 11:38:49 cn03 Use is subject to license terms.
> Jun 10 11:38:49 cn03 unix: [ID 126719 kern.info] features:
> 7f7fffff<sse4_2,sse4_1,ssse3,cpuid,mwait,tscp,cmp,cx16,sse3,nx,asysc,
> htt,sse2,sse,sep,pat,cx8,pae,mca,mmx,cmov,de,pge,mtrr,msr,tsc,lgpg>
>
> Jun 10 11:39:06 cn03 scsi: [ID 365881 kern.info]
> /pci@0,0/pci8086,3410@9/pci1000,72@0 (mpt_sas0):
> Jun 10 11:39:06 cn03 mptsas0 unrecognized capability 0x3
>
> Jun 10 11:39:42 cn03 scsi: [ID 107833 kern.warning] WARNING:
> /scsi_vhci/disk@g5000c50009723937 (sd3):
> Jun 10 11:39:42 cn03 drive offline
> Jun 10 11:39:47 cn03 scsi: [ID 107833 kern.warning] WARNING:
> /scsi_vhci/disk@g5000c50009723937 (sd3):
> Jun 10 11:39:47 cn03 drive offline
> Jun 10 11:39:52 cn03 scsi: [ID 107833 kern.warning] WARNING:
> /scsi_vhci/disk@g5000c50009723937 (sd3):
> Jun 10 11:39:52 cn03 drive offline
> Jun 10 11:39:57 cn03 scsi: [ID 107833 kern.warning] WARNING:
> /scsi_vhci/disk@g5000c50009723937 (sd3):
> Jun 10 11:39:57 cn03 drive offline
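>
> To check whether sd3 is really dead (rather than just caught up in the
> resets), I suppose something like this would show it (a sketch; these
> are stock Solaris commands, not anything ZFS-specific):
>
>   # iostat -En    # per-device Soft/Hard/Transport error counts
>   # fmadm faulty  # what FMA currently considers faulted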
> >
> >
> > >
> > > Hot spare will not help you here. The problem is not constrained
> > > to one disk. In fact, a hot spare may be the worst thing here
> > > because it can kick in for the disk complaining about a clogged
> > > expander or spurious resets. This causes a resilver that reads
> > > from the actual broken disk, that causes more resets, that kicks
> > > out another disk that causes a resilver, and so on.
> > > -- richard
> > >
> >
> So warm spares could be a "better" choice in this situation?
> BTW, under what conditions does a SCSI reset storm happen?
> And how can we be immune to it, so that file service is NOT
> interrupted?
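>
> By "warm spares" I mean something like the sketch below: the disk sits
> in the chassis but is removed from the pool's spare list, so the retire
> agent cannot activate it automatically, and we only pull it in by hand
> once the bad disk is confirmed dead (c0t11d0/c0t12d0 are placeholder
> device names):
>
>   # zpool remove tank c0t12d0           # no longer a hot spare; stays in the chassis
>   ... later, after confirming c0t11d0 is really dead ...
>   # zpool replace tank c0t11d0 c0t12d0  # manual, operator-driven replacement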
> >
> >
> > Thanks.
> > Fred
_______________________________________________
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss