Fixing a typo in my last thread...
> -----Original Message-----
> From: Fred Liu
> Sent: Thursday, June 16, 2011 17:22
> To: 'Richard Elling'
> Cc: Jim Klimov; zfs-discuss@opensolaris.org
> Subject: RE: [zfs-discuss] zfs global hot spares?

> > This message is from the disk saying that it aborted a command. These are
> > usually preceded by a reset, as shown here. What caused the reset
> > condition? Was it actually target 11 or did target 11 get caught up in
> > the reset storm?

> It happened in the middle of the night and nobody touched the file box. I
> assume this is the transitional state before the disk is *thoroughly*
> damaged:

Jun 10 09:34:11 cn03 fmd: [ID 377184 daemon.error] SUNW-MSG-ID: ZFS-8000-FD, TYPE: Fault, VER: 1, SEVERITY: Major
Jun 10 09:34:11 cn03 EVENT-TIME: Fri Jun 10 09:34:11 CST 2011
Jun 10 09:34:11 cn03 PLATFORM: X8DTH-i-6-iF-6F, CSN: 1234567890, HOSTNAME: cn03
Jun 10 09:34:11 cn03 SOURCE: zfs-diagnosis, REV: 1.0
Jun 10 09:34:11 cn03 EVENT-ID: 4f4bfc2c-f653-ed20-ab13-eef72224af5e
Jun 10 09:34:11 cn03 DESC: The number of I/O errors associated with a ZFS device exceeded
Jun 10 09:34:11 cn03     acceptable levels. Refer to http://sun.com/msg/ZFS-8000-FD for more information.
Jun 10 09:34:11 cn03 AUTO-RESPONSE: The device has been offlined and marked as faulted. An attempt
Jun 10 09:34:11 cn03     will be made to activate a hot spare if available.
Jun 10 09:34:11 cn03 IMPACT: Fault tolerance of the pool may be compromised.
Jun 10 09:34:11 cn03 REC-ACTION: Run 'zpool status -x' and replace the bad device.

> After I rebooted it, I got:

Jun 10 11:38:49 cn03 genunix: [ID 540533 kern.notice] SunOS Release 5.11 Version snv_134 64-bit
Jun 10 11:38:49 cn03 genunix: [ID 683174 kern.notice] Copyright 1983-2010 Sun Microsystems, Inc. All rights reserved.
Jun 10 11:38:49 cn03 Use is subject to license terms.
Jun 10 11:38:49 cn03 unix: [ID 126719 kern.info] features: 7f7fffff<sse4_2,sse4_1,ssse3,cpuid,mwait,tscp,cmp,cx16,sse3,nx,asysc,htt,sse2,sse,sep,pat,cx8,pae,mca,mmx,cmov,de,pge,mtrr,msr,tsc,lgpg>
Jun 10 11:39:06 cn03 scsi: [ID 365881 kern.info] /pci@0,0/pci8086,3410@9/pci1000,72@0 (mpt_sas0):
Jun 10 11:39:06 cn03     mptsas0 unrecognized capability 0x3
Jun 10 11:39:42 cn03 scsi: [ID 107833 kern.warning] WARNING: /scsi_vhci/disk@g5000c50009723937 (sd3):
Jun 10 11:39:42 cn03     drive offline
Jun 10 11:39:47 cn03 scsi: [ID 107833 kern.warning] WARNING: /scsi_vhci/disk@g5000c50009723937 (sd3):
Jun 10 11:39:47 cn03     drive offline
Jun 10 11:39:52 cn03 scsi: [ID 107833 kern.warning] WARNING: /scsi_vhci/disk@g5000c50009723937 (sd3):
Jun 10 11:39:52 cn03     drive offline
Jun 10 11:39:57 cn03 scsi: [ID 107833 kern.warning] WARNING: /scsi_vhci/disk@g5000c50009723937 (sd3):
Jun 10 11:39:57 cn03     drive offline

> > Hot spare will not help you here. The problem is not constrained to one
> > disk. In fact, a hot spare may be the worst thing here because it can
> > kick in for the disk complaining about a clogged expander or spurious
> > resets. This causes a resilver that reads from the actual broken disk,
> > that causes more resets, that kicks out another disk that causes a
> > resilver, and so on.
> >  -- richard

> So would warm spares be the "better" choice in this situation? BTW, under
> what conditions does a SCSI reset storm happen? How can we be immune to it
> so that the file service is NOT interrupted?
>
> Thanks.
> Fred

_______________________________________________
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
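[Editor's note: for readers following along, the manual "warm spare" workflow being discussed — an operator running 'zpool replace' by hand instead of letting a configured hot spare kick in automatically — can be sketched as a dry run. The pool name and the spare's device name below are placeholders, not taken from this thread; only the faulted disk's WWN comes from the sd3 log lines above. The commands are echoed rather than executed so the sketch is safe to paste.]

```shell
# Hypothetical manual (warm-spare) replacement, shown as a dry run.
# POOL and SPARE are made-up names -- substitute real ones from your own
# 'zpool status -x' output, then drop the 'echo' to run for real.
POOL=tank                       # assumption: pool name
BAD=c0t5000C50009723937d0       # faulted disk (WWN from the sd3 warnings above)
SPARE=c0t5000C500DEADBEEFd0     # assumption: powered-on, unconfigured warm spare

echo zpool status -x                             # identify the faulted vdev
echo zpool replace "$POOL" "$BAD" "$SPARE"       # resilver onto SPARE; BAD is
                                                 # detached when resilver completes
```

Because the spare is not configured as a pool spare vdev, nothing fires automatically during a reset storm; the operator decides when the pool is quiet enough to start the one resilver.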