more below...

On Jun 16, 2011, at 2:27 AM, Fred Liu wrote:
> Fixing a typo in my last thread...
>
>> -----Original Message-----
>> From: Fred Liu
>> Sent: Thursday, June 16, 2011 17:22
>> To: 'Richard Elling'
>> Cc: Jim Klimov; zfs-discuss@opensolaris.org
>> Subject: RE: [zfs-discuss] zfs global hot spares?
>>
>>> This message is from the disk saying that it aborted a command. These are
>>> usually preceded by a reset, as shown here. What caused the reset condition?
>>> Was it actually target 11 or did target 11 get caught up in the reset storm?
>>>
>>
> It happened in the middle of the night and nobody touched the file box.
> I assume this is the transitional state before the disk is *thoroughly* damaged:
>
> Jun 10 09:34:11 cn03 fmd: [ID 377184 daemon.error] SUNW-MSG-ID: ZFS-8000-FD, TYPE: Fault, VER: 1, SEVERITY: Major
> Jun 10 09:34:11 cn03 EVENT-TIME: Fri Jun 10 09:34:11 CST 2011
> Jun 10 09:34:11 cn03 PLATFORM: X8DTH-i-6-iF-6F, CSN: 1234567890, HOSTNAME: cn03
> Jun 10 09:34:11 cn03 SOURCE: zfs-diagnosis, REV: 1.0
> Jun 10 09:34:11 cn03 EVENT-ID: 4f4bfc2c-f653-ed20-ab13-eef72224af5e
> Jun 10 09:34:11 cn03 DESC: The number of I/O errors associated with a ZFS device exceeded
> Jun 10 09:34:11 cn03 acceptable levels.  Refer to http://sun.com/msg/ZFS-8000-FD for more information.
> Jun 10 09:34:11 cn03 AUTO-RESPONSE: The device has been offlined and marked as faulted. An attempt
> Jun 10 09:34:11 cn03 will be made to activate a hot spare if available.
> Jun 10 09:34:11 cn03 IMPACT: Fault tolerance of the pool may be compromised.
> Jun 10 09:34:11 cn03 REC-ACTION: Run 'zpool status -x' and replace the bad device.

The 'zpool status -x' output would be useful. These error reports do not include a pointer to the faulty device. fmadm can also give more info.

>
> After I rebooted it, I got:
>
> Jun 10 11:38:49 cn03 genunix: [ID 540533 kern.notice] ^MSunOS Release 5.11 Version snv_134 64-bit
> Jun 10 11:38:49 cn03 genunix: [ID 683174 kern.notice] Copyright 1983-2010 Sun Microsystems, Inc.  All rights reserved.
> Jun 10 11:38:49 cn03 Use is subject to license terms.
> Jun 10 11:38:49 cn03 unix: [ID 126719 kern.info] features: 7f7fffff<sse4_2,sse4_1,ssse3,cpuid,mwait,tscp,cmp,cx16,sse3,nx,asysc,htt,sse2,sse,sep,pat,cx8,pae,mca,mmx,cmov,de,pge,mtrr,msr,tsc,lgpg>
>
> Jun 10 11:39:06 cn03 scsi: [ID 365881 kern.info] /pci@0,0/pci8086,3410@9/pci1000,72@0 (mpt_sas0):
> Jun 10 11:39:06 cn03 mptsas0 unrecognized capability 0x3
>
> Jun 10 11:39:42 cn03 scsi: [ID 107833 kern.warning] WARNING: /scsi_vhci/disk@g5000c50009723937 (sd3):
> Jun 10 11:39:42 cn03 drive offline
> Jun 10 11:39:47 cn03 scsi: [ID 107833 kern.warning] WARNING: /scsi_vhci/disk@g5000c50009723937 (sd3):
> Jun 10 11:39:47 cn03 drive offline
> Jun 10 11:39:52 cn03 scsi: [ID 107833 kern.warning] WARNING: /scsi_vhci/disk@g5000c50009723937 (sd3):
> Jun 10 11:39:52 cn03 drive offline
> Jun 10 11:39:57 cn03 scsi: [ID 107833 kern.warning] WARNING: /scsi_vhci/disk@g5000c50009723937 (sd3):
> Jun 10 11:39:57 cn03 drive offline

mpathadm can be used to determine the device paths for this disk.

Notice how the disk goes offline multiple times. There is some sort of recovery going on here that continues to fail later. I call these "wounded soldiers" because they take a lot more care than a dead soldier. You would be better off if the drive completely died.
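A rough sketch of the checks mentioned above (the mpathadm logical-unit name below is a guess derived from the WWN 5000c50009723937 in the log; substitute whatever 'mpathadm list lu' actually reports on this box):

  # zpool status -x                # only unhealthy pools; shows which vdev is FAULTED/DEGRADED
  # fmadm faulty                   # active faults and the affected resource/FRU
  # fmdump -eV | tail              # raw ereports behind the ZFS-8000-FD diagnosis
  # mpathadm list lu               # enumerate multipathed logical units
  # mpathadm show lu /dev/rdsk/c0t5000C50009723937d0s2

If mpathadm reports more than one path and only one of them keeps failing, that points more at the expander/cabling side than at the disk itself.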
>>> A hot spare will not help you here. The problem is not constrained to one disk.
>>> In fact, a hot spare may be the worst thing here because it can kick in
>>> for the disk complaining about a clogged expander or spurious resets.
>>> This causes a resilver that reads from the actual broken disk, that causes
>>> more resets, that kicks out another disk that causes a resilver, and so on.
>>>  -- richard
>>>
>>
> So would warm spares be the "better" choice in this situation?
> BTW, under what conditions does a SCSI reset storm happen?

In my experience they start randomly and in some cases are not reproducible.

> How can we be immune to this so as NOT to interrupt the file service?

Are you asking for fault tolerance? If so, then you need a fault-tolerant system like a Tandem. If you are asking for a way to build a cost-effective solution using commercial, off-the-shelf (COTS) components, then that is far beyond what can be easily said in a forum posting.
 -- richard
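For completeness, the warm-spare alternative asked about above is just the manual version of the same operation. A minimal sketch, assuming a pool called tank and a standby disk whose WWN is made up here:

A hot spare is attached to the pool and gets pulled in automatically when a vdev faults:

  # zpool add tank spare c0t5000C500DEADBEEFd0

A warm spare is installed and spun up but deliberately left out of the pool; the administrator starts the replacement, and therefore the resilver, by hand once it is clear which device is actually broken:

  # zpool replace tank c0t5000C50009723937d0 c0t5000C500DEADBEEFd0

The manual step is what breaks the kick-in/resilver/reset cycle described above, at the cost of a longer window of reduced redundancy.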