> Hey, Dennis -
>
> I can't help but wonder if the failure is a result of zfs itself finding
> some problems post restart...
Yes, yes, this is my feeling also, but I need to find the data before I
can sleep at night. I am certain that ZFS does not toss out faults on a
whim; there must be a deterministic, logical, code-based reason for the
faults that occur *after* I go to init 3.

> Is there anything in your FMA logs?

Oh God yes, brace yourself :-)

http://www.blastwave.org/dclarke/zfs/fmstat.txt

[ I edited the whitespace here for clarity ]

# fmstat
module             ev_recv ev_acpt wait    svc_t  %w %b open solve memsz bufsz
cpumem-diagnosis         0       0  0.0      2.7   0  0    3     0  4.2K  1.1K
cpumem-retire            0       0  0.0      0.2   0  0    0     0     0     0
disk-transport           0       0  0.0     45.7   0  0    0     0   40b     0
eft                      0       0  0.0      0.7   0  0    0     0  1.2M     0
fabric-xlate             0       0  0.0      0.7   0  0    0     0     0     0
fmd-self-diagnosis       3       0  0.0      0.2   0  0    0     0     0     0
io-retire                0       0  0.0      0.2   0  0    0     0     0     0
snmp-trapgen             2       0  0.0      1.7   0  0    0     0   32b     0
sysevent-transport       0       0  0.0     75.4   0  0    0     0     0     0
syslog-msgs              2       0  0.0      1.4   0  0    0     0     0     0
zfs-diagnosis          296     252  2.0 236719.7  98  0    1     2  176b  144b
zfs-retire               4       0  0.0     27.4   0  0    0     0     0     0

zfs-diagnosis svc_t=236719.7 ?
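That outlier svc_t can be picked out mechanically. A minimal sketch,
assuming the fmstat column layout shown above (svc_t in the fifth
column) and an arbitrary illustrative threshold of 1000 ms:

```shell
# Flag any fmstat module whose average service time (svc_t, column 5)
# exceeds a threshold; 1000 ms is an arbitrary cutoff chosen here.
# NR > 1 skips the header row.
fmstat_flag_slow() {
  awk -v limit=1000 'NR > 1 && $5 + 0 > limit { print $1, $5 }'
}

# Feeding in the header plus two rows of the output above:
printf '%s\n' \
  'module ev_recv ev_acpt wait svc_t %w %b open solve memsz bufsz' \
  'zfs-diagnosis 296 252 2.0 236719.7 98 0 1 2 176b 144b' \
  'zfs-retire 4 0 0.0 27.4 0 0 0 0 0 0' |
  fmstat_flag_slow
# prints: zfs-diagnosis 236719.7
```

On a live Solaris box the same filter could be run as
`fmstat | fmstat_flag_slow`; the function only depends on the column
positions, not on anything Solaris-specific.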
> for a summary and
>
>   fmdump
>
> for a summary of the related errors

http://www.blastwave.org/dclarke/zfs/fmdump.txt

# fmdump
TIME                 UUID                                 SUNW-MSG-ID
Dec 05 21:31:46.1069 aa3bfcfa-3261-cde4-d381-dae8abf296de ZFS-8000-D3
Mar 07 08:46:43.6238 4c8b199b-add1-c3fe-c8d6-9deeff91d9de ZFS-8000-FD
Mar 07 19:37:27.9819 b4824ce2-8f42-4392-c7bc-ab2e9d14b3b7 ZFS-8000-FD
Mar 07 19:37:29.8712 af726218-f1dc-6447-f581-cc6bb1411aa4 ZFS-8000-FD
Mar 07 19:37:30.2302 58c9e01f-8a80-61b0-ffea-ded63a9b076d ZFS-8000-FD
Mar 07 19:37:31.6410 3b0bfd9d-fc39-e7c2-c8bd-879cad9e5149 ZFS-8000-FD
Mar 10 19:37:08.8289 aa3bfcfa-3261-cde4-d381-dae8abf296de FMD-8000-4M Repaired
Mar 23 23:47:36.9701 2b1aa4ae-60e4-c8ef-8eec-d92a18193e7a ZFS-8000-FD
Mar 24 01:29:00.1981 3780a2dd-7381-c053-e186-8112b463c2b7 ZFS-8000-FD
Mar 24 01:29:02.1649 146dad1d-f195-c2d6-c630-c1adcd58b288 ZFS-8000-FD

# fmdump -vu 3780a2dd-7381-c053-e186-8112b463c2b7
TIME                 UUID                                 SUNW-MSG-ID
Mar 24 01:29:00.1981 3780a2dd-7381-c053-e186-8112b463c2b7 ZFS-8000-FD
  100%  fault.fs.zfs.vdev.io
        Problem in: zfs://pool=fibre0/vdev=444604062b426970
           Affects: zfs://pool=fibre0/vdev=444604062b426970
               FRU: -
          Location: -

# fmdump -vu 146dad1d-f195-c2d6-c630-c1adcd58b288
TIME                 UUID                                 SUNW-MSG-ID
Mar 24 01:29:02.1649 146dad1d-f195-c2d6-c630-c1adcd58b288 ZFS-8000-FD
  100%  fault.fs.zfs.vdev.io
        Problem in: zfs://pool=fibre0/vdev=23e4d7426f941f52
           Affects: zfs://pool=fibre0/vdev=23e4d7426f941f52
               FRU: -
          Location: -

> will show more and more information about the error. Note that some of
> it might seem like rubbish. The important bits should be obvious though
> - things like the SUNW message ID (like ZFS-8000-D3), which can be
> pumped into
>
> sun.com/msg

like so :

http://www.sun.com/msg/ZFS-8000-FD

or see

http://www.blastwave.org/dclarke/zfs/ZFS-8000-FD.txt

Article for Message ID: ZFS-8000-FD

  Too many I/O errors on ZFS device

  Type      Fault
  Severity  Major

  Description
    The number of I/O errors associated with a ZFS device exceeded
    acceptable levels.
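The lookup step above can itself be scripted. A hypothetical helper,
assuming message IDs always follow the LETTERS-DIGITS-ALNUM shape seen
in the summary (ZFS-8000-FD, FMD-8000-4M) and that sun.com/msg keys
articles by that ID:

```shell
# Pull the SUNW-MSG-IDs out of a plain `fmdump` summary and print the
# matching sun.com/msg knowledge-article URLs, de-duplicated. Each
# field is tested against the message-ID pattern, so timestamps, UUIDs
# and trailing words like "Repaired" are ignored.
fmdump_msg_urls() {
  awk '{ for (i = 1; i <= NF; i++)
           if ($i ~ /^[A-Z]+-[0-9]+-[0-9A-Z]+$/) print $i }' |
    sort -u | sed 's|^|http://www.sun.com/msg/|'
}

# Against three lines of the summary above:
printf '%s\n' \
  'Dec 05 21:31:46.1069 aa3bfcfa-3261-cde4-d381-dae8abf296de ZFS-8000-D3' \
  'Mar 10 19:37:08.8289 aa3bfcfa-3261-cde4-d381-dae8abf296de FMD-8000-4M Repaired' \
  'Mar 24 01:29:02.1649 146dad1d-f195-c2d6-c630-c1adcd58b288 ZFS-8000-FD' |
  fmdump_msg_urls
# prints:
# http://www.sun.com/msg/FMD-8000-4M
# http://www.sun.com/msg/ZFS-8000-D3
# http://www.sun.com/msg/ZFS-8000-FD
```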
  Automated Response
    The device has been offlined and marked as faulted. An attempt will
    be made to activate a hot spare if available.

  Impact
    The fault tolerance of the pool may be affected.

Yep, I agree, that is what I saw.

> Note also that there should be something interesting in the
> /var/adm/messages log to match any 'faulted' devices.
>
> You might also find an
>
>   fmdump -e

spooky long list of events :

TIME                 CLASS
Mar 23 23:47:28.5586 ereport.fs.zfs.io
Mar 23 23:47:28.5594 ereport.fs.zfs.io
Mar 23 23:47:28.5588 ereport.fs.zfs.io
Mar 23 23:47:28.5592 ereport.fs.zfs.io
Mar 23 23:47:28.5593 ereport.fs.zfs.io
.
.
.
Mar 23 23:47:28.5622 ereport.fs.zfs.io
Mar 23 23:47:28.5560 ereport.fs.zfs.io
Mar 23 23:47:28.5658 ereport.fs.zfs.io
Mar 23 23:48:41.5957 ereport.fs.zfs.io

http://www.blastwave.org/dclarke/zfs/fmdump_e.txt

Ouch, that is a nasty long list, all in a few seconds.

> and
>
>   fmdump -eV

a very detailed, verbose, long list with such entries as

Mar 23 2009 23:48:41.595757900 ereport.fs.zfs.io
nvlist version: 0
        class = ereport.fs.zfs.io
        ena = 0x79c098255f400c01
        detector = (embedded nvlist)
        nvlist version: 0
                version = 0x0
                scheme = zfs
                pool = 0xe3bb9417bc13c68d
                vdev = 0x444604062b426970
        (end detector)
        pool = fibre0
        pool_guid = 0xe3bb9417bc13c68d
        pool_context = 0
        pool_failmode = wait
        vdev_guid = 0x444604062b426970
        vdev_type = disk
        vdev_path = /dev/dsk/c2t17d0s0
        vdev_devid = id1,s...@n20000018625d599d/a
        parent_guid = 0x2cc7f46f722cfd61
        parent_type = mirror
        zio_err = 6
        zio_offset = 0xf97ebf400
        zio_size = 0x1400
        __ttl = 0x1
        __tod = 0x49c81fd9 0x23828b4c

> to be interesting - This is the *error* log as opposed to the *fault*
> log. (Every 'thing that goes wrong' is an error; only those that are
> diagnosed are considered a fault.)

I seem to have many things wrong. Many things. :-(

> Note that in all of these fm[dump|stat] commands, you are really only
> looking at two sets of data: the errors - that is the telemetry
> incoming to FMA - and the faults.
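Two small sketches for digesting that telemetry, assuming the fmdump -e
layout shown above and that zio_err carries a plain errno value (which
is how ZFS reports I/O failures in its ereports):

```shell
# 1) Tally `fmdump -e` output by class (the last field on each line)
#    to see what a burst is made of; NR > 1 skips the TIME/CLASS header.
printf '%s\n' \
  'TIME CLASS' \
  'Mar 23 23:47:28.5586 ereport.fs.zfs.io' \
  'Mar 23 23:47:28.5594 ereport.fs.zfs.io' \
  'Mar 23 23:48:41.5957 ereport.fs.zfs.io' |
  awk 'NR > 1 { n[$NF]++ } END { for (c in n) print n[c], c }'
# prints: 3 ereport.fs.zfs.io

# 2) Decode zio_err = 6 as an errno: 6 is ENXIO ("No such device or
#    address"), which would fit a device that dropped off the fibre
#    channel fabric across the reboot.
python3 -c 'import errno, os; print(errno.errorcode[6], os.strerror(6))'
# prints: ENXIO No such device or address
```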
> If you include a -e, you view the errors; otherwise, you are looking
> at the faults.
>
> By the way - sun.com/msg has a great PDF on it about the predictive self
> healing technologies in Solaris 10 and will offer more interesting
> information.

I think I have seen it before; it is very "marketing" focused.

> Would be interesting to see *why* ZFS / FMA is feeling the need to fault
> your devices.

It is a pile of data that still leaves me wondering why, because I can
easily detach and reattach those disks and be back in business for
months with no issues.

> I was interested to see on one of my boxes that I have actually had a
> *lot* of errors, which I'm now going to have to investigate... Looks
> like I have a dud rocket in my system... :)

I probably have a dud in there also, but ZFS refuses to FAULT it while
under normal day-to-day load. That is what is very odd.

> Oh - And I saw this:
>
>   Nov 03 14:04:31.2783 ereport.fs.zfs.checksum
>
> Score one more for ZFS! This box has a measly 300GB mirrored, and I have
> already seen dud data. (heh... It's also got non-ecc memory... ;)

I don't think I have to worry about ECC memory on Sun hardware, but I am
getting concerned about those disks that I have. I am just waiting for a
FAULT that will not go away so easily.

Thanks for the reply and the helpful pointers. Do people on mailing
lists say "thank you" anymore? Well, I just did.

Dennis

_______________________________________________
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss