> Hey, Dennis -
>
> I can't help but wonder if the failure is a result of zfs itself finding
> some problems post restart...

Yes, yes, this is my feeling as well, but I need to find the data before I
can sleep at night.  I am certain that ZFS does not toss out faults on a
whim; there must be a deterministic, logical, code-based reason for the
faults that occur *after* I go to init 3.

> Is there anything in your FMA logs?

Oh God yes,  brace yourself :-)

http://www.blastwave.org/dclarke/zfs/fmstat.txt

[ I edited the whitespace here for clarity ]
# fmstat
module      ev_recv ev_acpt wait  svc_t  %w  %b  open solve  memsz  bufsz
cpumem-diagnosis   0   0  0.0      2.7   0   0   3     0   4.2K   1.1K
cpumem-retire      0   0  0.0      0.2   0   0   0     0      0      0
disk-transport     0   0  0.0     45.7   0   0   0     0    40b      0
eft                0   0  0.0      0.7   0   0   0     0   1.2M      0
fabric-xlate       0   0  0.0      0.7   0   0   0     0      0      0
fmd-self-diagnosis 3   0  0.0      0.2   0   0   0     0      0      0
io-retire          0   0  0.0      0.2   0   0   0     0      0      0
snmp-trapgen       2   0  0.0      1.7   0   0   0     0    32b      0
sysevent-transport 0   0  0.0     75.4   0   0   0     0      0      0
syslog-msgs        2   0  0.0      1.4   0   0   0     0      0      0
zfs-diagnosis    296 252  2.0 236719.7  98   0   1     2   176b   144b
zfs-retire         4   0  0.0     27.4   0   0   0     0      0      0

 zfs-diagnosis svc_t=236719.7 ?

> for a summary and
>
>    fmdump
>
> for a summary of the related errors

http://www.blastwave.org/dclarke/zfs/fmdump.txt

# fmdump
TIME                 UUID                                 SUNW-MSG-ID
Dec 05 21:31:46.1069 aa3bfcfa-3261-cde4-d381-dae8abf296de ZFS-8000-D3
Mar 07 08:46:43.6238 4c8b199b-add1-c3fe-c8d6-9deeff91d9de ZFS-8000-FD
Mar 07 19:37:27.9819 b4824ce2-8f42-4392-c7bc-ab2e9d14b3b7 ZFS-8000-FD
Mar 07 19:37:29.8712 af726218-f1dc-6447-f581-cc6bb1411aa4 ZFS-8000-FD
Mar 07 19:37:30.2302 58c9e01f-8a80-61b0-ffea-ded63a9b076d ZFS-8000-FD
Mar 07 19:37:31.6410 3b0bfd9d-fc39-e7c2-c8bd-879cad9e5149 ZFS-8000-FD
Mar 10 19:37:08.8289 aa3bfcfa-3261-cde4-d381-dae8abf296de FMD-8000-4M Repaired
Mar 23 23:47:36.9701 2b1aa4ae-60e4-c8ef-8eec-d92a18193e7a ZFS-8000-FD
Mar 24 01:29:00.1981 3780a2dd-7381-c053-e186-8112b463c2b7 ZFS-8000-FD
Mar 24 01:29:02.1649 146dad1d-f195-c2d6-c630-c1adcd58b288 ZFS-8000-FD

# fmdump -vu 3780a2dd-7381-c053-e186-8112b463c2b7
TIME                 UUID                                 SUNW-MSG-ID
Mar 24 01:29:00.1981 3780a2dd-7381-c053-e186-8112b463c2b7 ZFS-8000-FD
  100%  fault.fs.zfs.vdev.io

        Problem in: zfs://pool=fibre0/vdev=444604062b426970
           Affects: zfs://pool=fibre0/vdev=444604062b426970
               FRU: -
          Location: -

# fmdump -vu 146dad1d-f195-c2d6-c630-c1adcd58b288
TIME                 UUID                                 SUNW-MSG-ID
Mar 24 01:29:02.1649 146dad1d-f195-c2d6-c630-c1adcd58b288 ZFS-8000-FD
  100%  fault.fs.zfs.vdev.io

        Problem in: zfs://pool=fibre0/vdev=23e4d7426f941f52
           Affects: zfs://pool=fibre0/vdev=23e4d7426f941f52
               FRU: -
          Location: -
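Those `vdev=...` identifiers are hex GUIDs, while tools such as `zdb -l`
typically print pool and vdev GUIDs in decimal. A quick conversion helps
when cross-referencing (a sketch; the GUID is just the one from the fault
report above, and `printf` is assumed to handle 64-bit hex, as the bash/ksh
builtins do):

```shell
# Convert the hex vdev GUID from the fault report into the decimal
# form that zdb(1M) label dumps print, for cross-referencing against
# the pool configuration.
printf '%d\n' 0x444604062b426970    # -> 4919624067490933104
```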

> will show more and more information about the error. Note that some of
> it might seem like rubbish. The important bits should be obvious though
> - things like the SUNW error message is (like ZFS-8000-D3), which can be
> pumped into
>
>    sun.com/msg

like so:

http://www.sun.com/msg/ZFS-8000-FD

or see http://www.blastwave.org/dclarke/zfs/ZFS-8000-FD.txt

        Article for Message ID:   ZFS-8000-FD

      Too many I/O errors on ZFS device

      Type

         Fault

      Severity

         Major

      Description

         The number of I/O errors associated with a ZFS device exceeded
         acceptable levels.

      Automated Response

         The device has been offlined and marked as faulted.
         An attempt will be made to activate a hot spare if available.

      Impact

         The fault tolerance of the pool may be affected.


Yep, I agree, that is what I saw.

> Note also that there should also be something interesting in the
> /var/adm/messages log to match any 'faulted' devices.
>
> You might also find an
>
>    fmdump -e

a spooky long list of events:

TIME                 CLASS
Mar 23 23:47:28.5586 ereport.fs.zfs.io
Mar 23 23:47:28.5594 ereport.fs.zfs.io
Mar 23 23:47:28.5588 ereport.fs.zfs.io
Mar 23 23:47:28.5592 ereport.fs.zfs.io
Mar 23 23:47:28.5593 ereport.fs.zfs.io
.
.
.
Mar 23 23:47:28.5622 ereport.fs.zfs.io
Mar 23 23:47:28.5560 ereport.fs.zfs.io
Mar 23 23:47:28.5658 ereport.fs.zfs.io
Mar 23 23:48:41.5957 ereport.fs.zfs.io


   http://www.blastwave.org/dclarke/zfs/fmdump_e.txt

Ouch, that is a nasty long list, all within a few seconds.
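With a burst that dense, a quick tally is easier to read than the raw list.
A sketch, assuming the `fmdump -e` column layout shown above (the class is
the fourth whitespace-separated field); the sample function stands in for
real output, which on the live system you would get from `fmdump -e` itself:

```shell
# Stand-in for `fmdump -e` output (header line already stripped).
fmdump_e_sample() {
cat <<'EOF'
Mar 23 23:47:28.5586 ereport.fs.zfs.io
Mar 23 23:47:28.5594 ereport.fs.zfs.io
Mar 23 23:48:41.5957 ereport.fs.zfs.io
EOF
}

# Count events per class (class is field 4).
fmdump_e_sample | awk '{ print $4 }' | sort | uniq -c

# Count events per second, dropping the fractional timestamp,
# to see how tightly the burst is clustered.
fmdump_e_sample | awk '{ split($3, t, "."); print $1, $2, t[1] }' | sort | uniq -c
```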

> and
>
>    fmdump -eV

a very detailed, verbose list with entries such as:

Mar 23 2009 23:48:41.595757900 ereport.fs.zfs.io
nvlist version: 0
        class = ereport.fs.zfs.io
        ena = 0x79c098255f400c01
        detector = (embedded nvlist)
        nvlist version: 0
                version = 0x0
                scheme = zfs
                pool = 0xe3bb9417bc13c68d
                vdev = 0x444604062b426970
        (end detector)

        pool = fibre0
        pool_guid = 0xe3bb9417bc13c68d
        pool_context = 0
        pool_failmode = wait
        vdev_guid = 0x444604062b426970
        vdev_type = disk
        vdev_path = /dev/dsk/c2t17d0s0
        vdev_devid = id1,s...@n20000018625d599d/a
        parent_guid = 0x2cc7f46f722cfd61
        parent_type = mirror
        zio_err = 6
        zio_offset = 0xf97ebf400
        zio_size = 0x1400
        __ttl = 0x1
        __tod = 0x49c81fd9 0x23828b4c
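Since every ereport in the verbose dump carries one `vdev_path = ...` line,
counting those lines shows which disks the errors cluster on. A sketch,
assuming the nvlist field layout shown above; the sample function stands in
for a real `fmdump -eV` pipe:

```shell
# Stand-in for a few lines of `fmdump -eV` output.
fmdump_eV_sample() {
cat <<'EOF'
        vdev_path = /dev/dsk/c2t17d0s0
        zio_err = 6
        vdev_path = /dev/dsk/c2t17d0s0
        vdev_path = /dev/dsk/c2t18d0s0
EOF
}

# Tally error reports per device path; worst offender first.
fmdump_eV_sample |
awk '$1 == "vdev_path" { count[$3]++ }
     END { for (d in count) print count[d], d }' | sort -rn
```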

> to be interesting - This is the *error* log as opposed to the *fault*
> log. (Every 'thing that goes wrong' is an error, only those that are
> diagnosed are considered a fault.)

I seem to have many things wrong. Many things. :-(

> Note that in all of these fm[dump|stat] commands, you are really only
> looking at the two sets of data. The errors - that is the telemetry
> incoming to FMA - and the faults. If you include a -e, you view the
> errors, otherwise, you are looking at the faults.
>
> By the way - sun.com/msg has a great PDF on it about the predictive self
> healing technologies in Solaris 10 and will offer more interesting
> information.

I think I have seen it before; it is very "marketing" focused.

>
> Would be interesting to see *why* ZFS / FMA is feeling the need to fault
> your devices.

It is a pile of data that still leaves me wondering why as well, because I
can easily detach and reattach those disks and be back in business for
months with no issues.

> I was interested to see on one of my boxes that I have actually had a
> *lot* of errors, which I'm now going to have to investigate... Looks
> like I have a dud rocket in my system... :)

I probably have a dud in there also, but ZFS refuses to FAULT it while
under normal day-to-day load. That is what is so odd.

>
> Oh - And I saw this:
>
> Nov 03 14:04:31.2783 ereport.fs.zfs.checksum
>
> Score one more for ZFS! This box has a measly 300GB mirrored, and I have
> already seen dud data. (heh... It's also got non-ecc memory... ;)

I don't think I have to worry about ECC memory on Sun hardware, but I am
getting concerned about those disks. I am just waiting for a FAULT that
will not go away so easily.

Thanks for the reply and the helpful pointers. Do people on mailing lists
say "thank you" anymore? Well, I just did.

Dennis


_______________________________________________
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
