On Mar 21, 2010, at 11:03 AM, Frank Middleton wrote:
> On 03/15/10 01:01 PM, David Dyer-Bennet wrote:
> 
>> This sounds really bizarre.
> 
> Yes, it is. But CR 6880994 is bizarre too.

Rolling back to a conversation with Frank last fall, here is the output
of fmdump which shows the single bit flip. Extra lines elided.

TIME                           CLASS
Oct 23 2009 14:53:01.525657508 ereport.fs.zfs.checksum
        class = ereport.fs.zfs.checksum
        pool = rpool
        vdev_guid = 0x509094f6dc795c97
        vdev_type = disk
        vdev_path = /dev/dsk/c3d0s0
        vdev_devid = id1,c...@amaxtor_6y080l0=y32he6xe/a
        parent_guid = 0x323cf9d672c3b05a
        parent_type = mirror
        zio_err = 50
        zio_offset = 0x50384800
        zio_size = 0x9800
        zio_objset = 0x29
        zio_object = 0x1a209
        zio_level = 0
        zio_blkid = 0x0
        cksum_expected = 0x4a027c11b3ba4cec 0xbf274565d5615b7b 0x3ef5fe61b2ed672e 0xec8692f7fd33094a
        cksum_actual = 0x4a027c11b3ba4cec 0xbf274567d5615b7b 0x3ef5fe61b2ed672e 0xec86a5b3fd33094a
        cksum_algorithm = fletcher2
        bad_ranges = 0x228 0x230
        bad_ranges_min_gap = 0x8
        bad_range_sets = 0x1
        bad_range_clears = 0x0
        bad_set_bits = 0x0 0x0 0x0 0x0 0x2 0x0 0x0 0x0
        bad_cleared_bits = 0x0 0x0 0x0 0x0 0x0 0x0 0x0 0x0

Here we see that one bit was set: 0x2 where we expected 0x0.
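
To make the reading of those fields concrete, here is a minimal sketch in
Python. It assumes that each entry in bad_set_bits (and bad_cleared_bits)
describes one byte, starting at the first offset listed in bad_ranges; that
interpretation is my assumption for illustration, and the helper name
flipped_bits is hypothetical, not part of any ZFS or FMA tool.

```python
def flipped_bits(range_start, bit_masks):
    """Return (byte_offset, bit_index) for every bit set in bit_masks.

    Assumption: bit_masks[i] describes the byte at range_start + i.
    """
    hits = []
    for i, mask in enumerate(bit_masks):
        for bit in range(8):
            if mask & (1 << bit):
                hits.append((range_start + i, bit))
    return hits


# Values copied from the ereport above.
bad_ranges = [0x228, 0x230]
bad_set_bits = [0x0, 0x0, 0x0, 0x0, 0x2, 0x0, 0x0, 0x0]
bad_cleared_bits = [0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0]

set_hits = flipped_bits(bad_ranges[0], bad_set_bits)
cleared_hits = flipped_bits(bad_ranges[0], bad_cleared_bits)

# One bit unexpectedly set, none unexpectedly cleared.
print([(hex(offset), bit) for offset, bit in set_hits])
print(cleared_hits)
```

Under that assumption, the single 0x2 entry corresponds to exactly one
unexpectedly set bit within the 8-byte bad range, and no cleared bits.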
Later that same day...

Oct 23 2009 14:53:01.525657152 ereport.fs.zfs.checksum
        class = ereport.fs.zfs.checksum
        pool = rpool
        pool_guid = 0x5062a7a7247652b1
        vdev_guid = 0x1181c8516c0dc9b0
        vdev_type = disk
        vdev_path = /dev/dsk/c3d1s0
        vdev_devid = id1,c...@awdc_wd800bb-00bsa0=wd-wma6s1025599/a
        parent_guid = 0x323cf9d672c3b05a
        parent_type = mirror
        zio_err = 50
        zio_offset = 0x50384800
        zio_size = 0x9800
        zio_objset = 0x29
        zio_object = 0x1a209
        zio_level = 0
        zio_blkid = 0x0
        cksum_expected = 0x4a027c11b3ba4cec 0xbf274565d5615b7b 0x3ef5fe61b2ed672e 0xec8692f7fd33094a
        cksum_actual = 0x4a027c11b3ba4cec 0xbf274567d5615b7b 0x3ef5fe61b2ed672e 0xec86a5b3fd33094a
        cksum_algorithm = fletcher2
        bad_ranges = 0x228 0x230
        bad_ranges_min_gap = 0x8
        bad_range_sets = 0x1
        bad_range_clears = 0x0
        bad_set_bits = 0x0 0x0 0x0 0x0 0x2 0x0 0x0 0x0
        bad_cleared_bits = 0x0 0x0 0x0 0x0 0x0 0x0 0x0 0x0

So we see the exact same bit flipped (0x2 expecting 0x0) on two different
disks, /dev/dsk/c3d0s0 (Maxtor) and /dev/dsk/c3d1s0 (Western Digital), at 
the same zio offset and size.
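
The comparison can be done mechanically rather than by eye. The following is
only a sketch: it parses "name = value" lines like the ones fmdump printed
above and lists the fields that differ between two reports. The function
names parse_ereport and differing_fields are hypothetical, and the two
records are abbreviated copies of the output above.

```python
def parse_ereport(text):
    """Turn 'name = value' lines into a dict; other lines are ignored."""
    fields = {}
    for line in text.splitlines():
        if "=" in line:
            key, _, value = line.partition("=")
            fields[key.strip()] = value.strip()
    return fields


def differing_fields(a, b):
    """Names of fields present in both reports whose values differ."""
    return sorted(k for k in a.keys() & b.keys() if a[k] != b[k])


report_maxtor = parse_ereport("""
        vdev_path = /dev/dsk/c3d0s0
        zio_offset = 0x50384800
        zio_size = 0x9800
        zio_object = 0x1a209
        bad_ranges = 0x228 0x230
        bad_set_bits = 0x0 0x0 0x0 0x0 0x2 0x0 0x0 0x0
""")

report_wd = parse_ereport("""
        vdev_path = /dev/dsk/c3d1s0
        zio_offset = 0x50384800
        zio_size = 0x9800
        zio_object = 0x1a209
        bad_ranges = 0x228 0x230
        bad_set_bits = 0x0 0x0 0x0 0x0 0x2 0x0 0x0 0x0
""")

# Only the device path should differ between the two mirror halves.
print(differing_fields(report_maxtor, report_wd))
```

For these abbreviated records, the only field that differs is vdev_path,
which is the point: same object, same offset, same bit, different disk.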

I feel confident we are not seeing a b0rken drive here.  But something is
clearly amiss and we cannot rule out the processor, memory, or controller.
Frank reports that he sees this on the same file, /lib/libdlpi.so.1, so I'll go
out on a limb and speculate that there is something in the bit pattern for that
file that intermittently triggers a bit flip on this system. I'll also speculate
that this error will not be reproducible on another system.

This sort of specific error analysis is possible after b125. See CR 6867188
for more details.
 -- richard

ZFS storage and performance consulting at http://www.RichardElling.com
ZFS training on deduplication, NexentaStor, and NAS performance
Las Vegas, April 29-30, 2010 http://nexenta-vegas.eventbrite.com 
