A customer asked whether there is any flaw with the process below:

Sun Cluster, with each zpool composed of a single LUN (yes, they have been
advised to use a redundant configuration instead). They do not export the
pool to another host; instead they use BCV to make a mirror of the LUN.
They then split the mirror and import the LUN/zpool onto a machine that is
not even part of the cluster - the backup server.

Most of the time the import works, but roughly 10-15% of the time it panics
the system with a bad checksum. The customer does this procedure on 9 LUNs,
twice a day. They have been doing the same thing with vxfs/VxVM for some
time without any issue.

They were recommended to run a scrub on a regular basis (a minimal
scheduling sketch follows the list below). I have also provided a list of
things to check that have the potential to cause checksum errors:

1- Exporting the same LUN to two different hosts and creating a zpool on
it. I have seen this at one customer site where one host had a UFS file
system on the same LUN that another host was using in its zpool.

2- Accessing a LUN that is under ZFS control by other means (e.g.
dd of=/dev/..emcpower11c) can corrupt data.

3- Mistakenly adding the same device to the zpool under different names.
EMC PowerPath and Sun multipathing can present multiple device names that
point to the same underlying device.

4- Importing the pool on another host without exporting it first (see the
sketch after this list).

5- Bad hardware, or storage/controller bugs.

6- ZFS is not cluster aware, which means one should use clustering software
when sharing a zpool across multiple hosts. A poor man's cluster is not
supported!

7- LUNs exported to ZFS that are built on RAID-5. See these URLs about
RAID-5 issues:

            http://blogs.sun.com/bonwick/entry/raid_z
            http://www.miracleas.com/BAARF/RAID5_versus_RAID10.txt
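
As recommended above, a regular scrub surfaces latent checksum errors before
a BCV split can propagate them. A minimal sketch, assuming the pool name
sapcrp from the crash dump below and a weekly schedule (adjust both as
needed):

    # start a scrub on the imported pool, then check the results later
    zpool scrub sapcrp
    zpool status -v sapcrp

    # example root crontab entry: scrub every Sunday at 02:00
    0 2 * * 0 /usr/sbin/zpool scrub sapcrp

Note that on a pool with no ZFS-level redundancy a scrub can mostly only
detect bad blocks; self-healing of data requires a mirror or raidz layout.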
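
Regarding item 4: the cleanest flow is to export the pool before the BCV
split and to import the split copy explicitly on the backup server. This is
a hedged sketch only - the pool handling is assumed, not taken from the
customer's actual runbook:

    ## on the production cluster node, if the application can be quiesced:
    zpool export sapcrp          # flush and cleanly close the pool
    ## ... establish and split the BCV mirror with the EMC tools ...
    zpool import sapcrp          # resume production use

    ## on the backup server, against the split copy:
    zpool import                 # list pools visible on the split LUN
    zpool import sapcrp bkpcrp   # import, optionally under a new name

If the split is taken while the pool is still imported on the production
node, the copy still looks in use and needs zpool import -f on the backup
server, which is one of the scenarios in the checklist above.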


Consider reading this article from Jeff Bonwick:

     http://blogs.sun.com/bonwick/entry/zfs_end_to_end_data

Amer.

Analysis:


SolarisCAT(vmcore.1/10V)> stat

core file:      /cores/dir31/66015657/vmcore.1
user:           Cores User (cores:911)
release:        5.10 (64-bit)
version:        Generic_127127-11
machine:        sun4v
node name:      bansai
domain:         gov.edmonton.ab.ca
hw_provider:    Sun_Microsystems
system type:    SUNW,SPARC-Enterprise-T5220 (UltraSPARC-T2)
hostid:         84ac5f08
dump_conflags:  0x10000 (DUMP_KERNEL) on /dev/dsk/c1t0d0s1(62.8G)
time of crash:  Sat Jul 19 22:42:55 MDT 2008 (core is 33 days old)
age of system:  1 days 6 hours 4 minutes 47.86 seconds
panic CPU:      56 (64 CPUs, 31.8G memory)
panic string:   ZFS: bad checksum (read on <unknown> off 0: zio 300743cec40
                [L0 SPA space map] 1000L/a00P DVA[0]=<0:484a15000:a00>
                DVA[1]=<0:1df9054a00:a00> fletcher4 lzjb BE contiguous
                birth=17010763 fill=1 cksum=8


!zio involved:

SolarisCAT(vmcore.1/10V)> sdump 300743cec40 zio_t io_spa,io_type,io_error
    io_spa = 0x300ee32c4c0
    io_type = 1 (ZIO_TYPE_READ)
    io_error = 0x32  <<<

!zpool that had blocks with checksum errors:

A block read on the file system backed by ZFS pool "sapcrp" had a checksum
error. The zio involved had io_error 50 (0x32 hex).

#define EBADE 50 /* invalid exchange */

ZIO checksum errors (ECKSUM) are reported as the EBADE errno.

src code:
"
/*
 * We'll take the unused errno 'EBADE' (from the Convergent graveyard)
 * to indicate checksum errors.
 */
#define ECKSUM  EBADE  <<
"

Thanks to ZFS's end-to-end checksumming, the checksum of the data being
read is computed and compared to the stored value; the two should match if
the data is good. Since the checksums differed, ZFS concluded that the data
is corrupted.

If the storage pool had been set up in a ZFS-redundant configuration
(mirroring or raidz), then ZFS could have gone to the mirror/parity, read a
good value, and self-corrected (healed) the other side of the mirror.

Unfortunately, the pool is configured in a non-redundant fashion as far as
ZFS is concerned. With no redundant configuration (mirror or raidz), the
checksum error left no good copy of the data, and that resulted in the
panic. With multiple vdevs configured, ZFS can heal the data by reading a
replica with a good checksum.
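
For illustration, a minimal ZFS-redundant layout over two separate LUNs
(device names are placeholders, not the customer's):

    ## mirror two LUNs so every block has a second copy under ZFS control
    zpool create sapcrp mirror c6t<LUN_A>d0 c6t<LUN_B>d0

    ## a checksum error on one side is then repaired from the other copy,
    ## and the repair shows up in the CKSUM counters of:
    zpool status -v sapcrp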

ZFS pool version 2 added metadata replication. Thus multiple vdevs with
raidz(2) or mirror are more resilient to these failures, because the
metadata can be replicated across them.

Also, a pool can contain multiple raidz or raidz2 vdevs by striping across
raidz vdev groups. One can create 4 raidz groups out of 16 drives and then
stripe across the four raidz groups. Each raidz group can handle one
checksum error or one disk failure, which means the four raidz groups
together can handle 4 such errors (one per group). With striping we are
also increasing I/O bandwidth.
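
A sketch of that 16-drive layout (the pool and drive names c1t0d0..c1t15d0
are hypothetical):

    ## four raidz groups of four drives each; ZFS dynamically stripes
    ## writes across the four top-level raidz vdevs
    zpool create tank \
        raidz c1t0d0  c1t1d0  c1t2d0  c1t3d0  \
        raidz c1t4d0  c1t5d0  c1t6d0  c1t7d0  \
        raidz c1t8d0  c1t9d0  c1t10d0 c1t11d0 \
        raidz c1t12d0 c1t13d0 c1t14d0 c1t15d0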

There is no replication with hardware RAID LUNs (EMC) because only one vdev
is exported to ZFS. It is recommended, if possible, to create multiple
simple hardware LUNs, export them to ZFS, and then configure ZFS to create
raidz groups and stripe across those groups. With this strategy you get the
benefit of hardware RAID boxes providing large caches for faster updates,
plus ZFS's ability to heal data on the fly with multiple vdevs under its
control.

 > 0x300ee32c4c0::spa
ADDR                 STATE NAME
00000300ee32c4c0    ACTIVE sapcrp
 > 0x300ee32c4c0::spa -v
ADDR                 STATE NAME
00000300ee32c4c0    ACTIVE sapcrp

     ADDR             STATE     AUX          DESCRIPTION
     000006005ebcdac0 HEALTHY   -
     0000030015400fc0 HEALTHY   -            /dev/dsk/c6t6006048000018772084654574F333445d0s0

 > 0x300ee32c4c0::spa -cv
ADDR                 STATE NAME
00000300ee32c4c0    ACTIVE sapcrp

     (none)

     ADDR             STATE     AUX          DESCRIPTION
     000006005ebcdac0 HEALTHY   -
     0000030015400fc0 HEALTHY   -            /dev/dsk/c6t6006048000018772084654574F333445d0s0

 > 0x300ee32c4c0::spa -e
ADDR                 STATE NAME
00000300ee32c4c0    ACTIVE sapcrp

     ADDR             STATE     AUX          DESCRIPTION
     000006005ebcdac0 HEALTHY   -

                        READ        WRITE         FREE        CLAIM        IOCTL
         OPS               0            0            0            0            0
         BYTES             0            0            0            0            0
         EREAD             0
         EWRITE            0
         ECKSUM            0

     0000030015400fc0 HEALTHY   -            /dev/dsk/c6t6006048000018772084654574F333445d0s0

                        READ        WRITE         FREE        CLAIM        IOCTL
         OPS            0x88         0x18            0            0            0
         BYTES      0x841c00      0x10a00            0            0            0
         EREAD             0
         EWRITE            0
         ECKSUM          0x4

Device:  c6t6006048000018772084654574F333445d0s0 ->
         ../../devices/scsi_vhci/s[EMAIL PROTECTED]
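
On a running system, the per-vdev error counters shown by ::spa -e above
appear in zpool status; the output has roughly this shape (illustrative
only, not captured from this system):

    # zpool status -v sapcrp
      pool: sapcrp
     state: ONLINE
    status: One or more devices has experienced an unrecoverable error.
    config:

        NAME                                      STATE     READ WRITE CKSUM
        sapcrp                                    ONLINE       0     0     4
          c6t6006048000018772084654574F333445d0   ONLINE       0     0     4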
