On Fri, Jan 8, 2010 at 5:28 AM, Frank Batschulat (Home)
<frank.batschu...@sun.com> wrote:
[snip]
> Hey Mike, you're not the only victim of these strange CHKSUM errors.  I hit
> the same during my slightly different testing, where I'm NFS mounting an
> entire, pre-existing remote file living in the zpool on the NFS server and
> using that to create a zpool and install zones into it.
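(For anyone following along at home: as I read it, Frank's repro boils down
to roughly the steps below.  I have not re-run these exact commands, and the
hostnames, backing file name, and zonepath are made up, so treat this as a
sketch rather than a recipe.)

    # On the NFS server: create a backing file inside a healthy local zpool
    # and share the directory it lives in.
    server# mkfile 20g /tank/images/nfszone.img
    server# share -F nfs -o rw /tank/images

    # On the client: mount the share, build a pool on top of the remote file,
    # then configure and install a zone whose zonepath lives in that pool.
    client# mount -F nfs server:/tank/images /mnt
    client# zpool create nfszone /mnt/nfszone.img
    client# zonecfg -z nfszone 'create; set zonepath=/nfszone/zone; commit'
    client# zoneadm -z nfszone install
    client# zoneadm -z nfszone boot

    # The pool stays healthy through create and install; right after the
    # zone boots, the CKSUM column starts climbing.
    client# zpool status -v nfszone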
What does your overall setup look like?  Mine is:

T5220 + Sun System Firmware 7.2.4.f 2009/11/05 18:21

Primary LDom
  Solaris 10u8
  Logical Domains Manager 1.2,REV=2009.06.25.09.48 + 142840-03

Guest Domain
  4 vcpus + 15 GB memory
  OpenSolaris snv_130  (this is where the problem is observed)

I've seen similar errors on Solaris 10 in the primary domain and on an
M4000.  Unfortunately Solaris 10 doesn't show the checksums in the ereport.
There I noticed a mixture of read errors and checksum errors - and lots
more of them.  This could be because the S10 zone was a full-root SUNWCXall
install compared to the much smaller default ipkg-branded zone.

On the primary domain running Solaris 10...

(this command was run some time ago)

primary-domain# zpool status myzone
  pool: myzone
 state: DEGRADED
status: One or more devices has experienced an unrecoverable error.  An
        attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
        using 'zpool clear' or replace the device with 'zpool replace'.
   see: http://www.sun.com/msg/ZFS-8000-9P
 scrub: none requested
config:

        NAME        STATE     READ WRITE CKSUM
        myzone      DEGRADED     0     0     0
          /foo/20g  DEGRADED  4.53K     0   671  too many errors

errors: No known data errors

(this was run today, many days after the previous command)

primary-domain# fmdump -eV | egrep zio_err | uniq -c | head
   1            zio_err = 5
   1            zio_err = 50
   1            zio_err = 5
   1            zio_err = 50
   1            zio_err = 5
   1            zio_err = 50
   2            zio_err = 5
   1            zio_err = 50
   3            zio_err = 5
   1            zio_err = 50

Note that even though I had thousands of read errors, the zone worked just
fine.  I would never have known (suspected?) there was a problem if I hadn't
run "zpool status" or the various FMA commands.

> I've filed today:
>
> 6915265 zpools on files (over NFS) accumulate CKSUM errors with no
> apparent reason

Thanks.  I'll open a support call to help get some funding on it...

> here's the relevant piece worth investigating out of it (leaving out the
> actual setup etc.)
>
> as in your case, creating the zpool and installing the zone into it still
> gives a healthy zpool, but immediately after booting the zone, the zpool
> served over NFS accumulated CHKSUM errors.
>
> of particular interest are the 'cksum_actual' values as reported by Mike
> for his test case here:
>
> http://www.mail-archive.com/zfs-discuss@opensolaris.org/msg33041.html
>
> if compared to the 'cksum_actual' values I got in the fmdump error output
> on my test case/system:
>
> note, the NFS server's zpool that is serving and sharing the file we use
> is healthy.
> zone halted now on my test system, and checking fmdump:
>
> osoldev.batschul./export/home/batschul.=> fmdump -eV | grep cksum_actual | sort | uniq -c | sort -n | tail
>        2 cksum_actual = 0x4bea1a77300 0xf6decb1097980 0x217874c80a8d9100 0x7cd81ca72df5ccc0
>        2 cksum_actual = 0x5c1c805253 0x26fa7270d8d2 0xda52e2079fd74 0x3d2827dd7ee4f21
>        6 cksum_actual = 0x28e08467900 0x479d57f76fc80 0x53bca4db5209300 0x983ddbb8c4590e40
> *A     6 cksum_actual = 0x348e6117700 0x765aa1a547b80 0xb1d6d98e59c3d00 0x89715e34fbf9cdc0
> *B     7 cksum_actual = 0x0 0x0 0x0 0x0
> *C    11 cksum_actual = 0x1184cb07d00 0xd2c5aab5fe80 0x69ef5922233f00 0x280934efa6d20f40
> *D    14 cksum_actual = 0x175bb95fc00 0x1767673c6fe00 0xfa9df17c835400 0x7e0aef335f0c7f00
> *E    17 cksum_actual = 0x2eb772bf800 0x5d8641385fc00 0x7cf15b214fea800 0xd4f1025a8e66fe00
> *F    20 cksum_actual = 0xbaddcafe00 0x5dcc54647f00 0x1f82a459c2aa00 0x7f84b11b3fc7f80
> *G    25 cksum_actual = 0x5d6ee57f00 0x178a70d27f80 0x3fc19c3a19500 0x82804bc6ebcfc0
>
> osoldev.root./export/home/batschul.=> zpool status -v
>   pool: nfszone
>  state: DEGRADED
> status: One or more devices has experienced an unrecoverable error.  An
>         attempt was made to correct the error.  Applications are unaffected.
> action: Determine if the device needs to be replaced, and clear the errors
>         using 'zpool clear' or replace the device with 'zpool replace'.
>    see: http://www.sun.com/msg/ZFS-8000-9P
>  scrub: none requested
> config:
>
>         NAME        STATE     READ WRITE CKSUM
>         nfszone     DEGRADED     0     0     0
>           /nfszone  DEGRADED     0     0   462  too many errors
>
> errors: No known data errors
>
> ==========================================================================
>
> now compare this with Mike's error output as posted here:
>
> http://www.mail-archive.com/zfs-discuss@opensolaris.org/msg33041.html
>
> # fmdump -eV | grep cksum_actual | sort | uniq -c | sort -n | tail
>
>        2 cksum_actual = 0x14c538b06b6 0x2bb571a06ddb0 0x3e05a7c4ac90c62 0x290cbce13fc59dce
> *D     3 cksum_actual = 0x175bb95fc00 0x1767673c6fe00 0xfa9df17c835400 0x7e0aef335f0c7f00
> *E     3 cksum_actual = 0x2eb772bf800 0x5d8641385fc00 0x7cf15b214fea800 0xd4f1025a8e66fe00
> *B     4 cksum_actual = 0x0 0x0 0x0 0x0
>        4 cksum_actual = 0x1d32a7b7b00 0x248deaf977d80 0x1e8ea26c8a2e900 0x330107da7c4bcec0
>        5 cksum_actual = 0x14b8f7afe6 0x915db8d7f87 0x205dc7979ad73 0x4e0b3a8747b8a8
> *C     6 cksum_actual = 0x1184cb07d00 0xd2c5aab5fe80 0x69ef5922233f00 0x280934efa6d20f40
> *A     6 cksum_actual = 0x348e6117700 0x765aa1a547b80 0xb1d6d98e59c3d00 0x89715e34fbf9cdc0
> *F    16 cksum_actual = 0xbaddcafe00 0x5dcc54647f00 0x1f82a459c2aa00 0x7f84b11b3fc7f80
> *G    48 cksum_actual = 0x5d6ee57f00 0x178a70d27f80 0x3fc19c3a19500 0x82804bc6ebcfc0
>
> and observe that the 'cksum_actual' values that eventually cause our CHKSUM
> pool errors (because they mismatch what was expected) are the SAME for two
> totally different client systems and two different NFS servers (mine vs.
> Mike's) - see the entries marked *A to *G.
>
> This just can't be an accident, and thus there's a good chance that these
> CHKSUM errors have a common source, either in ZFS or in NFS?

You saved me so much time with this observation.  Thank you!  Below my
signature I've put a rough sketch of how the comparison could be scripted,
in case anyone wants to repeat it against their own pools.

--
Mike Gerdts
http://mgerdts.blogspot.com/
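P.S.  A rough sketch of how Frank's comparison could be repeated mechanically.
It assumes you have saved "fmdump -eV" output from each system into files on
one host; the file names below are placeholders, and the one-liners are only
meant to show the basic idea.

    # Reduce each system's cksum_actual lines to just the four checksum
    # words, de-duplicate them, then print only the values that show up
    # on BOTH systems.
    grep cksum_actual /tmp/fmdump-system1.out | \
        awk '{ $1 = ""; $2 = ""; print }' | sort -u > /tmp/cksums.1
    grep cksum_actual /tmp/fmdump-system2.out | \
        awk '{ $1 = ""; $2 = ""; print }' | sort -u > /tmp/cksums.2
    comm -12 /tmp/cksums.1 /tmp/cksums.2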