On Wed, 23 Dec 2009 03:02:47 +0100, Mike Gerdts <mger...@gmail.com> wrote:
> I've been playing around with zones on NFS a bit and have run into
> what looks to be a pretty bad snag - ZFS keeps seeing read and/or
> checksum errors. This exists with S10u8 and OpenSolaris dev build
> snv_129. This is likely a blocker for anything thinking of
> implementing parts of Ed's Zones on Shared Storage:
>
> http://hub.opensolaris.org/bin/view/Community+Group+zones/zoss
>
> The OpenSolaris example appears below. The order of events is:
>
> 1) Create a file on NFS, turn it into a zpool
> 2) Configure a zone with the pool as zonepath
> 3) Install the zone, verify that the pool is healthy
> 4) Boot the zone, observe that the pool is sick
[...]
> r...@soltrain19# zoneadm -z osol boot
>
> r...@soltrain19# zpool status osol
>   pool: osol
>  state: DEGRADED
> status: One or more devices has experienced an unrecoverable error.  An
>         attempt was made to correct the error.  Applications are unaffected.
> action: Determine if the device needs to be replaced, and clear the errors
>         using 'zpool clear' or replace the device with 'zpool replace'.
>    see: http://www.sun.com/msg/ZFS-8000-9P
>  scrub: none requested
> config:
>
>         NAME                  STATE     READ WRITE CKSUM
>         osol                  DEGRADED     0     0     0
>           /mnt/osolzone/root  DEGRADED     0     0   117  too many errors
>
> errors: No known data errors

Hey Mike,

you're not the only victim of these strange CKSUM errors; I hit the same during my slightly different testing, where I'm NFS-mounting an entire, pre-existing remote file that lives in a zpool on the NFS server, and using that file to create a zpool and install zones into it.

I've filed today:

    6915265 zpools on files (over NFS) accumulate CKSUM errors with no apparent reason

Here's the relevant piece worth investigating out of it (leaving out the actual setup etc.). As in your case, creating the zpool and installing the zone into it still gives a healthy zpool, but immediately after booting the zone, the zpool served over NFS accumulated CKSUM errors.

Of particular interest are the 'cksum_actual' values as reported by Mike for his test case here:

http://www.mail-archive.com/zfs-discuss@opensolaris.org/msg33041.html

compared to the 'cksum_actual' values I got in the fmdump error output on my own test case/system.

Note: the NFS server's zpool that serves and shares the file we use is healthy.
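For context, the overall shape of the reproduction looks roughly like this (just a sketch; the mount point, file size and the combination of pool/zone names below are illustrative, not the exact ones from either of our setups):

    # the backing store is a plain file sitting on an NFS mount (hypothetical paths)
    mkdir -p /mnt/nfszone
    mount -F nfs server:/export/zones /mnt/nfszone
    mkfile 8g /mnt/nfszone/root             # or reuse a pre-existing remote file, as in my test
    zpool create nfszone /mnt/nfszone/root  # single file vdev over NFS

    # put a zone on the new pool
    zonecfg -z osol "create; set zonepath=/nfszone/osol"
    zoneadm -z osol install
    zpool status nfszone                    # still healthy at this point

    # booting the zone is what triggers the errors
    zoneadm -z osol boot
    zpool status nfszone                    # now DEGRADED, CKSUM count climbing
    fmdump -eV | grep cksum_actual | sort | uniq -c | sort -n | tail

The only unusual thing about the pool is that its single vdev is a file served over NFS rather than a real device.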
With the zone now halted on my test system, checking fmdump:

osoldev.batschul./export/home/batschul.=> fmdump -eV | grep cksum_actual | sort | uniq -c | sort -n | tail
        2 cksum_actual = 0x4bea1a77300 0xf6decb1097980 0x217874c80a8d9100 0x7cd81ca72df5ccc0
        2 cksum_actual = 0x5c1c805253 0x26fa7270d8d2 0xda52e2079fd74 0x3d2827dd7ee4f21
        6 cksum_actual = 0x28e08467900 0x479d57f76fc80 0x53bca4db5209300 0x983ddbb8c4590e40
*A      6 cksum_actual = 0x348e6117700 0x765aa1a547b80 0xb1d6d98e59c3d00 0x89715e34fbf9cdc0
*B      7 cksum_actual = 0x0 0x0 0x0 0x0
*C     11 cksum_actual = 0x1184cb07d00 0xd2c5aab5fe80 0x69ef5922233f00 0x280934efa6d20f40
*D     14 cksum_actual = 0x175bb95fc00 0x1767673c6fe00 0xfa9df17c835400 0x7e0aef335f0c7f00
*E     17 cksum_actual = 0x2eb772bf800 0x5d8641385fc00 0x7cf15b214fea800 0xd4f1025a8e66fe00
*F     20 cksum_actual = 0xbaddcafe00 0x5dcc54647f00 0x1f82a459c2aa00 0x7f84b11b3fc7f80
*G     25 cksum_actual = 0x5d6ee57f00 0x178a70d27f80 0x3fc19c3a19500 0x82804bc6ebcfc0

osoldev.root./export/home/batschul.=> zpool status -v
  pool: nfszone
 state: DEGRADED
status: One or more devices has experienced an unrecoverable error.  An
        attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
        using 'zpool clear' or replace the device with 'zpool replace'.
   see: http://www.sun.com/msg/ZFS-8000-9P
 scrub: none requested
config:

        NAME        STATE     READ WRITE CKSUM
        nfszone     DEGRADED     0     0     0
          /nfszone  DEGRADED     0     0   462  too many errors

errors: No known data errors

==========================================================================

Now compare this with Mike's error output as posted here:

http://www.mail-archive.com/zfs-discuss@opensolaris.org/msg33041.html

# fmdump -eV | grep cksum_actual | sort | uniq -c | sort -n | tail
        2 cksum_actual = 0x14c538b06b6 0x2bb571a06ddb0 0x3e05a7c4ac90c62 0x290cbce13fc59dce
*D      3 cksum_actual = 0x175bb95fc00 0x1767673c6fe00 0xfa9df17c835400 0x7e0aef335f0c7f00
*E      3 cksum_actual = 0x2eb772bf800 0x5d8641385fc00 0x7cf15b214fea800 0xd4f1025a8e66fe00
*B      4 cksum_actual = 0x0 0x0 0x0 0x0
        4 cksum_actual = 0x1d32a7b7b00 0x248deaf977d80 0x1e8ea26c8a2e900 0x330107da7c4bcec0
        5 cksum_actual = 0x14b8f7afe6 0x915db8d7f87 0x205dc7979ad73 0x4e0b3a8747b8a8
*C      6 cksum_actual = 0x1184cb07d00 0xd2c5aab5fe80 0x69ef5922233f00 0x280934efa6d20f40
*A      6 cksum_actual = 0x348e6117700 0x765aa1a547b80 0xb1d6d98e59c3d00 0x89715e34fbf9cdc0
*F     16 cksum_actual = 0xbaddcafe00 0x5dcc54647f00 0x1f82a459c2aa00 0x7f84b11b3fc7f80
*G     48 cksum_actual = 0x5d6ee57f00 0x178a70d27f80 0x3fc19c3a19500 0x82804bc6ebcfc0

and observe that the 'cksum_actual' values which end up causing our CKSUM pool errors (because they do not match the checksums that were expected) are the SAME for two totally different client systems and two different NFS servers (mine vs. Mike's); see the entries marked *A through *G.

This just can't be an accident or a coincidence, so there's a good chance that these CKSUM errors have a common source, either in ZFS or in NFS?

cheers
frankB

_______________________________________________
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss